Wednesday, December 30, 2015

Lego Pricing: For which partnerships do consumers pay the most?

As for most parents out there, a Lego set was part of the Christmas wishlist, and I found myself in front of an impressive display of options. As I was looking through the boxes, I noticed something while comparing these two boxes:

Extra information: the two sets are identically priced at $39.99....
So it's not completely easy to spot on the boxes, but despite the same price, the Ninjago box contains almost twice as many pieces (575) as the Frozen one (292). Ninjago is a line of sets produced by Lego and therefore owned by Lego, while Frozen is the result of a partnership with Disney. Quickly scanning the different boxes, I seemed to see a trend there: similarly priced sets appeared to have fewer pieces for themes that were the result of external partnerships as opposed to internally developed ones.

Having some free time on my hands during the Christmas break, I extracted as much data as I could from the LEGO site, pulling for each set the theme, the number of pieces and the price. I was able to identify close to 700 sets, which provides a reasonable sample size for exploring some trends. Here are all the data points with price on the x-axis and number of pieces on the y-axis; some jitter was added but was not particularly necessary (prices tend to take discrete levels but not number of pieces).

A few observations:
  • the data is densely concentrated around the origin, and the outliers stretch the scale so much that it is hard to determine what exactly is going on there
  • there appears to be quite some variability in number of pieces for a given price point, which confirms my initial impression from the Lego store. Looking at the $200 vertical line, we see that there are boxes at that price with fewer than 1000 pieces, and others with over 2500!
  • overall, the relationship seems pretty linear along the lines of pieces = 10 * price: every $1 gets you about 10 pieces. I was rather expecting a convex shape where each incremental piece costs a little less than the previous one, similar to Starbucks, where the larger the drink, the better the size-to-price ratio. I guess this can somewhat make sense: with food/drinks, two one-size units are equivalent to a two-size unit (if a gallon costs too much I'll just buy two half-gallons), but two 300-piece Lego sets are not equivalent to a 600-piece Lego set, and so I guess Lego can afford to maintain the linear relationship.
And if you're wondering about the two data points in the upper right corner:
  • at 3808 pieces and $399, we have the tough-to-find Star Wars Death Star
  • at 4634 pieces and $349, we have the Ghostbusters Firestation (to be released in 2016)

Let's focus a little more around the origin where most of the data resides (92% of sets are priced less than $100):

Along the x-axis there appears to be a category of sets (green dots) consisting of just a few pieces but priced incredibly high. These belong to the Mindstorm category: very sophisticated Lego pieces, such as touch / light sensors, that allow you to build robots and are sold separately at high price points. In the rest of this post, we will exclude the Mindstorm category, as well as the Power Functions category for the same reason. The Dimensions category was also excluded given that the pieces, while not as sophisticated as for Mindstorm and Power Functions, were quite elaborate based on their interaction with the Playstation console (average pieces-to-price ratio is about 3).

There appears to be another category with its own specific piece/price relationship (red dots). While overall it seemed that every $1 was equivalent to about 10 pieces, this category seems to have a steep $1 for 1.5 pieces. This is actually the Duplos category for younger children, and the pieces are much larger than regular Legos. That being said, I'm wondering if Lego isn't taking advantage of all the parents eager to give their toddlers a head start in the Lego environment... Duplos are also thrown out for the rest of the post.

Back to our original question, how do the different themes compare to each other, and is there a price difference between internal and external brands?
The following boxplot provides some insight into the pieces-to-price ratio within each category. I've sorted them by decreasing median (a higher median is synonymous with a 'good deal', many pieces for every dollar). I've also color-coded them based on whether the theme was internal (red) or external (blue) to Lego.
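
For anyone wanting to reproduce the ranking, here is a minimal sketch of the pieces-to-price ratio and its median per theme. It only uses the handful of sets quoted in this post, not the full scrape of roughly 700 sets; the theme labels are mine and the $60 price for Battle on Takodana is an assumption (50% above $40).

```python
from statistics import median

# (theme, set name, pieces, price in USD), using figures quoted in this post.
# Theme labels are illustrative; the $60 Takodana price is an assumption.
sets = [
    ("Ninjago", "Ninjago set", 575, 39.99),
    ("Disney", "Frozen set", 292, 39.99),
    ("Star Wars", "Troop Carrier", 565, 40.00),
    ("Star Wars", "Battle on Takodana", 409, 60.00),
    ("Star Wars", "Death Star", 3808, 399.00),
    ("Ghostbusters", "Firestation", 4634, 349.00),
]


def median_ratio_by_theme(rows):
    """Return {theme: median pieces-per-dollar ratio} for the given sets."""
    ratios = {}
    for theme, _name, pieces, price in rows:
        ratios.setdefault(theme, []).append(pieces / price)
    return {theme: median(values) for theme, values in ratios.items()}


medians = median_ratio_by_theme(sets)

# Sort by decreasing median: the best deals come first.
ranking = sorted(medians.items(), key=lambda item: item[1], reverse=True)
for theme, ratio in ranking:
    print(f"{theme}: {ratio:.1f} pieces per dollar")
```

With so few sets per theme this says little about the real distributions, but it shows the quantity the boxplot is built on.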

Glancing at the graph, the two main take-aways are that:

  • there is strong variability within each category (in Star Wars for instance, the Troop Carrier set has 565 pieces for $40, while Battle on Takodana has fewer pieces (409) for a 50% higher price)
  • there does nonetheless seem to be a trend that internal themes have a better pieces-to-price ratio
We can try to explore the difference between the two types of themes via a linear regression, with a different slope and different intercept for each type:

The conclusion of the regression analysis is that the difference in slope between the two lines is not statistically significant (9.67 pieces/$ for external brands, 10.15 pieces/$ for internal brands), but there was a significant difference in intercept (50 fewer pieces for an external brand at the same price).
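
As a rough illustration of this kind of model, here is a sketch that fits one least-squares line per theme type. Fitting the two groups separately gives the same slope and intercept estimates as a single regression with a separate slope and intercept per type, but it does not reproduce the significance tests. The data points are hypothetical, not the scraped data, and `fit_line` is just a helper name I picked.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical (price in $, number of pieces) pairs, for illustration only.
internal = [(10, 105), (20, 198), (40, 410), (60, 600), (100, 1020)]
external = [(10, 50), (20, 150), (40, 345), (60, 540), (100, 935)]

fits = {}
for label, rows in (("internal", internal), ("external", external)):
    prices = [price for price, _ in rows]
    pieces = [count for _, count in rows]
    fits[label] = fit_line(prices, pieces)
    intercept, slope = fits[label]
    print(f"{label}: pieces = {intercept:.1f} + {slope:.2f} * price")
```

Comparing the two intercepts and the two slopes is then the informal version of the question the regression answers.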

So in summary, don't feel Lego is completely overpricing your Disney Princesses or Star Wars figurines, although there is a small pricing difference. If you do want the biggest bang for your buck, take a look at the Creator theme, and in particular here's the overall pieces-to-price winner (which I ended up getting my kid!):

Happy building!

Tuesday, December 22, 2015

2015 Summer Blockbuster review

Summertime rhymes with school break, fireworks and BBQ, but just as inseparable from it are Hollywood's big blockbusters. In early 2015, and even as early as late 2014, we had teasers and trailers for the upcoming wave of big budget movies, many of them sequels to blockbuster sagas. Terminator 5, Jurassic Park 4 anyone?

Starting early summer I pulled daily stats for 20 of the most anticipated summer movies of 2015, and before we enter the new year, let's see how they did.

Rating Evolution

In these two earlier posts I looked at the evolution of IMDB scores after movies' releases, as well as after Game of Thrones episodes aired. In both cases we observed a downward trend in ratings as time went by, although this phenomenon was much more sudden for TV episodes (a few days) than for movies (multiple weeks / months).

Here's the trend for our 20 blockbusters, all aligned according to release date, and titles ordered according to final rating:

Again, we observe the same declining trend, although the asymptote seems to be reached much sooner than for the average movie (earlier analysis).

Straight Outta Compton clearly emerges as the best-rated movie of the summer, although it did not benefit from as much early marketing as most of its competitors. It also distinguishes itself from the other movies in another way. While movies dropped an average of 0.3 rating points between release date and latest reading (not as dramatic as the 0.6 drop observed across a wider range of movies in the earlier analysis, as if summer blockbusters tend to decrease less and stabilize faster), Straight Outta Compton actually improved its rating by 0.1 (this is not entirely obvious from the graph, but the movie had a rating of 8.0 on its release date, jumped to 8.4 the next day, and slowly decreased to 8.1). Only two other movies saw their ratings increase: Trainwreck from 6.2 to 6.5 and Pixels from 4.8 to 5.7. The latter increase, while quite spectacular, still falls well short of making the movie a must-see, despite the insane amounts spent on marketing. As you might have noticed from my posts, my second hobby is basketball, and I remember this summer when not a day would go by without seeing an ad for Pixels, on TV or on websites, in which NBA stars battled monsters from 1970s arcade games.

Which brings us to the next question: did budget have any effect on how well the movies did, either from a sales or rating perspective? Of course we are very far from establishing a causal model here so we will have to satisfy ourselves with simple correlations across five metrics of interest: IMDB rating, number of IMDB voters, movie budget, gross US sales and Metascore (aggregated score from well-established critics).

I would have expected the highest correlation to be between IMDB rating and Metascore (based on another analysis I did comparing the different rating methodologies). However, it came in second place (0.73), and I honestly had not anticipated the top correlation (0.87) between budget and number of IMDB voters. Of course we can't read too much into this correlation, which could be completely spurious, but it might be worth confirming again later with a larger sample. If I had to give a rough interpretation though, I would say that a movie's marketing spend is probably highly positively correlated with the movie's budget. So the higher the budget, the higher the marketing spend and the stronger the 'presence of mind' the movie will have with users, who will be more likely to remember to rate it. Remember my example of all the Pixels ads? I didn't see the movie, but if I had, those ads might have eventually prompted me to rate the movie independently of how good it was, especially if those ads appeared online or even on IMDB itself.

But while we wait for that follow-up analysis, we can all start looking at the trailers for the most anticipated movies of next year, sequels and reboots leading the way once again: X-Men, Star Trek, Captain America, Independence Day...

Saturday, December 5, 2015

Is this movie any good? Figuring out which movie rating to trust

In October, FiveThirtyEight published a post cautioning us when looking at movie reviews "Be Suspicious Of Online Movie Ratings, Especially Fandango’s".

To summarize the post as concisely as possible, Walt Hickey describes how Fandango inflates movie scores by rounding up, providing the example of Ted 2 which, despite having an actual score of 4.1 in the page source, displays 4 and a half stars on the actual page. The trend holds across all movies: no movie ever had a lower score displayed, and in almost 50% of cases the displayed score was higher than expected.

So if Fandango ratings aren't reliable, it could be worth turning to another major source for movie ratings: the Internet Movie Database (IMDB).

Pulling all IMDB data, I identified just over 11K movies that had both an IMDB rating (provided by logged-in users) and a Metascore (provided by Metacritic, aggregating reviews from top critics and publications). The corresponding scatterplot indicates a strong correlation between the two:

In these instances, it is always amusing to deep-dive into some of the most extreme outliers.
To allow for fair comparisons across both scales (0-100 for Metascore, 0-10 for IMDB), we mapped both scales to the full extent of a 0-100 scale, by subtracting the minimum value, dividing by the observed range, and multiplying by 100.
So if all IMDB ratings are between 1 and 9, a movie of score 7 will be mapped to 75 ((7 - 1) / (9 - 1) * 100).
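
The rescaling is simple enough to sketch in a couple of lines. The function name `to_common_scale` is mine, and the delta is taken as IMDB minus Metascore on the normalized values, which is what the sign of the Delta column in the tables below suggests.

```python
def to_common_scale(value, observed_min, observed_max):
    """Map a rating onto 0-100 using the observed minimum and maximum."""
    return (value - observed_min) / (observed_max - observed_min) * 100


# The example from the text: ratings between 1 and 9, a movie scored 7.
print(to_common_scale(7, 1, 9))  # 75.0


def delta(imdb_norm, metascore_norm):
    """Discrepancy on the common scale (positive means IMDB is more generous)."""
    return round(imdb_norm - metascore_norm, 1)


# Normalized values from the first row of the first table below.
print(delta(7.2, 51.5))  # -44.3
```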

Here are the movies with highest discrepancies in favor of Metascore:
Movie Title | IMDB Rating | Metascore | Genre | IMDB (norm) | Metascore (norm) | Delta
--- | --- | --- | --- | --- | --- | ---
Justin Bieber: Never Say Never | 1.6 | 52 | Documentary, Music | 7.2 | 51.5 | -44.3
Hannah Montana, Miley Cyrus: Best of Both Worlds Concert | 2.3 | 59 | Documentary, Music | 15.7 | 58.6 | -42.9
Justifiable Homicide | 2.9 | 62 | Documentary | 22.9 | 61.6 | -38.7
G.I. Jesus | 2.5 | 57 | Drama, Fantasy | 18.1 | 56.6 | -38.5
How She Move | 3.2 | 63 | Drama | 26.5 | 62.6 | -36.1
Quattro Noza | 3.0 | 60 | Action, Drama | 24.1 | 59.6 | -35.5
Jonas Brothers: The 3D Concert Experience | 2.1 | 45 | Documentary, Music | 13.3 | 44.4 | -31.2
Justin Bieber's Believe | 1.6 | 39 | Documentary, Music | 7.2 | 38.4 | -31.2
Sol LeWitt | 5.3 | 83 | Biography, Documentary | 51.8 | 82.8 | -31.0
La folie Almayer | 6.2 | 92 | Drama | 62.7 | 91.9 | -29.3

And here are the movies with highest discrepancies in favor of IMDB's rating:
Movie Title | IMDB Rating | Metascore | Genre | IMDB (norm) | Metascore (norm) | Delta
--- | --- | --- | --- | --- | --- | ---
To Age or Not to Age | 7.5 | 15 | Documentary | 78.3 | 14.1 | 64.2
Among Ravens | 6.8 | 8 | Comedy, Drama | 69.9 | 7.1 | 62.8
Speciesism: The Movie | 8.2 | 27 | Documentary | 86.7 | 26.3 | 60.5
Miriam | 7.2 | 18 | Drama | 74.7 | 17.2 | 57.5
To Save a Life | 7.2 | 19 | Drama | 74.7 | 18.2 | 56.5
Followers | 6.9 | 16 | Drama | 71.1 | 15.2 | 55.9
Walter: Lessons from the World's Oldest People | 6.4 | 11 | Documentary, Biography | 65.1 | 10.1 | 55.0
Red Hook Black | 6.5 | 13 | Drama | 66.3 | 12.1 | 54.1
The Culture High | 8.5 | 37 | Documentary, News | 90.4 | 36.4 | 54.0
Burzynski | 7.4 | 24 | Documentary | 77.1 | 23.2 | 53.9

Quickly glancing through the tables, we see that documentaries are often the cause of the biggest discrepancies one way or the other. IMDB is extremely harsh with music-related documentaries: Justin Bieber has two "movies" in the top 10 discrepancies. Further deep-diving into those movies surfaces some interesting insights. For those movies, approximately 95% of voters gave either a 0 or a 10, and the remaining 5% a score between 1 and 9. Talk about "you either love him or hate him!". Breaking votes down by age and gender provides some additional background for the rating: independently of age category, no male group gave Justin an average higher than 1.8, whereas no female group gave less than 2.5. As expected, females under 18 gave the highest average scores (5.3 and 7.8 respectively for the two movies), but with 4 to 5 times more male than female voters, the abysmal overall scores were unavoidable.

OK, I probably spent WAY more time than I would have liked discussing Justin Bieber movies...

Let's look at how these discrepancies vary against different dimensions. The first one that was heavily suggested from the previous extremes was naturally genre:

We do see that Documentaries have the widest overall range, but all genres including documentaries actually have very similar distributions.

Another thought was to split movies by country. While IMDB voters are international, Metascore is very US-centric. So for foreign movies, voters from that country might make up a larger share of the votes, and greater discrepancies might be observed when comparing to US critics. However, there did not seem to be a strong correlation between country and rating discrepancy.

We do observe a little more fluctuation in the distributions when splitting by country rather than genre, but all distributions still remain very similar. The US has more extremes, but this could also be a result of the fact that the corresponding sample size is much larger.

As a final attempt, I used the movie's release date as a last dimension. Do voters and critics reach a better consensus over time, and do we see a reduced discrepancy for older movies?

The first thing that jumps out is the strong increase in variance of the delta score (IMDB - Metascore). Similarly to the United States in the previous graph, this is most likely a result of sample size. While IMDB voters can decide to vote for very old movies, Metacritic doesn't have access to reviews from top critics in 1894 to provide Carmencita with a Metascore (although I'm not sure it would be as generous as the IMDB score of 5.8 for this one-minute documentary whose synopsis is: 'Performing on what looks like a small wooden stage, wearing a dress with a hoop skirt and white high-heeled pumps, Carmencita does a dance with kicks and twirls, a smile always on her face.')

But the more subtle trend is the evolution of the median delta value. Over time, it seems that the delta between IMDB score and Metascore has slowly increased, from a Metascore advantage to an IMDB advantage. From a critic's perspective, it would appear as if users underrated old movies and overrated more recent ones. However, the upward trend stabilized as we entered the 21st century.

I couldn't end this post without a mention of Rotten Tomatoes, also highly popular with movie fans. While Metacritic takes a more nuanced approach to how it scores a movie based on the reviews, Rotten Tomatoes only rates each review as 0 (negative review) or 1 (positive review) and takes the average. In his blog post, Phil Roth explores the relationship between Metascores and Rotten Tomatoes and, as expected, finds a very strong relationship between the two:

So to quickly recap, there is clearly a strong correlation between the two ratings, but with enough variance that both provide some information. For some people, their tastes might better align with top critics whereas others might have similar satisfaction levels as their peers. Personally, I've rarely been disappointed by a movie with an IMDB score greater than 7.0, and I guess that I'll continue to use that rule of thumb.

And definitely won't check Fandango.

Tuesday, November 24, 2015

NBA stock: 2016 is a highly volatile year!

Some things never change.
I haven't followed the new 2015-2016 NBA season as closely as I would have liked, but it's impossible to ignore that our two finalists from last year are still dominating their conferences, San Antonio keeps chugging along as it has done since like forever, and similarly to last year the Lakers and 76ers are in for a terrible season if the first 20% of the season is any indication.

That being said, there is a rather lengthy list of surprises, good and bad:
  • The Knicks were plain awful last year but now have a winning record
  • The Rockets and Clippers gave us an intense Western Conference semifinal last season, both playing at a very high level, but both now have losing records
  • The Hawks dominated the Eastern Conference last year yet are now ranked 6th in that same conference
  • The Pelicans made the Playoffs last year but this year are playing just slightly better than the Lakers
  • The Jazz were ranked 11th in the Western Conference last year, and are now ranked second
I could go on, but you get the idea. Now of course, nobody would expect every season to be an exact replica of the previous one: players get injured, players transfer, coaches come and go... So some volatility in rankings is expected, but the question is how much?

Going back to the 2005-2006 season, I pulled final regular season rankings for each season and each team (taking relocations into account for the Sonics, Bobcats and Hornets), and looked at the absolute change from year to year. For instance, the Toronto Raptors finished 10th in the Eastern Conference in 2012-2013, 3rd in 2013-2014 and 4th in 2014-2015. This would therefore be counted as a change of 7, followed by a change of 1 in conference ranking. Rank changes were averaged across all teams for each year. Here's the evolution of average rank change across all teams:
It appears that my hunch was not entirely unfounded: as of today (2015-11-24), current rankings have never been as different from the previous year going back to 2005-2006! (of course the season has only kicked off, and we are not entirely comparing apples to apples). 2014 was a close second, when 6 out of 30 teams had their conference rankings change by 7 or more positions.
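
To make the metric concrete, here is a small sketch of the rank-change computation. The Raptors figures are the ones quoted above; the three-team "conference" and the function names are made up for illustration.

```python
def rank_changes(ranks_by_season):
    """Absolute change in conference rank between consecutive seasons."""
    return [abs(later - earlier)
            for earlier, later in zip(ranks_by_season, ranks_by_season[1:])]


# Raptors: 10th in 2012-2013, 3rd in 2013-2014, 4th in 2014-2015.
print(rank_changes([10, 3, 4]))  # [7, 1]


def average_rank_change(previous, current):
    """Mean absolute rank change over teams present in both seasons.

    previous and current map team name -> conference rank.
    """
    teams = previous.keys() & current.keys()
    return sum(abs(current[t] - previous[t]) for t in teams) / len(teams)


# Hypothetical three-team conference, for illustration only.
last_year = {"A": 1, "B": 2, "C": 3}
this_year = {"A": 3, "B": 2, "C": 1}
print(round(average_rank_change(last_year, this_year), 2))  # 1.33
```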

What if we were to split out results for each conference?
It appears that the Eastern Conference is typically much more volatile than its western counterpart. Up until 2014-2015, the Western Conference had never had an average rank change exceeding 3.2, a value that the Eastern Conference exceeded 5 times in the last 9 years! But comparing the first 15 games of the 2015-2016 season to last year's final standings, we have an average rank change of 4, tying the maximum value ever observed in either conference.

To finish off, it would be interesting to put these values in context and evaluate how much carry-over there is from one season to the next. Is an average of 3 or 4 rank changes per team high? or low? Our baseline would be a completely randomized basketball association where players are completely reshuffled from one year to the next and so each year's ranking is entirely random. I ran 100,000 simulations to see what the expected number of rank changes would be.
It turns out that a value of 4 is not particularly extreme: in our purely randomized world, about 15% of seasons would be less volatile rank-wise than what we are witnessing today!
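
Here is a minimal sketch of what such a randomized baseline could look like. It assumes each simulated season is a single 15-team conference whose new ranking is a uniformly random reshuffle of the previous one, and it uses fewer simulations than the 100,000 mentioned above to keep it quick; the function names and the seed are arbitrary choices of mine.

```python
import random


def simulated_average_rank_change(n_teams, rng):
    """Average absolute rank change when next year's ranking is purely random."""
    new_ranks = list(range(1, n_teams + 1))
    rng.shuffle(new_ranks)
    # Team that finished at rank `old` last year now finishes at rank `new`.
    changes = [abs(new - old) for old, new in enumerate(new_ranks, start=1)]
    return sum(changes) / n_teams


def summarize(threshold, n_teams=15, n_sims=20_000, seed=0):
    """Return (mean simulated value, share of simulated seasons below threshold)."""
    rng = random.Random(seed)
    results = [simulated_average_rank_change(n_teams, rng) for _ in range(n_sims)]
    mean = sum(results) / n_sims
    share = sum(r < threshold for r in results) / n_sims
    return mean, share


mean_change, share_less_volatile = summarize(threshold=4.0)
print(f"mean average rank change: {mean_change:.2f}")
print(f"share of random seasons below 4: {share_less_volatile:.1%}")
```

For a random reshuffle of n teams the expected absolute change per team is (n^2 - 1) / (3n), which is about 4.98 for n = 15, so an observed average of 4 sits below the random baseline but not dramatically so.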

I've spent quite some time running basketball analyses, from the number of expected runs and the incremental value of home court advantage to trying to forecast game outcomes based on team performance, yet it seems my conclusion is always the same: there is only so much statistics can uncover. No matter what approach you take, there always seems to be a strong unexplained random component which makes every team, every season, every championship so unique!

Friday, October 16, 2015

A brief history of NBA runs: Do teams really 'get hot'?

"Cleveland with a 10-0 in the last 2:05...."
"Warriors answering with an 8-0 run over the end of the first and beginning of the second quarter..."

I've watched a number of games during this 2014-2015 season, Playoffs included, and couldn't help but notice the amount of runs being announced on screen. This was particularly true when Chicago went on long scoring droughts against Cleveland in the Eastern semifinals.

But a question that nagged me all this time and which I wanted to investigate a little further is whether these droughts - or runs depending on whose side you're on - are natural and expected, or on the contrary are influenced by external factors.

In a previous post I took a closer look at overtimes in the NBA and showed that because they are an equilibrium caught between two highly unstable states (team A losing by a handful of points on one side, team B losing by a handful of points on the other), overtimes are about three times more likely to occur than one would naively expect. Could the same be said for runs? Once a team has gone on an 8-0 run, is it more likely to push it to 10-0? The 8-0 run could be the result of one team having a much better lineup on the floor, or a player with a particularly hot hand (although the notion of hot hand is debatable, one I will probably look into in a future post). Or is the team on the bad end of the run more likely to score, perhaps by calling a timeout to stop the first team's 'mojo' or to set up a specific play with higher scoring probability?

The first step was to collect as much data as possible. I pulled from all available games (regular season and playoffs but excluding pre-season), all the way back to the 2009-2010 season.
For each game, I split it into a succession of 'runs' and for each I computed the number of possessions and points.
Consider for instance the first few minutes of Game 1 of this year's Western Conference Finals between the Houston Rockets and Golden State Warriors:

After parsing the information, we would get something like:
  • Rockets had a 2-0 run over 45s
  • Warriors then had a 2-0 run over 15s
  • Rockets then had a 7-0 run over 1m51s
  • ...

Although points are what is always reported and what everyone ultimately cares about, I decided to focus on the number of scoring possessions instead. A 7-0 run that is the result of seven consecutive trips to the free throw line, with 1 out of 2 free throws being made each time, is very different from a three-pointer followed by another three-pointer on which the shooter is fouled and completes the four-point play. In the former case the scoring team needs to get (at least) 6 defensive stops, in the latter they need only one.

So the data would actually look like this:
  • Rockets score 2 points on 1 scoring possession
  • Warriors score 2 points on 1 scoring possession
  • Rockets score 7 points on 4 scoring possessions
  • ...
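
The grouping step can be sketched as follows, assuming the play-by-play has already been reduced to a list of scoring possessions. The way the Rockets' 7 points are split across their 4 possessions is hypothetical, and `to_runs` is just a name I picked.

```python
from itertools import groupby

# Each tuple is one scoring possession: (team, points scored on that possession).
# The split of the Rockets' 7 points over 4 possessions is hypothetical.
scoring_possessions = [
    ("Rockets", 2),
    ("Warriors", 2),
    ("Rockets", 2), ("Rockets", 2), ("Rockets", 2), ("Rockets", 1),
]


def to_runs(possessions):
    """Collapse consecutive scoring possessions by the same team into runs.

    Returns a list of (team, total points, number of scoring possessions).
    """
    runs = []
    for team, group in groupby(possessions, key=lambda item: item[0]):
        points = [pts for _, pts in group]
        runs.append((team, sum(points), len(points)))
    return runs


for team, points, n_possessions in to_runs(scoring_possessions):
    print(f"{team} score {points} points on {n_possessions} scoring possession(s)")
```

The third element of each tuple is the run length used in the rest of the analysis.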

So the question we are interested in is how many consecutive times can a team score uninterrupted?

Preliminary Graphs
Before we jump into any modeling, let us first look at the frequency of uninterrupted scoring possessions:

That's quite a nice shape! It seems the occurrence of every run is a little under half of the previous number. We can verify this ratio visually:

Indeed, the frequency of each run is a remarkably stable ~45% of the previous run frequency.

A natural question is whether there is a difference between regular season games and Playoff games. Defense is supposed to be cranked up a notch, so are longer runs less frequent? In the following graph, the proportion of runs in Playoff games is represented via the red histogram, that of regular season games via the blue histogram, and the overlap is purple.

The little blue tip would suggest slightly more runs of 1 scoring possession in the regular season (and hence slightly more runs of 2+ scoring possessions in the Playoffs), but a Chi-Square test reveals no statistically significant difference between the two histograms.

We can also split runs by home and road team. Can homecourt provide an additional boost and extend runs?

It appears as if the home team is slightly more likely than the road team to have longer runs, and this time the difference (as small as it appears visually) is significant. Once the home team gets it going and gets the crowd involved, good things happen!

Yet another splitting option is by quarter. Perhaps a team hasn't got its rhythm in the first quarter and is more likely to suffer a run, whereas the defense is a little tougher in the fourth quarter thus limiting scoring opportunities. To avoid overlaying 5 histograms over each other (the fifth being for overtimes), I used lines instead:

Although the lines appear nearly identical on the graph, the Chi-Square did pick up a significant difference across the associated table, even when dropping runs in overtime (harder to get long runs in a 5 minute overtime than a 12 minute quarter).
But it turns out that longer runs are more likely in the fourth quarter than the first. Either the defense gets a little tired, or the losing team realizes that they need to step things up quickly to avoid picking up the L.

Having a better sense of how the runs behave, we can apply a little bit of modeling.
Let us assume that the two teams are of similar strength, and that when they have possession of the ball they both have the same probability p of scoring.
Team A just scored, team B now has possession of the ball. What is team B's probability of interrupting A's run? There are theoretically an infinite number of ways for that to happen, the pattern being quite obvious:
  • B scores (probability = p)
  • B misses, A misses, B scores (probability = (1-p)(1-p)p)
  • B misses, A misses, B misses, A misses, B scores (probability = (1-p)(1-p)(1-p)(1-p)p)
  • ...
  • (B misses, A misses) n times, B scores (probability = (1-p)^(2n) * p)

Reminder: p is the probability of scoring on a team's possession, so it incorporates missing a shot but getting the offensive rebound and shooting again for instance.

Adding all the pieces yields the probability of team A's run to be interrupted:
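
Written out, and treating the list above as a plain geometric series in (1-p)^2, the sum would read as follows (my own reconstruction from the bullets):

```latex
P(\text{interrupted})
  = \sum_{n=0}^{\infty} (1-p)^{2n}\, p
  = \frac{p}{1 - (1-p)^2}
  = \frac{p}{p\,(2-p)}
  = \frac{1}{2-p}
```

For p = 25%, this gives 1/1.75, or about 57%.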

Let's now look at the probability of extending the run by exactly n more possessions, which we will denote P(n). We will break up this probability as the probability of scoring one more time and then exactly (n-1) times to get the recurrence with P(n-1):
  • B misses, A scores, A scores exactly (n-1) more times (probability = (1-p)p * P(n-1))
  • B misses, A misses, B misses, A scores, A scores exactly (n-1) more times (probability = (1-p)(1-p)(1-p)p * P(n-1))
  • ...
  • (B misses, A misses) k times, B misses, A scores, A scores exactly (n-1) more times (probability = (1-p)^(2k) * p(1-p) * P(n-1))
Adding everything up yields:
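
Summing over the number of (B misses, A misses) rounds, which I index by k here to keep it separate from n, the recurrence should come out as (again my own reconstruction from the bullets):

```latex
P(n)
  = \sum_{k=0}^{\infty} (1-p)^{2k}\,(1-p)\,p \; P(n-1)
  = \frac{(1-p)\,p}{1 - (1-p)^2}\; P(n-1)
  = \frac{1-p}{2-p}\; P(n-1)
```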

And so, realizing that P(interrupted) is actually P(0), we get the general formula:
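
Under the model's assumptions, each additional scoring possession multiplies the probability by (1-p)/(2-p), starting from P(0) = 1/(2-p), so the closed form would be (a reconstruction, not a copy of the original formula image):

```latex
P(n) = \frac{1}{2-p} \left( \frac{1-p}{2-p} \right)^{n}, \qquad n = 0, 1, 2, \dots
```

As a sanity check, these probabilities sum to 1 over all n, since the geometric series with ratio (1-p)/(2-p) sums to (2-p).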

Let's get a few curves for various values of p:

I've overlaid the empirical curve (blue curve in bold). It's a little difficult to spot as it is extremely close to the curve with p = 25%.

Here's the plot with just those two curves:

The similarity in the two curves is really impressive!
We can even refine the true value by fitting a model to identify the value of p which best fits our empirical data. The result is 24.2%.
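
For the curious, here is a sketch of the model curve and of one possible way to fit p. It assumes the closed form P(n) = (1/(2-p)) * ((1-p)/(2-p))^n that the geometric sums above lead to, and it uses a least-squares grid search, which is only one of several reasonable fitting choices and not necessarily the one behind the 24.2% figure. The "observed" frequencies are generated from the model itself, not taken from the NBA data.

```python
def extension_prob(n, p):
    """Model probability that a run is extended by exactly n more scoring possessions."""
    return (1 / (2 - p)) * ((1 - p) / (2 - p)) ** n


def fit_p(observed, grid_size=1000):
    """Grid-search the p in (0, 1) minimizing squared error against observed frequencies.

    observed[n] is the observed frequency of runs extended by exactly n possessions.
    """
    best_p, best_error = None, float("inf")
    for i in range(1, grid_size):
        p = i / grid_size
        error = sum((freq - extension_prob(n, p)) ** 2
                    for n, freq in enumerate(observed))
        if error < best_error:
            best_p, best_error = p, error
    return best_p


# Synthetic frequencies generated from the model with p = 24.2% (not real data).
synthetic = [extension_prob(n, 0.242) for n in range(10)]
print(fit_p(synthetic))  # 0.242

# Ratio between consecutive run frequencies implied by the model.
print(round(extension_prob(1, 0.242) / extension_prob(0, 0.242), 3))  # 0.431
```

Note that in this indexing a run of L scoring possessions corresponds to n = L - 1 extensions after the first score.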

But back to our initial problem. Recall that we were trying to determine whether runs occur as frequently as one would expect, or whether external factors make them more or less likely. In our model, we assume no such external effects: the probability of any team scoring when it has possession of the ball is a constant p, and does not depend on the past (whether team A has scored 0, 1 or 10 consecutive times already, the same way heads will come up 50% of the time with a fair coin even if we've just had a run of ten heads or ten tails right before). And the fact that under this assumption theoretical and empirical values match so well would suggest that there are no external effects (or that they perfectly compensate each other!), and that when we observe 8-0 or 11-0 runs, we were simply bound to see them occur.

Let's end all the modeling with a fun fact: any idea what the greatest run from these past years has been (with one team remaining scoreless)? 15-0? 19-0? Turns out it was 29-0, by the Cleveland Cavaliers led by LeBron James.... before his Miami days. Over the first two quarters of the game and almost 9 minutes, the Cavs scored 29 consecutive points on 19 possessions over the Milwaukee Bucks on Dec 6th 2009 (a day short of the Pearl Harbor Anniversary!).

Monday, August 24, 2015

There's also a Wall in Game of Thrones' ratings

A few months back, I looked at the evolution of various metrics (viewership, Metacritic ratings, IMDB ratings) for the critically acclaimed TV show Game of Thrones, and also compared overall ratings for the show with Breaking Bad and The Wire, which also obtain top ratings from critics and viewers. As I was doing this, I stumbled across the fact that the last Game of Thrones episode (the analysis being done in the middle of the fifth season) had a rating of 10 on IMDB. I had never seen that before. Not a single Breaking Bad or The Wire episode had notched this rating, although Breaking Bad did manage to get two 9.8s. But when I checked back a few days later, the rating of 10 had fallen to 9.8, and it would continue to fall over the next few days.

I had done another post on the evolution of movie ratings and had also noticed that ratings are typically at their highest when the movie comes out and then drop in the following weeks. Does the same phenomenon apply to TV series, albeit on a much shorter time scale?

I pulled IMDB ratings for the final three episodes of the fifth season of Game of Thrones every 15 minutes. There are no spoilers in what follows, unless you consider ratings a spoiler of some kind. While ratings won't tell you what happens (or in Game of Thrones' case, who gets killed!), they might yield insight as to how intense the episode is (or in Game of Thrones' case, how many people get killed!). Consider yourselves warned...

So here's the raw data:

And the total number of voters:

So here's what stands out:

  • all episodes go through a phase when their rating is 10, this occurs in the few hours before the episode airs (horizontal grey lines) and typically based on a few hundred voters only;
  • ratings don't start at 10, that's only the peak value right before air time;
  • rating drops very quickly as the episode airs and right after;
  • value after a few hours is essentially the final stabilized value (maybe another 0.1 drop a few days/weeks later), but much much faster than what we had observed for evolution of movie ratings which typically stabilized after a few months;
  • approximately 80% of voters voted during the first week after the episode was aired;
  • there are bumps in the number of votes when the following episode airs: I'd guess that it's either people who missed the previous episode and do a quick catch-up so they are up-to-date for that evening's new episode, or people who, realizing that a new episode is airing that evening, are reminded that they did not vote for the previous week's episode

Now the real question is: who are the few hundred people who rated the episode a 10? HBO employees trying to jumpstart the high ratings? Mega fans able to access an early version of the episode? Mega fans already voting without having seen the episode and assuming that all episodes are worth a 10? But then who are the handful of people who voted less than 10 before then?

I myself have not watched the fifth season but am VERY intrigued by that eighth episode. As of today, its rating is still 9.9, the highest of the entire series, and also higher than that of any Breaking Bad or The Wire episode...

Thursday, July 23, 2015

Love at first sight? Evolution of a movie's rating

Every day, over 3 million unique visitors go to IMDB, and I am very often one of those. With limited time to watch movies, I heavily rely on IMDB's ratings to determine whether a movie is a rental or theatre go/no-go.

My memory is sufficiently bad that I sometimes need to check a new release's rating a few days apart, but sufficiently good that I can remember rating changes. The most common scenario consists of me checking a whole bunch of ratings for different movies, then trying to talk my wife into going to see the one I like best, using IMDB's rating as an extra argument. She'll systematically - and skeptically - ask for the movie's rating (she's also part of the daily 3 million). And I'll say it's really good, something like 8.3, here let me show you... and a 7.4 appears on my screen and I look like a fool.

Of course, it's completely expected that IMDB ratings would evolve, and even more so when the movies have recently been released and have few voters: I fully anticipate Charlie Chaplin's City Lights to still have an 8.6 rating a month from now, but don't think Ted 2 will still have a 7.2 next month, or even when this post gets published! But the question I had was how do movie ratings evolve? How long does it take them to reach their asymptotic value? Are movies over- or under-rated right around their release date? It would seem reasonable that people would have a tendency to overestimate new movies they just saw at the movie theatre. The bias is due to the fact that if they made the effort of going to see the movie shortly after its release, they were probably anticipating it would be worth their money and time. Therefore, they might overrate the movie after seeing it, independently of its quality, to remain coherent with their prior expectations ('coherency principle' in psychology).
An internet blogger by the name of Gary didn't really phrase it as such, taking the approach of insulting people who bumped 'Up' to the 18th position of best all-time movies in IMDB. A fun read.

To monitor rating evolution, I extracted daily IMDB data up until 2015 for 22 different movies released in 2012: Branded, Cloud Atlas, Cosmopolis, Dredd 3D, Ice Age: Continental Drift, Killing Them Softly, Lincoln, Paranormal Activity, Paranorman, Rec3: Genesis, Resident Evil Retribution, Rise of the Guardians, Savages, Skyfall, Sparkle, The Big Wedding, The Bourne Legacy, The Dark Knight, The Expendables, The Words, Total Recall and Twilight: Breaking Dawn 2.

For each movie I recorded three main metrics of interest: the IMDB rating, the number of voters and the metascore (from Metacritic, which aggregates reviews to generate a unique rating out of 100 for movies, TV series, music and video games).

Here's an example of the data plotted for The Dark Knight:

Originally rated 9.2, it dropped to 8.8 in the first month, then dropped a little more to 8.6 after 6 months where it appears to have stabilized. I just checked and it seems to have dropped an additional 0.1 point, now at 8.5 three years after release. Let's now look at the number of people who voted for it:

The number of voters rapidly increased right after the release, and although it isn't increasing as fast afterwards, many people continue to vote for it. This curve is quite typical across all movies.

Finally, let's look at the metacritic score:

I'm not sure we can even talk about a curve here. Momentarily rated 85, the metascore dropped to 78 at release and hasn't changed to this date. This is perfectly normal as the metascore is based on a small sample of official critics ('Hollywood Reporter', 'Los Angeles Times', 'USA Today'). Reviews are released around the same time as the movie, and no critic is going to be reviewing the Dark Knight Rises today, which is why metascores are so stable.

Ignoring the y-axes, the shapes of the curves are quite similar across movies, though there are some outliers worth showing.

Increasing IMDB rating? Most movies seem to get overrated at first and stabilize to a lower asymptotic rating. But in certain cases we see the rating increase after the release, as seen here with The Big Wedding with Robert De Niro and Diane Keaton:

Still with The Big Wedding, the staggered worldwide release dates clearly show up in the shape of the number-of-voters curve.

Based on our small sample, can we estimate by how much a movie's rating is overestimated when it is released? After how many months does the rating stabilize?

Combining the data for all the movies together and aligning them based on their release date (thus ignoring any seasonal effects), we obtain the following graph where the x-axis is weeks since release:

We see a steady decline in rating by about 0.6 points over a period of about 7 months. A more surprising phenomenon is the upward trend in ratings that starts about a year after the original release. The trend seems quite strong; however, we should keep in mind that our original sample size of movies was small (22), and we only have data beyond 75 weeks after release for a handful of those movies, so the upward trend on the very right could be completely artificial and a great example of overfitted data!
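
For readers who want to reproduce the alignment step, here is a minimal sketch of how daily ratings can be re-indexed by weeks since release and averaged across movies. It is not the code I used for the graph: the function name `weekly_average_since_release`, the tuple layout and the sample values are all assumptions made up for illustration. It also returns the number of movies behind each weekly average, which is exactly the figure to watch on the far right of the curve.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def weekly_average_since_release(observations, release_dates):
    """Average ratings across movies by whole weeks since each movie's release.

    observations: iterable of (title, observation_date, rating) tuples.
    release_dates: dict mapping title -> release date.
    Returns {week: (mean_rating, number_of_movies)} for weeks >= 0.
    """
    # Average within each (movie, week) first, so that a movie with more
    # daily observations in a given week does not outweigh the others.
    per_movie_week = defaultdict(list)
    for title, obs_date, rating in observations:
        week = (obs_date - release_dates[title]).days // 7
        if week >= 0:  # pre-release observations are left out
            per_movie_week[(title, week)].append(rating)

    per_week = defaultdict(list)
    for (_title, week), ratings in per_movie_week.items():
        per_week[week].append(mean(ratings))

    return {week: (mean(vals), len(vals)) for week, vals in sorted(per_week.items())}


# Made-up illustrative data, not the ratings collected for this post.
release_dates = {"Movie A": date(2012, 7, 20), "Movie B": date(2012, 11, 9)}
observations = [
    ("Movie A", date(2012, 7, 21), 9.0),
    ("Movie A", date(2012, 7, 30), 8.8),
    ("Movie A", date(2012, 8, 5), 8.6),
    ("Movie B", date(2012, 11, 10), 7.0),
    ("Movie B", date(2012, 11, 18), 6.8),
]
result = weekly_average_since_release(observations, release_dates)
for week, (avg, n_movies) in result.items():
    print(f"week {week}: mean rating {avg:.2f} over {n_movies} movie(s)")
```

In this toy example the last week is backed by a single movie, which is the same situation as the right-hand tail of the real graph.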

A similar break in trend occurs for the number of voters:

As for the metascore, it appears remarkably stable right after release for the reasons mentioned previously.

In a nutshell, movies do appear to be very slightly overestimated at release time (assuming the long-term asymptote is a movie's "true" rating), and the difference in rating (approximately 0.3 between the first month and months 2 through 6) was small yet significant (based on a paired t-test).
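
As a reminder of what a paired t-test does, here is a minimal sketch that computes the test statistic from per-movie pairs (first-month rating versus rating over months 2 through 6). The helper name `paired_t_statistic` and the five pairs of values are hypothetical and only there for illustration; they are not the ratings behind the result quoted above.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(first, second):
    """Return (t, degrees_of_freedom) for a paired t-test on two equal-length samples."""
    if len(first) != len(second):
        raise ValueError("paired samples must have the same length")
    if len(first) < 2:
        raise ValueError("at least two pairs are needed")
    # The test works on the per-movie differences, not on the two groups separately.
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    # Mean difference divided by its standard error.
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1


# Made-up ratings for five hypothetical movies, not the data behind this post.
first_month = [8.3, 7.1, 6.4, 7.8, 5.9]
months_2_to_6 = [8.0, 6.9, 6.0, 7.5, 5.7]
t, dof = paired_t_statistic(first_month, months_2_to_6)
print(f"t = {t:.2f} with {dof} degrees of freedom")
```

The statistic is then compared against a t distribution with that many degrees of freedom. Pairing matters here because ratings vary far more between movies than within a movie over time, so a small but consistent drop can still be significant.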

So if you do use IMDB to help your movie selection, definitely keep in mind that while the movie is probably good, it most likely isn't as good as 'Up'.

Saturday, June 6, 2015

The "Overtime effect": Why things go crazy in the final seconds of regulation

Jeff Ely is an Economics Professor at Northwestern, and in 2009, he and one of his PhD students, Toomas Hinnosaar, wrote a blog post entitled "The Overtime Spike in NBA Basketball".

(Incidentally, it was after reading this post shortly after it had been published that I realized that very granular basketball data was publicly available, which led me to generate so many basketball-related articles on this very blog).

As indicated by the title, Jeff and Toomas noticed that many more NBA basketball games ended in overtime than one would expect from considering both teams' final scores as independent random variables. This assumption does seem very flawed from the start anyways, as both teams adapt to the other team's playing style and general pace of the game. Except for blowout games (consider the recent 120-66 destruction of the Milwaukee Bucks by the Chicago Bulls in a Playoff game), there is a rather strong correlation between points scored by each of the teams:

But Jeff and Toomas went further than just highlighting the discrepancy between expected games with overtime (~2%) and actual games with overtime (~6%): they uncovered a surprising spike in score difference which emerges only seconds before the end of regulation.

I recently thought back about this analysis and wanted to revisit it, looking at the following questions:
  • Do we still observe the same phenomenon nowadays?
  • Do we observe the same effect towards the end of overtimes? One could argue that overtimes are quite likely to lead to more overtimes, given that whatever behavior emerged at the end of regulation will probably appear again at the end of overtime, but also that in only five minutes versus 48 minutes, scores have much less time to diverge.
  • Do we observe the same effects during the Playoffs?

Jeff and Toomas' analysis used data from all games between 1997 and 2009. I pulled all subsequent years, from 2009 to 2015, separating regular season and playoff games (it is not entirely clear if the original analysis combined both types of games or focused on the regular season only). Similarly to the original analysis, I defined score difference as home team's score minus road team's score, so a positive value could be interpreted as homecourt advantage.

First off, here is the evolution of the mean and the standard deviation of the score differential throughout regulation for regular season games, followed by playoff games:

The curves are extremely similar, with the home team advantage gradually increasing throughout the game, especially in the second half of playoff games. But the standard deviations are very large compared to the point differential. It is interesting to see the standard deviations increase at a decreasing rate and even decrease in the final minutes. This probably happens in games whose outcome is already certain, where starters are pulled out and the losing team is able to somewhat reduce the point differential. Given the standard deviation of the point differential, it now makes sense that overtimes are theoretically quite unlikely.
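
To see why a wide spread in point differential makes ties theoretically rare, here is a rough sketch of the naive calculation, assuming the two final scores are independent and approximately normal. This is not Jeff and Toomas' computation, and the means and standard deviations below are hypothetical values picked for illustration, not estimates from the data.

```python
from math import sqrt
from statistics import NormalDist


def tie_probability(home_mean, home_sd, road_mean, road_sd):
    """Approximate P(scores are equal) if the two scores were independent normals."""
    # The difference of two independent normals is normal: the means
    # subtract and the variances add.
    diff = NormalDist(home_mean - road_mean, sqrt(home_sd**2 + road_sd**2))
    # Scores are integers, so a tie corresponds to a difference within
    # half a point of zero (continuity correction).
    return diff.cdf(0.5) - diff.cdf(-0.5)


# Hypothetical parameters chosen for illustration, not estimated from the data.
p = tie_probability(home_mean=100, home_sd=12, road_mean=97, road_sd=12)
print(f"P(tie at end of regulation) ~ {p:.1%}")
```

With a standard deviation of well over ten points on the difference, only a small slice of the bell curve sits within half a point of zero, so under this naive model a tie after 48 minutes comes out as a rare event.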

Do we still observe the same phenomenon nowadays?

Let us look at an animation of the score difference as the game progresses (regular season games only):

I generated a similar video taking a closer look at the last quarter at a finer level of granularity (6s increments instead of 30s).

Everything behaves as expected for the first 47 minutes of the 48-minute game. On a slightly more technical note, if we were to assume that the scores for each team at any given point in time are approximately normal and independent, then the difference of the two would also be normal. This assumption does not seem to be violated for most of the game, except when it matters most, right at the end of regulation:

While the final graph is somewhat surprising at first glance, it makes a lot of sense for those who have seen a few close games on TV. In the middle of the game, a team losing by a handful of points is not going to freak out and start radically changing its strategy. Points come and go quickly in basketball, and losing by two points heading into the third quarter or even the fourth is clearly not synonymous with defeat. However, losing by two points with 10 seconds left is a whole different story. Defeat is in plain view. If you have possession of the ball, you need to score quickly and close the gap. If the other team has possession, things look gloomier. You can't let them run out the clock and need to get possession back. Teams do so by intentionally fouling, hoping the other team won't make all of its free throws so that they can get the ball back. If the game is tied with only a few seconds left, teams won't panic and intentionally foul; one team might go for a buzzer-beater but without taking unnecessary risks. So in other words, the closing seconds of a game have the particularity that:
  • wide score differences are a stable equilibrium, the losing team has essentially thrown in the towel
  • small score differences are highly unstable, the losing team is going to seek to reach a score difference of 0 (see next case) or gain the lead in which case we remain in an unstable state with roles reversed
  • score difference of 0 is a stable equilibrium stuck between two highly unstable states
With this perspective, the distribution graph makes complete sense!

Do we observe the same effect towards the end of overtimes?

It wouldn't be too far-fetched to consider an overtime as a 5 minute version of a full game given that both teams start off with a tie. Here's the animation of the score difference over all overtimes (combining first, second, third... overtimes) in the regular season from 2009 to 2015:

So in a nutshell, we do indeed observe the same phenomenon, which makes perfect sense given that not only do we find ourselves in the same state of stable/unstable equilibrium in the last possessions of the game, but scores have also had less time (5 vs 48 minutes) to diverge.
But as divergence is less likely, is a second overtime more likely than a first overtime? What about a third overtime? Will scores diverge even less as players get tired, players foul out and the stakes are raised, leading to even more conservative game play?

Here are the numbers of interest:
For the 5876 regular season games considered, 373 went to overtime (6.3%).
Out of the 373 games that went to a first overtime, 62 went to a second overtime (16.6%).
Out of the 62 games that went to a second overtime, 15 went to a third overtime (24.2%).
Only one game of those 15 (6.7%) eventually ended in quadruple overtime, with the Hawks outlasting the Jazz 139-133.
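
Each percentage above is a conditional rate: the share of games that reached a given stage and then went on to the next one. As a quick check, the figures can be recomputed from the counts quoted above in a few lines of Python (the variable names are mine):

```python
# Counts quoted above: games played, then games reaching each successive overtime.
games_reaching = [5876, 373, 62, 15, 1]
labels = ["first", "second", "third", "fourth"]

# Conditional rate of going one stage further, given the previous stage was reached.
rates = [nxt / prev for prev, nxt in zip(games_reaching, games_reaching[1:])]

for label, prev, nxt, rate in zip(labels, games_reaching, games_reaching[1:], rates):
    print(f"{nxt} of {prev} games went on to a {label} overtime ({rate:.1%})")
```

Keep in mind that the denominators shrink quickly: the third and fourth overtime rates rest on 62 and 15 games respectively, so they are far noisier than the first one.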

Do we observe the same effects during the Playoffs?

Players, coaches and fans always state that the Playoffs, with their increased pressure and more physical play, are an entirely different animal compared to the regular season. But what about the end-of-game behavior just observed? Losing a game can end a season, so one would expect score differences of a few points to be extremely unstable.

The following animation suggests that the behavior is actually very similar to what we saw earlier for regular season games:

(link to animation focusing on fourth quarter)

What about the occurrence of overtimes? Again, Playoff numbers coincide with regular season games with 28 of 308 (8.3%) of games going to overtime. Sample sizes then get quite small, but it's a fun fact to see that we've had more playoff games end in triple overtime (2) than in double overtime (1).

So to summarize, not only will natural game dynamics make overtimes more likely to occur than one would naively expect, but overtimes are also quite likely to lead to subsequent overtimes. This is great news for the fans, and for the NBA's TV deals. Perhaps less so for teams that have another game the following day...