Monday, August 24, 2015

There's also a Wall in Game of Thrones' ratings

A few months back, I looked at the evolution of various metrics (viewership, Metacritic ratings, IMDB ratings) for the critically acclaimed TV show Game of Thrones, and also compared its overall ratings with those of Breaking Bad and The Wire, which also earn top ratings from critics and viewers. As I was doing this, I stumbled across the fact that the latest Game of Thrones episode (the analysis was done in the middle of the fifth season) had a rating of 10 on IMDB. I had never seen that before. Not a single Breaking Bad or The Wire episode had notched this rating, though Breaking Bad did manage to get two 9.8s. But when I checked back a few days later, the rating of 10 had fallen to 9.8, and it would continue to fall over the next few days.

I had done another post on the evolution of movie ratings and had also noticed that ratings are typically at their highest when the movie comes out and then drop in the following weeks. Does the same phenomenon apply to TV series, albeit on a much shorter time scale?

I pulled IMDB ratings for the final three episodes of the fifth season of Game of Thrones every 15 minutes. There are no spoilers in what follows, unless you consider ratings a spoiler of some kind. While ratings won't tell you what happens (or in Game of Thrones' case, who gets killed!), they might yield insight as to how intense an episode is (or in Game of Thrones' case, how many people get killed!). Consider yourselves warned...
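The collection loop can be sketched as follows; `fetch_rating` stands in for whatever scraping or API call actually retrieves the current rating (the post doesn't describe that part, so it's left as a hypothetical callable):

```python
import time
from datetime import datetime

def poll_ratings(fetch_rating, interval_s=15 * 60, n_polls=4):
    """Record (timestamp, rating) pairs every `interval_s` seconds.

    `fetch_rating` is a hypothetical callable returning the current
    IMDB rating of the episode being tracked.
    """
    samples = []
    for i in range(n_polls):
        samples.append((datetime.now(), fetch_rating()))
        if i < n_polls - 1:
            time.sleep(interval_s)
    return samples

# Example with a stub fetcher (interval set to 0 so it runs instantly):
fake_ratings = iter([10.0, 9.9, 9.8, 9.8])
history = poll_ratings(lambda: next(fake_ratings), interval_s=0)
print([r for _, r in history])  # → [10.0, 9.9, 9.8, 9.8]
```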

So here's the raw data:

And the total number of voters:

So here's what stands out:

  • all episodes go through a phase when their rating is 10; this occurs in the few hours before the episode airs (horizontal grey lines) and is typically based on only a few hundred voters;
  • ratings don't start at 10; that's only the peak value right before air time;
  • the rating drops very quickly as the episode airs and right after;
  • the value after a few hours is essentially the final stabilized value (maybe another 0.1 drop a few days/weeks later), much much faster than what we had observed for the evolution of movie ratings, which typically stabilized after a few months;
  • approximately 80% of voters voted during the first week after the episode aired;
  • there are bumps in the number of votes when the following episode airs: I'd guess it's either people who missed the previous episode doing a quick catch-up so they are up-to-date for that evening's new episode, or people who, realizing that a new episode is airing that evening, are reminded that they did not vote for the previous week's episode

Now the real question is: who are the few hundred people who rated the episode a 10? HBO employees trying to jumpstart the high ratings? Mega-fans with access to an early version of the episode? Mega-fans voting without having seen the episode, assuming that all episodes are worth a 10? But then who are the handful of people who voted less than 10 before that?

I myself have not watched the fifth season but am VERY intrigued by that eighth episode. As of today, its rating is still 9.9, the highest of the entire series, and higher than that of any Breaking Bad or The Wire episode...

Thursday, July 23, 2015

Love at first sight? Evolution of a movie's rating

Every day, over 3 million unique visitors go to IMDB, and I am very often one of them. With limited time to watch movies, I heavily rely on IMDB's ratings to determine whether a movie is a rental or a theatre go/no-go.

My memory is sufficiently bad that I sometimes need to check a new release's rating a few days apart, but sufficiently good that I can remember rating changes. The most common scenario consists of me checking a whole bunch of ratings for different movies, then trying to talk my wife into going to see the one I like best, using IMDB's rating as an extra argument. She'll systematically - and skeptically - ask for the movie's rating (she's also part of the daily 3 million). And I'll say it's really good, something like 8.3, here let me show you... and a 7.4 appears on my screen and I look like a fool.

Of course, it's completely expected that IMDB ratings will evolve, all the more so when a movie has recently been released and has few voters: I fully expect Charlie Chaplin's City Lights to still have an 8.6 rating a month from now, but I don't think Ted 2 will still have a 7.2 next month, or even by the time this post gets published! But the question I had was: how do movie ratings evolve? How long does it take them to reach their asymptotic value? Are movies over- or under-rated right around their release date? It would seem reasonable that people tend to overestimate new movies they just saw at the theatre. The bias comes from the fact that if they made the effort of going to see the movie shortly after its release, they were probably anticipating it would be worth their money and time. They might therefore overrate the movie after seeing it, independently of its quality, to remain coherent with their prior expectations (the 'coherency principle' in psychology).
An internet blogger by the name of Gary didn't really phrase it as such, taking the approach of insulting the people who bumped 'Up' to the 18th position of the best all-time movies on IMDB. A fun read.

To monitor rating evolution, I extracted daily IMDB data up until 2015 for 22 different movies released in 2012: Branded, Cloud Atlas, Cosmopolis, Dredd 3D, Ice Age: Continental Drift, Killing Them Softly, Lincoln, Paranormal Activity, Paranorman, Rec3: Genesis, Resident Evil Retribution, Rise of the Guardians, Savages, Skyfall, Sparkle, The Big Wedding, The Bourne Legacy, The Dark Knight Rises, The Expendables, The Words, Total Recall and Twilight: Breaking Dawn 2.

For each movie I recorded three main metrics of interest: the IMDB rating, the number of voters and the metascore (from Metacritic, which aggregates reviews to generate a single rating out of 100 for movies, TV series, music and video games).

Here's an example of the data plotted for The Dark Knight Rises:

Originally rated 9.2, it dropped to 8.8 in the first month, then dropped a little more to 8.6 after 6 months where it appears to have stabilized. I just checked and it seems to have dropped an additional 0.1 point, now at 8.5 three years after release. Let's now look at the number of people who voted for it:

The number of voters rapidly increased right after the release, and although it isn't increasing as fast afterwards, many people continue to vote for it. This curve is quite typical across all movies.

Finally, let's look at the metacritic score:

I'm not sure we can even talk about a curve here. Momentarily rated 85, the metascore dropped to 78 at release and hasn't changed since. This is perfectly normal, as the metascore is based on a small sample of official critics ('Hollywood Reporter', 'Los Angeles Times', 'USA Today'). Reviews are released around the same time the movie is; no critic is going to review The Dark Knight Rises today, which is why metascores are so stable.

Ignoring the y-axes, the shape of the curves are quite similar across movies, though there are some outliers worth showing.

Increasing IMDB rating? Most movies seem to get overrated at first and stabilize to a lower asymptotic rating. But in certain cases we see the rating increase after the release, as seen here with The Big Wedding with Robert De Niro and Diane Keaton:

Still with The Big Wedding, the staggered worldwide release dates clearly show up in the shape of the number-of-voters curve.

Based on our small sample, can we estimate by how much a movie's rating is overestimated when it is released? After how many months does the rating stabilize?

Combining the data for all the movies together and aligning them on their release date (thus ignoring any seasonal effects), we obtain the following graph, where the x-axis is weeks since release:

We see a steady decline in rating of about 0.6 points over a period of about 7 months. A more surprising phenomenon is the upward trend in ratings that starts about a year after the original release. The trend seems quite strong; however, we should keep in mind that our original sample of movies was small (22), and we only have data beyond 75 weeks after release for a handful of them, so the upward trend on the very right could be completely artificial and a great example of overfitted data!
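The pooling step described above can be sketched in a few lines: bucket every (movie, date, rating) sample by weeks since that movie's release, then average within each bucket. The data and field names below are illustrative, not the post's actual dataset.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

def align_by_release(samples):
    """samples: (movie, sample_date, release_date, rating) tuples.
    Returns {weeks since release: mean rating across movies}, rounded."""
    buckets = defaultdict(list)
    for movie, day, release, rating in samples:
        buckets[(day - release).days // 7].append(rating)
    return {week: round(mean(vals), 2) for week, vals in sorted(buckets.items())}

# Illustrative samples, not the post's dataset:
samples = [
    ("A", date(2012, 7, 25), date(2012, 7, 20), 9.2),
    ("A", date(2012, 9, 20), date(2012, 7, 20), 8.8),
    ("B", date(2012, 11, 12), date(2012, 11, 9), 7.4),
]
print(align_by_release(samples))  # → {0: 8.3, 8: 8.8}
```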

A similar break in trend occurs for the number of voters:

As for the metascore, it remains remarkably stable after release, for the reasons mentioned previously.

In a nutshell, movies do appear to be very slightly overestimated at release time (assuming the long-term asymptote is a movie's "true" rating), and the difference in rating (approximately 0.3 between the first month and months 2 through 6) was small yet significant (based on a paired t-test).
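The significance claim amounts to a paired t-test comparing each movie's first-month mean rating with its months-2-through-6 mean. A minimal from-scratch version (the post doesn't say which tool was used, and the numbers below are made up for illustration):

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(first_month, later_months):
    """Paired t statistic: per-movie mean rating in month 1 vs months 2-6."""
    diffs = [a - b for a, b in zip(first_month, later_months)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

# Made-up per-movie means, mimicking a ~0.3-point early-rating premium:
first = [8.8, 7.9, 7.2, 8.1, 6.9, 7.6]
later = [8.5, 7.7, 6.8, 7.8, 6.7, 7.2]
print(round(paired_t(first, later), 2))  # large positive t: early ratings higher
```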

So if you do use IMDB to help your movie selection, definitely keep in mind that while the movie is probably good, it most likely isn't as good as 'Up'.

Saturday, June 6, 2015

The "Overtime effect": Why things go crazy in the final seconds of regulation

Jeff Ely is an Economics Professor at Northwestern, and in 2009, he and one of his PhD students, Toomas Hinnosaar, wrote a blog post entitled "The Overtime Spike in NBA Basketball".

(Incidentally, it was after reading this post shortly after it had been published that I realized that very granular basketball data was publicly available and led me to generate so many basketball-related articles on this very blog).

As indicated by the title, Jeff and Toomas noticed that many more NBA basketball games ended in overtime than one would expect from treating both teams' final scores as independent random variables. That assumption does seem very flawed from the start anyway, as each team adapts to the other's playing style and to the general pace of the game. Except for blowout games (consider the recent 120-66 destruction of the Milwaukee Bucks by the Chicago Bulls in a Playoff game), there is a rather strong correlation between the points scored by the two teams:

But Jeff and Toomas went further than just highlighting the discrepancy between the expected share of games with overtime (~2%) and the actual share (~6%): they uncovered a surprising spike in score difference that emerges only seconds before the end of regulation.
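That ~2% expectation under independence can be sanity-checked with a quick Monte Carlo sketch (the normal score distribution and its parameters below are my own rough assumptions, not fit to actual data):

```python
import random

def tie_probability(n_games=200_000, mu=100, sigma=12, seed=0):
    """Estimate P(regulation ends tied) if both teams' scores were
    independent normals rounded to whole points."""
    rng = random.Random(seed)
    ties = sum(
        round(rng.gauss(mu, sigma)) == round(rng.gauss(mu, sigma))
        for _ in range(n_games)
    )
    return ties / n_games

print(f"{tie_probability():.1%}")  # roughly 2%, far below the ~6% observed
```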

I recently thought back about this analysis and wanted to revisit it, looking at the following questions:
  • Do we still observe the same phenomenon nowadays?
  • Do we observe the same effect towards the end of overtimes? One could argue that overtimes are quite likely to lead to more overtimes, given that whatever behavior emerged at the end of regulation will probably appear again at the end of overtime, but also that in only five minutes (versus 48), scores have much less time to diverge.
  • Do we observe the same effects during the Playoffs?

Jeff and Toomas' analysis used data from all games between 1997 and 2009; I pulled all subsequent years, from 2009 to 2015, separating regular-season and playoff games (it is not entirely clear whether the original analysis combined both types of games or focused on the regular season only). As in the original analysis, I defined score difference as the home team's score minus the road team's score, so a positive value can be interpreted as home-court advantage.

First off, here is the evolution of the mean and the standard deviation of the score differential throughout regulation for regular season games, followed by playoff games:

The curves are extremely similar, with the home-team advantage gradually increasing throughout the game, especially in the second half of playoff games. But the standard deviations are very large compared to the point differential. It is interesting to see the standard deviation increase at a decreasing rate and even shrink in the final minutes. This probably reflects games whose outcome is certain, where starters are pulled and the losing team is able to somewhat reduce the point differential. Given the standard deviation of the point differential, it now makes sense that overtimes are theoretically quite unlikely.

Do we still observe the same phenomenon nowadays?

Let us look at an animation of the score difference as the game progresses (regular season games only):


I generated a similar video taking a closer look at the last quarter at a finer level of granularity (6s increments instead of 30s).

Everything behaves as expected for the first 47 minutes of the 48-minute game. On a slightly more technical note, if we were to assume that the scores for each team at any given point in time are approximately normal and independent, then their difference would also be normal. This assumption does not seem to be violated for most of the game, except when it matters most, right at the end of regulation:

While the final graph is somewhat surprising at first glance, it makes a lot of sense to anyone who has seen a few close games on TV. In the middle of the game, a team losing by a handful of points is not going to freak out and radically change its strategy. Points come and go quickly in basketball; losing by two points heading into the third or even fourth quarter is clearly not synonymous with defeat. However, losing by two points with 10 seconds left is a whole different story. Defeat is in plain view. If you have possession of the ball, you need to score quickly and close the gap. If the other team has possession, things look gloomier. You can't let them run out the clock and need to get possession back. Teams do so by intentionally fouling, hoping the other team won't make all its free throws, so they can get the ball back. If the game is tied with only a few seconds left, teams won't panic and intentionally foul; one team might go for a buzzer-beater, but without taking unnecessary risks. In other words, the closing seconds of a game have the particularity that:
  • wide score differences are a stable equilibrium: the losing team has essentially thrown in the towel;
  • small score differences are highly unstable: the losing team will seek to reach a score difference of 0 (see next case) or gain the lead, in which case we remain in an unstable state with the roles reversed;
  • a score difference of 0 is a stable equilibrium stuck between two highly unstable states.
With this perspective, the distribution graph makes complete sense!

Do we observe the same effect towards the end of overtimes?

It wouldn't be too far-fetched to consider an overtime as a 5 minute version of a full game given that both teams start off with a tie. Here's the animation of the score difference over all overtimes (combining first, second, third... overtimes) in the regular season from 2009 to 2015:


So in a nutshell, we do indeed observe the same phenomenon, which makes perfect sense given that not only do we find ourselves in the same state of stable/unstable equilibrium in the last possessions of the game, but scores have also had less time (5 vs 48 minutes) to diverge.
But as divergence is less likely, is a second overtime more likely than a first overtime? What about a third overtime? Will scores diverge even less as players get tired, players foul out and the stakes are raised, leading to even more conservative play?

Here are the numbers of interest:
For the 5876 regular season games considered, 373 went to overtime (6.3%).
Out of the 373 games that went to a first overtime, 62 went to a second overtime (16.6%).
Out of the 62 games that went to a second overtime, 15 went to a third overtime (24.2%).
Only one game of those 15 (6.7%) eventually ended in quadruple overtime, with the Hawks outlasting the Jazz 139-133.
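These conditional percentages can be checked directly from the counts:

```python
counts = [5876, 373, 62, 15, 1]  # games reaching 0, 1, 2, 3, 4 overtimes

for stage, (total, nxt) in enumerate(zip(counts, counts[1:]), start=1):
    print(f"OT #{stage}: {nxt}/{total} = {100 * nxt / total:.1f}%")
# OT #1: 373/5876 = 6.3%
# OT #2: 62/373 = 16.6%
# OT #3: 15/62 = 24.2%
# OT #4: 1/15 = 6.7%
```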

Do we observe the same effects during the Playoffs?

Players, coaches and fans always say that the Playoffs, with their increased pressure and more physical play, are an entirely different animal from the regular season. But what about the end-of-game behavior we just observed? Losing a game can end a season, so one would expect score differences of a few points to be extremely unstable.

The following animation suggests that the behavior is actually very similar to what we saw earlier for regular season games:


(link to animation focusing on fourth quarter)

What about the occurrence of overtimes? Again, Playoff numbers coincide with regular-season games, with 28 of 308 games (8.3%) going to overtime. Sample sizes then get quite small, but it's a fun fact that we've had more playoff games end in triple overtime (2) than in double overtime (1).

So to summarize, not only do natural game dynamics make overtimes more likely to occur than one would naively expect, but overtimes are also quite likely to lead to subsequent overtimes. This is great news for the fans, and for the NBA's TV deals. Perhaps less so for teams that have another game the following day...

Wednesday, May 27, 2015

Game of Thrones Season 5, is the best behind us? [NO SPOILERS]

[This post does NOT contain any spoilers. The reason is very simple, I myself have not yet started the fifth season (OK this might be the only spoiler, there are at least 5 seasons...), so I am only focusing on number of viewers and ratings for each episode aired.]

I'm not even going to present the TV series Game of Thrones. While its merchandising is perhaps not as overwhelming and invasive as Frozen's, I doubt many of you have no idea what the show is about.

As mentioned previously, we are currently midway through the fifth season, and I figured it might be a good time to assess how good a TV show Game of Thrones is, and how it is performing from a viewership and ratings point of view.


The following graph displays the evolution of viewership for each episode of each season for the initial airing on HBO on Sundays at 9pm.

Two things clearly stand out. First, there is a strong, steady rise corresponding to the series' huge success. But, as with any rise, when does it stop? Well, after reaching an all-time high with the first episode of the current fifth season (8.0 million), viewership has dropped rather dramatically, with "only" 6.2 million for the sixth episode. This value is lower than that of every episode of the previous fourth season. Will the decrease continue, or did we just have a few below-expectation episodes? Each season does have a small mid-season dip, somewhat due to the way each season is constructed as a highly strategic chess game, with the different groups carrying out their tactics. The second half of the season should provide some element of an answer as to where we are headed.

To really put the importance of the decline in viewership over the last episodes into perspective, I forecasted viewership until the end of the fifth season, once using all the available data (including the decrease, in blue), and once ignoring the fifth season's data (in green). The difference between the forecasts is quite dramatic:

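The post doesn't say which forecasting method was used; as a minimal stand-in, here is a simple linear-trend extrapolation, the kind of approach that would produce two diverging forecasts depending on whether the season-5 decline is included (the viewership series below is illustrative):

```python
def linear_forecast(y, horizon):
    """Ordinary least-squares line through (0, y[0]), (1, y[1]), ...,
    extended `horizon` points beyond the observed data."""
    n = len(y)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(y) / n
    slope = (sum((x - x_mean) * (v - y_mean) for x, v in zip(xs, y))
             / sum((x - x_mean) ** 2 for x in xs))
    intercept = y_mean - slope * x_mean
    return [intercept + slope * (n + h) for h in range(horizon)]

# Toy viewership series (millions, illustrative only), forecast 3 episodes out:
viewers = [6.8, 7.1, 7.3, 8.0, 6.9, 6.5, 6.2]
print([round(v, 1) for v in linear_forecast(viewers, 3)])
```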
Metacritic ratings

Let's now turn to Metacritic ratings. According to the site, which has made a name for itself by aggregating a wide range of critics into a single value from 0 to 100, we observe a striking similarity between ratings and viewership:

Again, we observe a steady increase throughout the first four seasons, and the first decrease for the current fifth season.

IMDB ratings

Metacritic also contains user ratings; however, I preferred to turn to IMDB for individual episode ratings: sample sizes were larger and extraction was easier.

Here is the evolution of the ratings for each episode:

The values for the last few episodes should be taken with a grain of salt. Episode 7 of the fifth season had a rating of 10.0 before it aired, but has now reached a rating more on par with the rest of the series, 9.3.

The trend here is somewhat different from what viewership and metacritic were indicating: it appears that season 5 episodes are rated just as highly as those from previous seasons.

Best series ever?

As indicated previously, I noticed that the last episode of Game of Thrones was briefly rated 10.0 right before airing. This led me to wonder whether episodes had ever been rated a perfect score of 10. It turns out that four of them had, all Dragon Ball Z episodes!

Relaxing the threshold to ratings of 9.8 or above, I found 448 episodes (a little under half of them Dragon Ball Z!). In the mix: two Breaking Bad episodes (from the fifth and final season) and one Six Feet Under (the series finale, also season five). No Game of Thrones; its top-rated episode has a 9.7.

Looking at all these series and ratings, one can't help but wonder if there is some sort of agreement on the best show ever. There are naturally many ways and sources to compare them, but every time, Game of Thrones, Breaking Bad and The Wire seem to top the list:
  • best overall IMDB rating (secret IMDB formula): Breaking Bad and Game of Thrones are tied at 9.5, The Wire has 9.4
  • average IMDB rating of each individual episode: The Wire has an average episode rating of 8.7, Game of Thrones an average of 8.5 and Breaking Bad an average of 8.3
  • metacritic season-by-season rating (critics): The final season of Breaking Bad got a 99 (5th highest all-time), The Wire's seasons 3 and 4 each got a 98 (8th/9th highest all-time), and the best Game of Thrones season (the fourth) got a 95, which is 20th highest all-time
  • metacritic season-by-season rating (users): This is where Breaking Bad dominates: all five seasons are in the top 12 all-time (positions 1, 3, 4, 11 and 12); the best Game of Thrones season comes in at position 20 only, while The Wire does rather well with three seasons in the top 10.

So there is no clear overall winner, but Game of Thrones is the only one still running. For how long?

Friday, May 15, 2015

Consequence of Morey's Law: Lucky vs Unlucky teams

In a previous post, I looked at a 1994 paper by Daryl Morey (current Houston Rockets GM) who investigated how a team's winning percentage was related to the number of points they scored and allowed, deriving the "modified Pythagorean theorem":

expected win percentage =
  pts_scored ^ 13.91 / (pts_scored ^ 13.91 + pts_allowed ^ 13.91)
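In code, the formula and the projected-wins figure it implies over an 82-game season look like this (a direct transcription of the formula above):

```python
def expected_win_pct(pts_scored, pts_allowed, k=13.91):
    """Morey's modified Pythagorean expectation."""
    return pts_scored ** k / (pts_scored ** k + pts_allowed ** k)

def projected_wins(pts_scored, pts_allowed, games=82):
    return round(games * expected_win_pct(pts_scored, pts_allowed))

# E.g. a team scoring 98.0 and allowing 99.0 points per game:
print(projected_wins(98.0, 99.0))  # → 38
```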

At the end of his paper, Daryl explores the teams that had the biggest delta between their actual and predicted wins. In 1993-1994, the Chicago Bulls and Houston Rockets topped the list, and Daryl refers to them as lucky teams. But why is luck involved?

The rationale is that if two teams A and B have almost identical points scored and points allowed, we would expect them to have very similar win percentages. The only way to create a discrepancy (without changing points scored and points allowed... too much) is to change the outcome of the very close games. So for all the games team A won by a point, flip the scores so that it loses by 1, and conversely for team B, which now wins all the games it previously lost by 1. With this hypothetical construction, we have two teams with still very similar points scored and allowed, but potentially different records. Common sense suggests that in very close games each team's probability of winning is around 50%, so winning or losing amounts to "luck": whether a desperation buzzer-beater is made or bounces off the back of the rim. And so it would make sense that teams with large discrepancies between actual and predicted wins were either much better or much worse than 50% in close games. Let's confirm.

Here's the table of teams with a discrepancy greater than or equal to 6 between their actual and projected records, ranked by year:

Team Year Scored Allowed Wins (proj) Wins (actual) Win %
NJN 2000 98.0 99.0 38 31 37.8
DEN 2001 96.6 99.0 34 40 48.8
NJN 2003 95.4 90.1 56 49 59.8
CHA 2005 94.3 100.2 24 18 22.0
NJN 2005 91.4 92.9 36 42 51.2
IND 2006 93.9 92.0 47 41 50.0
TOR 2006 101.1 104.0 33 27 32.9
UTA 2006 92.4 95.0 33 41 50.0
BOS 2007 95.8 99.2 31 24 29.3
CHI 2007 98.8 93.8 55 49 59.8
DAL 2007 100.0 92.8 61 67 81.7
MIA 2007 94.6 95.5 38 44 53.7
SAS 2007 98.5 90.1 64 58 70.7
NJN 2008 95.8 100.9 27 34 41.5
TOR 2008 100.2 97.3 49 41 50.0
DAL 2010 102.0 99.3 49 55 67.1
GSW 2010 108.8 112.4 32 26 31.7
MIN 2011 101.1 107.7 24 17 20.7
PHI 2012 93.6 89.4 43 35 53.0
BRK 2014 98.5 99.5 38 44 53.7
MIN 2014 106.9 104.3 48 40 48.8

So how did these teams fare in close games? I labelled a team/year as High if it won 6 or more games more than expected (8 teams from the previous list), Low if it lost 6 or more games more than expected (13 teams from the previous list), and Normal otherwise. I then looked at each group's win percentage in closely contested games (final scores within 1, 2 and 3 points).

Final scores within 1 point:

Type # Wins # Games Win %
Normal 721 1439 50.1
Low 8 21 38.1
High 2 2 100.0

Final scores within 2 points:

Type # Wins # Games Win %
Normal 1760 3515 50.1
Low 20 52 38.5
High 10 13 76.9

Final scores within 3 points:

Type # Wins # Games Win %
Normal 2820 5617 50.2
Low 24 80 30.0
High 14 19 73.7
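The tabulation behind these three tables can be sketched as a simple filtered group-by (the field names and toy data are my own, not the actual game log):

```python
from collections import defaultdict

def close_game_win_pct(games, margin):
    """games: (team_type, won, final_margin) tuples, where team_type is
    'High', 'Low' or 'Normal' and final_margin is the absolute score gap.
    Returns win % per type among games decided by <= `margin` points."""
    wins, totals = defaultdict(int), defaultdict(int)
    for team_type, won, final_margin in games:
        if final_margin <= margin:
            totals[team_type] += 1
            wins[team_type] += won
    return {t: round(100 * wins[t] / totals[t], 1) for t in totals}

# Toy data (not the post's): one 'High' team going 2-0 in 1-point games.
games = [("High", 1, 1), ("High", 1, 1), ("Normal", 0, 2), ("Normal", 1, 3)]
print(close_game_win_pct(games, 1))  # → {'High': 100.0}
```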

Our intuition was correct, and so were Daryl's closing comments: teams can indeed be qualified as lucky or unlucky, some winning almost 3 out of 4 close match-ups, others losing 2 out of 3 tight games. This intangible "luck" factor is sufficient to explain why certain teams have much better or worse records than their offense/defense would typically lead to. It doesn't take much to flip the outcome of an entire game.

As a quick aside, much has been said about the San Antonio Spurs this year and their drop from a potential 2nd seed to 6th seed entering the Playoffs. Most articles focused on their loss on the final day of the regular season which led to that seeding free-fall, but was excessive focus placed on that last game? Had they been particularly lucky/unlucky during the season? It turns out their record is a couple games lower than what the modified Pythagorean theorem would have predicted, and that they weren't particularly lucky or unlucky in their close games, winning 2 of 5 games decided by 1 point, and 6 of 13 decided by 3 points or less.

Saturday, February 28, 2015

Morey's Law: How do points scored and points allowed tie to win percentage?

It all started in baseball, when Bill James found a very elegant formula linking a baseball team's winning percentage to the number of runs it scored and allowed:

expected win percentage =
  runs_scored ^ 2 / (runs_scored ^ 2 + runs_allowed ^ 2)

Because the variables are raised to the second power, the formula became known as the "Pythagorean expectation formula".

In 1994, Daryl Morey, one of the biggest proponents of analytics in basketball and now GM of the Houston Rockets, adapted the formula for basketball teams. The overall structure remains the same, but the power of 2 was replaced by 13.91. Here's an extract of Daryl's formula from STATS Basketball Scoreboard:

Essentially the same formula as for baseball but with 13.91 as the power:

expected win percentage =
  pts_scored ^ 13.91 / (pts_scored ^ 13.91 + pts_allowed ^ 13.91)

In this post, I wanted to explore this formula further and answer questions such as: How accurate is it? It was based on data up until 1993-1994; is it still accurate with today's data? Are there other, more accurate formulas out there?

To start off, I extracted all relevant statistics by team and by year for the past 15 complete seasons, going from 1999-2000 to 2013-2014.

Let's start by looking at how accurate Daryl's formula is across these last seasons:

Well, the formula still applies quite well, to say the least! Of course, the exact coefficient might be slightly off, so I took the more recent data and fit the same model. The fitted value for the exponent turned out to be 13.86. Despite all the rule changes over the past twenty-plus years (three free throws on three-point fouls, hand-checking, clear path...) and the fact that the early nineties are regarded as a completely different era of basketball from today (somewhat linked to the rule changes), the value is almost identical: less than a 0.4% difference!
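The refit can be reproduced with a one-parameter least-squares search; the post doesn't say how the fit was done, so here is a simple grid search over the exponent, checked on synthetic data:

```python
def expected_win_pct(scored, allowed, k):
    return scored ** k / (scored ** k + allowed ** k)

def fit_exponent(teams, ks):
    """teams: (pts_scored, pts_allowed, actual_win_pct) tuples.
    Returns the exponent in `ks` minimizing the residual sum of squares."""
    def rss(k):
        return sum((w - expected_win_pct(s, a, k)) ** 2 for s, a, w in teams)
    return min(ks, key=rss)

# Synthetic check: data generated with k = 13.91 should be recovered exactly.
truth = 13.91
teams = [(s, a, expected_win_pct(s, a, truth))
         for s in (92, 96, 100, 104) for a in (94, 98, 102)]
ks = [k / 100 for k in range(1000, 2001)]  # grid from 10.00 to 20.00
print(fit_exponent(teams, ks))  # → 13.91
```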

But back to the formula. It seems to perform remarkably well at fitting the data, but can we do better? There is room for additional flexibility: in Morey's formula, all three terms are raised to the same power. What if points scored and points allowed were raised to different powers?

expected win percentage =
  pts_scored ^ a / (pts_scored ^ a + pts_allowed ^ b)

Or if all three terms could have different powers?

expected win percentage =
  pts_scored ^ a / (pts_scored ^ c + pts_allowed ^ b)

When fitting these new, more flexible models, it turns out that the fitted coefficients remain very close to 14. Naturally, with the additional flexibility we observe a decrease in the residual sum of squares, but nothing extravagant either. We'll revisit this point later in the post.

But let's step back for a minute: what exactly does the exponent value correspond to? For points scored and allowed ranging from 90 to 110, I generated three charts displaying expected win percentage for exponent values of 2, 14 and 50.

We notice that the exponent controls how slowly or quickly the surface goes to 0 and 1 as the difference between points scored and allowed increases. When points scored is 100 and points allowed is 95, the win percentages are 52% (exponent of 2), 67% (exponent of 14) and 93% (exponent of 50).

But something else stands out in all the graphs: they appear invariant in one direction, along the main diagonal (going through the points (90, 90) and (110, 110)). In other terms, a team allowing 90 points and scoring 93 has an expected winning percentage only very slightly different from that of another team scoring 110 and allowing 107. What truly matters is the delta between points scored and allowed, not the absolute value of these two numbers.
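This near-invariance along the diagonal is easy to verify numerically with the formula itself: the same +3 differential is worth only slightly more at a slower pace than at a faster one.

```python
def expected_win_pct(scored, allowed, k=13.91):
    return scored ** k / (scored ** k + allowed ** k)

# Same +3 point differential at two different scoring levels:
slow = expected_win_pct(93, 90)
fast = expected_win_pct(110, 107)
print(round(slow, 3), round(fast, 3))  # both ~0.60, only slightly apart
```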

Let's do some very simple exploratory data analysis looking at actual win percentages against points scored and allowed:

As one would have expected there is definitely some correlation there - especially regarding points allowed (which could be an additional argument for promoting a strong defense over a strong offense, but that is for another time).

But things get really interesting when we look at the difference between points scored and allowed:

You don't often come across a correlation of 0.97 just by plotting two random variables from your data set against each other! It looks like someone cut a rectangular stencil out of cardboard, placed it over an empty plot, and asked their 3-year-old to go polka-dot-crazy within the rectangular region. Can this strong relationship be leveraged for an alternative to Morey's formula?

A simple linear model begs to be fit, but would it also make sense to add a quadratic or cubic term? A quadratic term (or any even power, for that matter) does not seem reasonable: deltas of 5 and -5 suggest VERY different types of performances, so only odd powers should be considered. Here's the plot with the fits from a single linear term and with an additional cubic term:

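The odd-power fit reduces to ordinary least squares on the terms d and d^3. A from-scratch sketch, with the simplifying assumption (mine, not necessarily the post's) that the fit is forced through 50% at d = 0:

```python
def fit_cubic(diffs, win_pcts):
    """Least squares for win% = 0.5 + a*d + b*d^3 (d = point differential).
    Forcing 50% at d = 0 is a simplifying choice for this sketch."""
    z = [w - 0.5 for w in win_pcts]
    s11 = sum(d ** 2 for d in diffs)
    s12 = sum(d ** 4 for d in diffs)
    s22 = sum(d ** 6 for d in diffs)
    t1 = sum(zi * d for zi, d in zip(z, diffs))
    t2 = sum(zi * d ** 3 for zi, d in zip(z, diffs))
    det = s11 * s22 - s12 * s12  # solve the 2x2 normal equations
    return (t1 * s22 - t2 * s12) / det, (t2 * s11 - t1 * s12) / det

# Synthetic check: data generated with a = 0.03, b = -0.0002 is recovered.
diffs = [-8, -5, -3, -1, 0, 1, 3, 5, 8]
wins = [0.5 + 0.03 * d - 0.0002 * d ** 3 for d in diffs]
a, b = fit_cubic(diffs, wins)
print(round(a, 4), round(b, 6))  # → 0.03 -0.0002
```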
We've now come to the point where we have five models (the three "pythagorean" with various degrees of flexibility which I'll refer to as the single/double/triple power models based on how many coefficients are fit) and two linear models (with and without the cubic term). Can one be established as being significantly superior to the others? Will Morey's formula hold?

Of course, the easiest way to compare them would be to look at the fits and compare residual sums of squares, but this will always favor the more complex models and lead to the overfitting problems we constantly hear about. So how do we go about it? Simply the way overfitting is dealt with in the abundant literature: cross-validation. The data is randomly split into training and testing datasets; the model is built on the training data but evaluated on test data it has never seen.
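The procedure can be sketched generically: repeatedly split the team-seasons, fit each candidate model on the training part, and score it on the held-out part. The model interface and toy data below are illustrative:

```python
import random

def cross_validate(data, fit, predict, n_splits=20, test_frac=0.3, seed=0):
    """Mean held-out residual sum of squares over random train/test splits.

    fit(train) -> params; predict(params, x) -> prediction.
    Each data point is an (x, y) pair.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_splits):
        shuffled = data[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * test_frac)
        test, train = shuffled[:cut], shuffled[cut:]
        params = fit(train)
        scores.append(sum((y - predict(params, x)) ** 2 for x, y in test))
    return sum(scores) / n_splits

# Toy model in the spirit of the linear fit: win% = 0.5 + slope * differential.
data = [(d, 0.5 + 0.028 * d) for d in range(-10, 11)]
fit = lambda train: (sum(x * (y - 0.5) for x, y in train)
                     / sum(x * x for x, y in train))
predict = lambda slope, x: 0.5 + slope * x
rss = cross_validate(data, fit, predict)
print(rss)  # ~0: the noise-free toy data is fit perfectly
```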

And the results are in!

Based on my random splits, it seems that while all models perform very similarly, Morey's formula (single power) has a slight advantage. It didn't achieve the minimal RSS, and even yielded the maximal one on some splits, but its median RSS was lower than that of all the other models, though not significantly.

So after all this work, we weren't able to come up with a more reliable and robust way of computing expected win percentages than a formula that is over 20 years old!

In the next post we'll dig a little deeper into the data and try to understand the largest discrepancies. What did Daryl Morey mean, in his original paper, when he referred to the 1993-1994 Chicago Bulls as a lucky team?

Monday, February 23, 2015


All NBA fans know about Shaqtin-a-fool.
Once a week, Shaquille O'Neal hosts this small segment on the NBA on TNT show. Five humorous video clips are shown, with players definitely not at their best. Erratic passes, obvious travels, missed wide-open dunks and layups, lost shoes...
The segment is also available online, and fans can vote for the best Shaqtin-a-fool moment.

For volume 4, episode 11 (they're referenced just like a TV series, by season and episode), and like over 50% of the voters, I had voted for the last video clip shown, which was that week's clear winner. A weird sensation I had been carrying from week to week suddenly materialized: it seemed to me that the last video clip was winning a disproportionate number of times.

Two explanations came to mind: either the video clips were not shown randomly in Shaq's segment but sorted according to users' expected preferences, or the human mind is biased by its short-term memory, not quite remembering the first clips and finding the last ones disproportionately funnier.

It was all the more obvious for this episode 11, where the poll results were in the exact reverse of the order the clips were shown in:

But before investigating the human brain and mind too deeply, I first had to see if my brain wasn't the one tricking me, and sought statistical confirmation that there was indeed a bias favoring the last video shown.

First things first, data was required. Unable to run a script to automatically pull the survey results, I manually went through the last 28 episodes (including some special episodes for the All Star Game, the Playoffs and past eras), noting for each video the order it was shown in ("Input Order") and its position in the survey results ("Output Ranking").

A quick first visual exploration of the data, linking Input Order to Output Ranking:

I added some jitter to prevent the lines from overlaying each other and hiding the number of observations. It did seem that the majority of the lines lay along the steepest diagonal, indicating that the most common "transition" was a video shown in 5th position coming out first in the survey results. At least I wasn't imagining the whole thing!

Because a diagonal line is longer than a horizontal one, there could still be an optical illusion: we might simply be seeing more color from longer lines, not more lines. So I re-generated the same graph with the order of the inputs reversed, so that the last video shown is now labelled 1 and the first video shown is labelled 5.

No visual trick here, definitely looks like the last video shown is the most likely to win the poll (horizontal lines going from 1 to 1).

Now for the statistical confirmation. The best-suited test here is a chi-square test, comparing observed counts with the expected counts under the null hypothesis that video order doesn't matter and every video is equally likely to end up in any position.

The first test I ran looked at the full data and all the Input Order - Output Ranking counts:

Output: 1 Output: 2 Output: 3 Output: 4 Output: 5
Input: 1 3 3 13 6 3
Input: 2 1 2 3 10 12
Input: 3 4 5 6 9 4
Input: 4 3 8 5 3 9
Input: 5 17 10 1 0 0

The chi-square strongly rejected the null hypothesis: input order and output ranking were strongly linked.
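This first test can be reproduced in a few lines with `scipy.stats.chi2_contingency`, using the counts from the table above:

```python
from scipy.stats import chi2_contingency

# Observed counts from the post: rows = input order, cols = output ranking.
observed = [
    [ 3,  3, 13,  6,  3],
    [ 1,  2,  3, 10, 12],
    [ 4,  5,  6,  9,  4],
    [ 3,  8,  5,  3,  9],
    [17, 10,  1,  0,  0],
]

# Null hypothesis: input order and output ranking are independent.
chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p:.2e}")
```

With 28 episodes of 5 clips each, every expected cell count is 140/25 = 5.6, comfortably above the usual rule of thumb of 5, so the test is valid despite the zero cells.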

The second test focused uniquely on the winner of the poll. In which position was the winner shown?

The table below summarizes the data:

Input: 1 Input: 2 Input: 3 Input: 4 Input: 5
Count 3 1 4 3 17

That's right: in roughly 60% of cases the survey winner was shown in last position! It's clear from the data that not all positions are created equal, and a second chi-square test confirmed this.
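The corresponding one-way test, sketched with `scipy.stats.chisquare` on the winner counts from the table above:

```python
from scipy.stats import chisquare

# Position in which the poll winner was shown, across the 28 episodes.
winner_counts = [3, 1, 4, 3, 17]  # input positions 1 through 5

# Null hypothesis: the winner is equally likely to have been shown in any
# position (expected count 28/5 = 5.6 per cell, the scipy default).
stat, p = chisquare(winner_counts)
print(f"chi2 = {stat:.2f}, p = {p:.2e}")
```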

So back to Shaq. Now that we've confirmed that there is a strong bias, can we try explaining the phenomenon?

My first idea (perhaps from having spent too much time doing analyses for marketing teams!) was that the videos were not shown randomly but pre-sorted according to expected viewer preference. It's a possibility, but a rather weak one. What would be the rationale? To keep people hooked as the clips get funnier and funnier? Sure, but recall that the whole Shaqtin-a-fool segment lasts 2-3 minutes tops; I'm not sure viewers really need to get hooked. Plus, until they've seen the last video, the audience has no way of knowing whether the best clips have already been shown.

So I'm actually leaning towards an unconscious bias. I think the same phenomenon would occur if you were asked to rank your best vacations. There might be some clearly great ones (the honeymoon) and some clearly bad ones (lost wallet, lost passport, got sick), but among equally enjoyable vacations, I believe the brain is tempted to rank the most recent one higher. A modality effect has been documented to describe the improved recall of the last elements of a list, whether these are presented visually or aurally. I'd be willing to bet something similar is at play here.

However, even if the survey results are much more predictable now, I'm still going to continue watching Shaqtin-a-fool religiously. For pleasure... and more data.