Friday, September 13, 2013

Tuesday Oct 29 2013: Bulls at Miami, who will win?

The 2013-2014 NBA season kicks off Tuesday, October 29th, and the spotlight will be on the return of two injured megastars: the Bulls' Derrick Rose plays in Miami, and Kobe Bryant's Lakers host the Clippers.

On this blog we love drama, we love high intensity games, but we also like tough questions. Such as: who will win those two games? And while we're at it, who will come out victorious in the other 1228 games of the season?

In this post I will present my efforts to predict the outcome of a game based on metrics related to the two opposing teams.

The data

The raw data consisted of all regular-season NBA games (no play-offs, no pre-season) since the 1999-2000 season. That’s right, we’re talking about 16774 games here. For each game I pulled information about the home team, the road team and who won the game.

The metrics

After pulling the raw data, the next step was to create all the metrics related to the home and road teams' performance up until the game whose outcome I want to predict. Due to the important restructuring that can occur in a team over the off-season, each new season starts from scratch and no results are carried over from one season to the next.

Simple Metrics
The simple metrics I pulled were essentially the home, road and total victory percentages for both the home team and the road team. Say the Dallas Mavericks, who have won 17 of their 25 home games and 12 of their 24 road games visit the Phoenix Suns who have won 12 of their 24 home games and 13 of their 24 road games, I would compute the following metrics:

  • Dallas home win probability: 17 / 25 ~ 68.0%
  • Dallas road win probability: 12 / 24 ~ 50.0%
  • Dallas total win probability: 29 / 49 ~ 59.2%
  • Phoenix home win probability: 12 / 24 ~ 50.0%
  • Phoenix road win probability: 13 / 24 ~ 54.2%
  • Phoenix total win probability: 25 / 48 ~ 52.1%
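
As a rough illustration, the six percentages above can be reproduced with a few lines of Python. The function and variable names are my own placeholders for this sketch, not the code actually used for the analysis.

```python
def win_pct(wins, games):
    """Share of games won, or None when no game has been played yet."""
    return wins / games if games else None


def simple_metrics(home_wins, home_games, road_wins, road_games):
    """Home, road and total win percentages for one team (hypothetical helper)."""
    return {
        "home": win_pct(home_wins, home_games),
        "road": win_pct(road_wins, road_games),
        "total": win_pct(home_wins + road_wins, home_games + road_games),
    }


# Numbers taken from the Dallas at Phoenix example above
dallas = simple_metrics(17, 25, 12, 24)
phoenix = simple_metrics(12, 24, 13, 24)

for name, metrics in (("Dallas", dallas), ("Phoenix", phoenix)):
    print(name, {key: f"{value:.1%}" for key, value in metrics.items()})
```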

Discounted simple metrics
However, these statistics seemed a little too simplistic. A lot can happen in the course of a season. Stars can get injured or return from a long injury. A new team might struggle at first to play with each other before really hitting their stride. So I included some new metrics which have some time discounting. A win early in the season shouldn’t weigh as heavily as one in the previous game. In the non-discounted world we kept track, for home games and road games separately, of the number of wins and number of losses, incrementing one or the other by 1 depending on whether the team won or lost. We do exactly the same here with a discount factor:

new_winning_performance = discount_factor * old_winning_performance + new_game_result

new_game_result is 1 if they won the game, 0 if they lost.

When setting the discount factor to 1 (no discounting), we are actually counting the number of wins and are back in the simple metrics framework.

To view the impact of discounting, let us walk through an example:
Let’s assume a team won its first 3 games then lost the following 3, and let us apply a discount factor of 0.9.

  • After winning the first game, the team’s performance is 1 (0 * 0.9 + 1)
  • After winning the second game, the team’s performance is 1.9 (1 * 0.9 + 1)
  • After winning the third game, the team’s performance is 2.71 (1.9 * 0.9 + 1)
  • After losing the fourth game, the team’s performance is 2.44 (2.71 * 0.9 + 0)
  • After losing the fifth game, the team’s performance is 2.20 (2.44 * 0.9 + 0)
  • After losing the sixth game, the team’s performance is 1.98 (2.20 * 0.9 + 0)

But now consider a team who lost their first three games before winning the next three:

  • After losing the first game, the team’s performance is 0 (0 * 0.9 + 0)
  • After losing the second game, the team’s performance is 0 (0 * 0.9 + 0)
  • After losing the third game, the team’s performance is 0 (0 * 0.9 + 0)
  • After winning the fourth game, the team’s performance is 1 (0 * 0.9 + 1)
  • After winning the fifth game, the team’s performance is 1.9 (1 * 0.9 + 1)
  • After winning the sixth game, the team’s performance is 2.71 (1.9 * 0.9 + 1)

Although both teams are 3-3, the sequence of wins/losses now matters. A team might start winning more if a big star is returning after an injury, or due to a coach change or some other reason, so our metrics should reflect important trend changes such as those. Unsure of what the discounting factor should be, I computed the metrics for various values.
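
The two walk-throughs above can be replayed with a short Python sketch of the update rule. It only tracks the winning performance (the losing performance would be tracked the same way, adding 1 on a loss), and the function name is a placeholder of mine.

```python
def discounted_performance(results, discount_factor=0.9):
    """Apply new = discount_factor * old + result after each game.

    `results` holds 1 for a win and 0 for a loss, oldest game first.
    Returns the winning performance after every game.
    """
    performance = 0.0
    history = []
    for won in results:
        performance = discount_factor * performance + (1 if won else 0)
        history.append(performance)
    return history


early_wins = discounted_performance([1, 1, 1, 0, 0, 0])
late_wins = discounted_performance([0, 0, 0, 1, 1, 1])

print([round(p, 2) for p in early_wins])
print([round(p, 2) for p in late_wins])
```

With discount_factor=1 the same function simply counts wins, which is the simple-metrics case.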

Discounted opponent-adjusted metrics
A third type of metric I explored was one where the strength of the opponent is incorporated to compute a team’s current performance. In the above calculations, a win was counted as 1 and a loss as 0, no matter the opponent. But why should that be? Just like in chess with the Elo rating system, couldn’t we give more credit to a team beating a really tough opponent (like the Bulls snapping the Heat’s streak of 27 consecutive wins), and be harsher when losing to a really weak team?
These new metrics were computed the same way as previously (with a discounting factor) but using the opponent’s performance instead of 0/1.

Let’s look at an example. The Thunder are playing at home against the Kings. The Thunder (pretty good team) have a current home win performance of 5.7 and a current home loss performance of 2.3. This leads to a “home win percentage” of 5.7 / (5.7 + 2.3) = 71%. The Kings (pretty bad team) have a current road win performance of 1.9 and a current road loss performance of 6.1. This leads to a “road win percentage” of 1.9 / (1.9 + 6.1) = 24%.

If the Thunder win:

  • The Thunder’s home win performance is now: 0.9 * 5.7 + 0.24 = 5.37
  • The Kings’ road loss performance is now: 0.9 * 6.1 + (1 - 0.71) = 5.78

If the Thunder lose:

  • The Thunder’s home loss performance is now: 0.9 * 2.3 + (1 - 0.24) = 2.83
  • The Kings’ road win performance is now: 0.9 * 1.9 + 0.71 = 2.42
If you win, you get credit based off of your opponent’s win percentage. If you lose, you get penalized according to your opponent’s losing percentage (hence the 1- in the above formulas). The worst teams hurt you most if you lose to them.
As seen from the example between a very good team and a very bad one, winning does not guarantee that your win performance will increase and losing does not guarantee your losing performance will increase. The purpose is not to have a strictly increasing function if you win, it is to get, at a given point in time, an up-to-date indicator of a team’s home and road performance.
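
Here is a minimal Python sketch of that update, replaying the Thunder/Kings numbers. The dictionary layout and function names are assumptions made for the illustration, and the sketch only computes the two values that change in the example (the winner's win performance and the loser's loss performance).

```python
def win_share(win_perf, loss_perf):
    """Win 'percentage' derived from the two discounted performances."""
    return win_perf / (win_perf + loss_perf)


def update_after_game(winner, loser, discount_factor=0.9):
    """Return (winner's new win performance, loser's new loss performance).

    `winner` and `loser` are dicts with 'win' and 'loss' performances for
    the relevant venue (home or road). Hypothetical layout for this sketch.
    """
    winner_pct = win_share(winner["win"], winner["loss"])
    loser_pct = win_share(loser["win"], loser["loss"])
    # Credit for a win is the beaten opponent's win percentage
    new_winner_win = discount_factor * winner["win"] + loser_pct
    # Penalty for a loss is the victorious opponent's losing percentage
    new_loser_loss = discount_factor * loser["loss"] + (1 - winner_pct)
    return new_winner_win, new_loser_loss


thunder_home = {"win": 5.7, "loss": 2.3}
kings_road = {"win": 1.9, "loss": 6.1}

thunder_win, kings_loss = update_after_game(thunder_home, kings_road)
kings_win, thunder_loss = update_after_game(kings_road, thunder_home)
print(round(thunder_win, 2), round(kings_loss, 2))
print(round(kings_win, 2), round(thunder_loss, 2))
```

The sketch uses unrounded percentages (0.7125 and 0.2375) where the example rounds to 0.71 and 0.24, but the results agree to two decimals.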

The models

For all the models, the training data was all games except those of the most recent 2012-2013 season, which we will use to benchmark our models.

Very simple models were first used: how about always picking the home team to win? Or the team with the best record? Or compare the home team’s home percentage to the road team’s road percentage?
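
These three baseline rules are simple enough to write down directly. The sketch below is only illustrative: the dictionary keys are names I made up, and sending ties to the home team is my own assumption.

```python
def pick_home_team(game):
    """Always pick the home team."""
    return "home"


def pick_best_record(game):
    """Pick the team with the better overall win percentage."""
    return "home" if game["home_total_pct"] >= game["road_total_pct"] else "road"


def pick_home_vs_road(game):
    """Compare the home team's home percentage to the road team's road percentage."""
    return "home" if game["home_home_pct"] >= game["road_road_pct"] else "road"


def accuracy(rule, games):
    """Share of games where the rule's pick matches the recorded winner."""
    return sum(rule(g) == g["winner"] for g in games) / len(games)


# Dallas visiting Phoenix, using the records from the simple metrics example
dallas_at_phoenix = {
    "home_total_pct": 25 / 48,  # Phoenix overall
    "road_total_pct": 29 / 49,  # Dallas overall
    "home_home_pct": 12 / 24,   # Phoenix at home
    "road_road_pct": 12 / 24,   # Dallas on the road
}

print(pick_home_team(dallas_at_phoenix))
print(pick_best_record(dallas_at_phoenix))
print(pick_home_vs_road(dallas_at_phoenix))
```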

I then looked into logistic models in an attempt to link all the above-mentioned metrics to our outcome of interest: “did the home team win the game?”. Logistic models are commonly used to look at binomial 0/1 outcomes.

I then looked into machine learning methods, starting with Classification and regression trees (CART). Without going into the details, a decision tree will try to link the regressor variables to the outcome variable by a succession of if/else statements. For instance, I might have tracked over a two week vacation period whether my children decided to play outside or not in the afternoon, and also kept note of the weather conditions (sky, temperature, humidity,...). The resulting tree might look something like:

If it rains without wind tomorrow, I would therefore expect them to be outside again!

Finally, I also used random forest models. For those not familiar with random forests, the statistical joke behind it is that it is simply composed of a whole bunch of decision trees. Multiple trees like the one above are “grown”. To make a prediction for a set of values (rain, no wind), I would look at the outcome predicted by each tree (play, play, no play, play….) and pick the most frequent prediction.
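
The voting step can be sketched in a few lines of Python. The four "trees" below are hand-written toy rules that I invented for the play-outside example, not fitted models; in a real random forest each tree is grown on a resampled version of the training data.

```python
from collections import Counter


def forest_predict(trees, observation):
    """Return the most frequent prediction among the trees."""
    votes = [tree(observation) for tree in trees]
    return Counter(votes).most_common(1)[0][0]


# Toy stand-ins for fitted trees (purely hypothetical rules)
toy_trees = [
    lambda obs: "play" if not obs["wind"] else "no play",
    lambda obs: "play",
    lambda obs: "no play" if obs["sky"] == "rain" else "play",
    lambda obs: "play" if obs["sky"] != "sunny" or not obs["wind"] else "no play",
]

tomorrow = {"sky": "rain", "wind": False}
print(forest_predict(toy_trees, tomorrow))
```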

The results

As mentioned previously, the models established were then put to the test on the 2012-2013 NBA season.

As most of the models outlined above use metrics that require some historical data, I can’t predict the first game of the season, not having any observations of past games for the two teams (yes, I do realize that the title of the post was a little misleading :-) ). I only included games for which I had at least 10 observations of home games for the home team and 10 road games for the road team.

Home team
Let’s start with the very naive approach of always going for the home team. If you had used this approach in 2013, you would have correctly guessed 60.9% of all games.

Best overall percentage
How about going with the team with the best absolute record? We get a bump to 66.3% of all games correctly predicted in 2013.

Home percentage VS Road percentage
How about comparing the home team’s home percentage with the road team’s road percentage? 65.3% of games are correctly guessed this way. Quite surprisingly, this method provides a worse result than simply comparing overall records.

Logistic regressions
Many different models were tested here but I won’t detail each. Correct guesses range from 61.0% to 66.6%. The best performing model was the simplest one, which only included the intercept and the overall winning percentages of the home team and the road team. The inclusion of the intercept explains the minor improvement over the 66.3% observed in the “Best overall percentage” section.
Very surprisingly, deriving sophisticated metrics discounting old performances in order to get more accurate readings on a team’s performance did not prove to be predictive.

Decision tree
Results with a decision tree were in the neighborhood of the logistic regression models, with a value of 65.8%.

Interestingly, the final model only looks at two variables and splits: whether the home team's home performance percentage (adjusted with a discounting factor of 0.5) is greater than 0.5 or not, and whether the road team's performance incorporating opponent strength (discounting factor of 1, so no discounting actually) is greater than 0.25 or not.

Random Forests
All our hopes reside in the Random Forests to obtain a significant improvement in predictive power. Unfortunately we obtain a 66.2% value, right where we were with the decision tree, the regression model and, more shamefully, the model comparing the two teams' overall Win-Loss records!
When looking at which variables were most important, home team and road team overall percentages came up in the first two positions.


The results are slightly disappointing from a statistical point of view. From the simplest to the most advanced techniques, none are able to break the 70% threshold of correct predictions. I will want to revisit the analyses and try to break that bar. It is rather surprising to see how much overall standings matter compared to metrics that are more time sensitive. The idea that the first few games of the season matter to predict the last few games of the season is a great insight. This can be interpreted by the fact that good teams will eventually lose a few games in a row, even against bad teams, but that should not be taken too seriously. The same goes for bad teams winning a few games against good teams: they remain bad teams intrinsically.

From an NBA fan perspective, the results are beautiful. It shows you why this game is so addictive and generates so much emotion and tension. Even when great teams face terrible opponents, no win is guaranteed. Upsets are extremely common and every game can potentially become one for the ages!

Tuesday, August 20, 2013

Boston: City of Champions

This is what I saw the other day at the Boston Airport (minus a hundred other people taking their shoes off and laptops out of carry-ons):

All the championships won by a Boston team (Celtics, Bruins, Patriots, Red Sox) have their banner hanging from the ceiling right before security screening. My daughter noticed the banners too and asked if they were ordered by color. I replied that they were actually ordered by date and when the championship was won.

But taking a second look at the ordered banners showed that her interpretation was not very far off the mark.

5 red banners, 3 yellow, 11 green, 2 yellow, 5 green, 3 blue, 2 red, 1 green, 1 yellow... The same colors seem to be close to each other (with some slight variation depending on the ordering of the third Patriots Super Bowl championship and a Red Sox title). So the natural question is whether any conclusions can be drawn from the fact that the colors seem to be grouped together.

It could very well be that this is all purely coincidental: with only four teams capable of winning championships, you would expect at some point the same team to win two titles without one of the other three winning in between. But would the groupings be so obvious? An alternative hypothesis would be that every so often, one of the teams will dominate its sport and for a certain period of time will win way more than the other three Boston teams. When a team wins in a given year, it has a much greater probability of winning the next year than in a random year. So color clusters are actually proxies for team dominance during a certain era.

First approach

To determine which of the two explanations is most likely, I ran a few simulations. By a few I mean a million. I considered 17 Celtics championships, 7 Red Sox championships, 6 Bruins championships and 3 Patriots championships, then randomly sampled without replacement. Our measurement of clustering is simply the number of clusters observed. The minimum number is 4 when all teams win all their titles without interruption (17 green, 7 red, 6 yellow and 3 blue, or 6 yellow, 3 blue, 17 green and 7 red, or...). The max is 33 when teams keep interrupting each other:

In Boston history we have 9 clusters (that is, 8 transitions from one team to another), which is definitely on the low side. The histogram for a million simulations yields:

The average and median number of clusters was greater than 22. Not only is our observation of 9 (red vertical line) much lower, but not a single one of our 1000000 simulations yielded a value less than 10! Talk about p-value!
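
For readers who want to replay the experiment, here is a minimal Python sketch of the shuffling approach described above. The names are placeholders of mine, and the banner sequence is typed in from the color counts listed earlier (red = Red Sox, yellow = Bruins, green = Celtics, blue = Patriots).

```python
import random
from itertools import groupby

TITLES = {"Celtics": 17, "Red Sox": 7, "Bruins": 6, "Patriots": 3}


def count_clusters(sequence):
    """Number of runs of identical consecutive items."""
    return sum(1 for _ in groupby(sequence))


def simulate_cluster_counts(n_sims, seed=0):
    """Shuffle the 33 banners n_sims times and count clusters each time."""
    rng = random.Random(seed)
    banners = [team for team, n in TITLES.items() for _ in range(n)]
    counts = []
    for _ in range(n_sims):
        rng.shuffle(banners)
        counts.append(count_clusters(banners))
    return counts


# Observed order at the airport: 5 red, 3 yellow, 11 green, 2 yellow, ...
observed = (["Red Sox"] * 5 + ["Bruins"] * 3 + ["Celtics"] * 11
            + ["Bruins"] * 2 + ["Celtics"] * 5 + ["Patriots"] * 3
            + ["Red Sox"] * 2 + ["Celtics"] + ["Bruins"])

counts = simulate_cluster_counts(10_000)  # use 1_000_000 to match the post
share = sum(c <= count_clusters(observed) for c in counts) / len(counts)
print(count_clusters(observed), share)
```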

Second Approach

We definitely simplified the true dynamics on championship winning by considering our urn of championships from which we picked. A second approach would be to look year by year at the probability of a team winning a championship.

Based on our observations of the last 114 years (assuming a start date in 1900) the Celtics won 17 times, the Red Sox 7 times, the Bruins 6 times and the Patriots 3 times. We can therefore use these empirical probabilities to generate other what-if scenarios where each year the probability of any team winning is independent of the past (no team dominance).
With this simple approach I assume that only a single team can win in any given year. Starting in 1900, I flip a very biased five-sided coin where Celtics come up with probability 17/114, Red Sox 7/114, Bruins 6/114, Patriots 3/114 and Nobody with 81/114. I then look again at how many clusters are obtained.

In the first approach we sampled without replacement from an urn with 33 championships. However in our new approach we could get no championship at all every single year or a championship every single year! We can't just compare the number of clusters, but should instead look at cluster average: # transitions / # championships. In real life the ratio is 8/33  ~ 0.2424. In our new 1000000 sample we were able to generate a few cases that out-performed real life where 8 transitions were also observed but with 34 and even 36 championships. Some very low values of averages were observed when the number of championships won was almost half (4 transitions and 19 championships, this was huge Celtics dominance!). That being said, only 39 out of 1000000 simulations yielded an equivalent or better ratio. Here's a plot of the distribution of cluster-to-championship ratio, with the red line indicating what we observed in Boston:

So no matter how simple our approaches, we have definitely put forward some evidence of sports dominance by Boston's teams (and luckily for our analysis, Boston's teams did not dominate at the same time too often, or that would have created some high frequency alternating between the two teams, thus breaking our clustering metric).
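
The year-by-year version can be sketched in the same spirit. Again this is only an illustration with my own function names; it draws one outcome per year with the empirical probabilities given above and computes the transitions-to-championships ratio.

```python
import random
from itertools import groupby

YEARS = 114
OUTCOMES = ["Celtics", "Red Sox", "Bruins", "Patriots", None]  # None = nobody
WEIGHTS = [17, 7, 6, 3, 81]


def simulate_history(rng):
    """One simulated history: the champion of each title-winning year."""
    draws = rng.choices(OUTCOMES, weights=WEIGHTS, k=YEARS)
    return [team for team in draws if team is not None]


def transition_ratio(champions):
    """# transitions / # championships, or None without any championship."""
    if not champions:
        return None
    clusters = sum(1 for _ in groupby(champions))
    return (clusters - 1) / len(champions)


def share_at_or_below(observed_ratio, n_sims, seed=0):
    """Share of simulated histories with a ratio no larger than observed."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        ratio = transition_ratio(simulate_history(rng))
        if ratio is not None and ratio <= observed_ratio:
            hits += 1
    return hits / n_sims


print(share_at_or_below(8 / 33, 10_000))  # use 1_000_000 to match the post
```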

I don't believe my three-year-old grasped all the subtleties involved here, but I definitely hope all those bright colors will get her interested in stats!

Monday, July 8, 2013

Visualizing the quality of a TV series

I love TV series. Like many people, my first addiction was Friends. Only 20 minutes, 4 jokes-a-minute, it was easy to fit in your schedule.

Then came the TV series revolution, with more sophisticated scripts, each episode becoming less of an independent unit and more one of the many building blocks in the narration of ever more complex plots. Skipping episodes was no longer an option, even more so given the cliffhanger of the last episode.
The first such series I remember were 24 and Lost.

I'm not going to list all my favorite shows, nor try to compare them or do any advanced analyses. I just wanted to share some visuals I created allowing quick summarization of the quality of a series, episode after episode, season after season.

Here is the visual for Friends:
The blue line displays the IMDB rating of the episodes within each season - notice the break between seasons and the gray vertical bars delimiting seasons. There is a lot of data represented, so I've added some summary stats at the very top of each season:

  • the number at the very top is the average of the IMDB rating for all the episodes of the season. The color on a red-green scale compares the seasons to each other: the seasons with highest averages are green, those with worst ratings are red.
  • the arrow under the season average represents the evolution of ratings within the season: if episodes tend to get better and better the arrow will point upwards and be green, if they get worse the arrow will point downwards and be red. If the ratings remain approximately average over the season a horizontal orange arrow is displayed.
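
One possible way to compute the two summaries is sketched below in Python. The season average is straightforward; for the arrow I am assuming a least-squares slope of rating against episode number with an arbitrary flatness threshold, which is just one reasonable choice. The ratings in the example are made up for illustration.

```python
from statistics import mean


def season_summary(ratings, flat_threshold=0.01):
    """Return (season average, arrow direction) for one season.

    The arrow is 'up', 'down' or 'flat' depending on the least-squares
    slope of rating vs. episode index. flat_threshold is a hypothetical
    cut-off, in rating points per episode.
    """
    episodes = range(len(ratings))
    x_bar, y_bar = mean(episodes), mean(ratings)
    sxx = sum((x - x_bar) ** 2 for x in episodes)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(episodes, ratings))
    slope = sxy / sxx if sxx else 0.0
    if slope > flat_threshold:
        arrow = "up"
    elif slope < -flat_threshold:
        arrow = "down"
    else:
        arrow = "flat"
    return round(y_bar, 2), arrow


# Made-up ratings, not actual IMDB data
print(season_summary([8.0, 8.2, 8.4, 8.6]))
print(season_summary([8.8, 8.5, 8.3, 8.1]))
```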

Revisiting the Friends visual, we observe spectacular consistency in ratings over the ten-year period. The summaries on top allow us to see through the noise, and show that seasons 8 and 9 were among the worst in average episode rating as well as the only two that got worse over the season. But the tenth and final season was the highest-rated one, and had a strong positive trend until the series finale.

Although not a huge fan, I had to look at the Simpsons' incredible run. The visual highlights the surprising drop and plateau of the season average starting around the ninth season.

Now looking at some of my favorite series:
I have to agree here that the first seasons of 24 were the best. I also love the way ratings evolve in the final season, when people had very low expectations and thought it would be the same old jack-bauer-aka-superman-prevents-nuclear-attacks-every-thirty-minutes but soon realized this was Jack Bauer on a revenge rampage without any rules.

Breaking Bad is the perfect example of the TV series that keeps getting better, throughout the seasons and throughout the years. Can't wait for the second half of the fifth season!

The last few seasons of Dexter aren't rated as high as the first ones, but ratings remain high.

I loved the first season of Prison Break but clearly it should have been a one-season series; you just couldn't top the suspense and originality.

Game of Thrones started really high but just kept getting better à-la-breaking-bad. The seesaw effect in the third season is rather impressive!

Wednesday, July 3, 2013

Hollywood's lack of originality: "Let's make a sequel!" (Part 3, yes I realize the irony)

As indicated by the title, this is part 3 of the analysis of movie sequels.
In the first post I described the IMDB data used for the analysis and shared some preliminary statistics on the distribution of number of movie installments in movie series.
In part 2 I focused on comparing IMDB ratings for the original movie and its sequel.

This post looks at series with 3 or more installments.

The sequel has a sequel!!!

Looking at multiple installments is a little tricky. Can I compare the average IMDB rating change between installments 1 and 2 with the change between installments 2 and 3? Probably, but only if I look at the same sample. Let me explain myself. To compute the change between installments 1 and 2 I might have 1000 series to look at (2000 movies then). But when looking at the change between 2 and 3, my sample will be smaller (I will no longer have all the series that only had two installments). Is that so much of an issue? It could be if there is what is called a "lurking third variable". Perhaps moviemakers only make a third installment when the second installment wasn't too bad, so series with only two installments could be biased in the sense that they are the ones where the second installment did really terrible and so no third installment was made. So if we really want to compare the drop-off between 1 and 2 with the drop-off (I am assuming it is another drop-off!) between 2 and 3, we should restrict the analysis to only series with at least three installments.

So there are a couple of things we might want to look at:
1) average change between installments 1 and 2 for all movies that have 2 and only 2 installments
2) average change between installments 1 and 2 for all movies that have 3+ installments
3) average change between installments 2 and 3 for all movies that have 3+ installments

Comparing 1) and 2) will give us an idea whether third installments are favored for series where second installments didn't do too badly. Comparing 2) and 3) will allow us to compare the 1 -> 2 and 2 -> 3 effects.

Here are the results:

Series length   Sample size   From installment   To installment   IMDB rating difference
All series      606           1                  2                -0.87
Exactly 2       410           1                  2                -0.92
3 or more       196           1                  2                -0.78
3 or more       196           2                  3                -0.33

So it does seem that series with exactly 2 installments had a larger 1 to 2 installment drop-off (-0.92) than those with 3 or more installments (-0.78), however another (unpaired) t-test revealed that the difference was not significant. Budget and revenue are probably more important factors than IMDB ratings taken into account at Hollywood before deciding whether to make a third movie, but I suspect there is still some correlation between rating and revenue (no worries a future post will address this!).

Another interesting finding is that the drop-off from 2 to 3 is much smaller than from 1 to 2. This can make sense: 
1) First of all, an IMDB rating can only go so low; you can't keep losing a full rating point every time.
2) It is safe to assume the original movie was seen by a rather diverse crowd, whereas the second might have been seen only by the first movie's fan base, and very likely that those were about the same as those who saw the third. In other words, a much bigger overlap is to be expected between those who saw the second and third as opposed to first and second, which translates into more similar ratings.

More than 3?

For more than three installments, the sample starts to shrink quite rapidly, which is why I went for a more visual approach. I normalized all first installments to a score of 100 in order to track the evolution of all subsequent installments. This graph will allow us to shed some light on some hypotheses from the previous section: ratings go down but at a slower pace, and might eventually converge with the same hard-core fan base rating the movies.
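
The normalization itself is a one-liner; the Python sketch below (with a function name of my own) applies it to the Rocky ratings quoted further down in this post.

```python
def normalize_series(ratings):
    """Rescale a series so that its first installment scores 100."""
    first = ratings[0]
    return [100 * rating / first for rating in ratings]


rocky = [8.0, 6.9, 6.3, 6.2, 4.7, 7.2]
print([round(score, 1) for score in normalize_series(rocky)])
```

On this scale a value of 90 simply means the installment is rated at 90% of the original's rating.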

A slightly more confusing plot where trends are harder to establish but which nonetheless conveys both the initial rating drop-off as well as the sharp decrease in series length is the spaghetti-plot, where each line represents the evolution of a given series rating installment after installment.

From the previous two graphs, it appears that the decrease we observed over the first few installments continues until the fourth installment which is typically a series all-time low, with installments 5 and 6 usually bouncing surprisingly back. However, as shown in the spaghetti plot, the sample size is severely reduced and it would be dangerous to draw any strong conclusions.

It also seemed that quite a few series ended on surprisingly good ratings, sometimes the best or second best after the first installment:
  • The Rocky series was strictly decreasing from the start: 8.0, 6.9, 6.3, 6.2, 4.7 but finished with 7.2
  • The Rambo series had a similar pattern: 7.5, 6.0, 5.2, and...  7.1!
  • The Harry Potter series has ratings grouped between 7.2 and 7.6 for the first 7 installments but the eighth and final one finished at 8.1
Could it be that for these series a real effort was made to finish on a strong note? Stallone probably put more thought into the script waiting 16 years for the last Rocky whereas the first 5 came within an average of 3 to 4 years apart from each other. Same for Rambo within a 20 year lapse compared to the three year gaps for the first movies.

Also, in the cases mentioned above, the last installment was usually declared as such, in which case also ratings might be higher from fans saddened to witness the final installment.

Closing thought

What have we learned? The primary conclusion is that the common belief that sequels do worse than the original is definitely valid. Although not proven here, the most likely explanation is that Hollywood doesn't care about making terrible movies as long as they generate a profit but more importantly (since good movies are more likely to generate greater profits than terrible movies, a priori) they want as little risk as possible involved. A guaranteed million dollar profit is better than a 50/50 chance of generating either three million or losing a million.

Thursday, June 20, 2013

Hollywood's lack of originality: "Let's make a sequel!" (Part 2, yes I realize the irony)

As indicated by the title, this is part 2 of the analysis of movie sequels. In the previous post I described the IMDB data used for the analysis and shared some preliminary statistics on the distribution of number of movie installments in movie series.

In this post I will focus on comparing IMDB ratings for the original movie and its sequel.

The next post will look at series with 3 or more installments.

Number 2

I will first focus on the second installment. Analysis of installments 3+ will be dealt with further down.

First of all, how much time goes by before the second installment comes out?

Here's a quick look at the distribution:

Yes, 37 years separate the two installments of The Wicker Man series. It's actually a trilogy in the works with the third installment due in 2014. The first one came out in 1973, and the second in 2010!

A couple of outliers aside, the vast majority of sequels come soon after the original movie: in 77% of cases less than 5 years separate the two, and in 92% of cases less than 10 years separate the two.

Now for the meat of the analysis: how does the second movie's rating relate to the first one's? As done in previous posts, I will focus on the IMDB rating. Even if there are potentially many biases with this metric (die-hard fans, foreign movies not rated as well as US movies...), I was hoping that by looking at differences in ratings between sequels most of these biases would cancel each other out.

The following graph plots the second installment's IMDB rating against the first installment's IMDB rating. Dots above the diagonal indicate sequels that did better, dots under indicate those that did worse.

As expected, sequels more likely exist for profit reasons than for creating all-time classics.

A few fun facts you can re-use at your next dinner party:
  • Only 19.7% of second installments did just as well (4.6%) or better (15.1%) than the first.
  • Some of the worst decreases in ratings go to The Mask (Jim Carrey's original 6.7 movie plummets to 2.1 with Son of the Mask which he wisely stayed away from) and The Exorcist (the original grandiose 1973 classic went from 8.1 to 3.6 in just four years).
  • One of the best increases in ratings goes to Captain America (the last installment that came out in 2011 with a rating of 6.8 is actually considered the sequel to the original 1990 movie that has a rating of 2.9).
  • On average, sequels have an IMDB score 0.9 less than the first movie.
Simple linear models can be run using the data; the main question we can ask ourselves is whether to include an intercept:
Sequel rating = alpha * Original rating (+ intercept)

Actually, a quick graph reveals that the intercept has little impact on the fit itself:

These models suggest that a better relationship than the average 0.9 decrease between the first two installments is that the sequel's rating is either 0.9 + 0.72 * Original Rating or simply 0.85 * Original Rating based on which model you prefer. In both cases we still conclude that the sequel usually does worse.
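
To get a feel for the two fitted relationships, here is a small Python sketch that applies the coefficients quoted above to a few original ratings (the function names and the chosen ratings are mine).

```python
def sequel_with_intercept(original):
    """Model with intercept: 0.9 + 0.72 * original rating."""
    return 0.9 + 0.72 * original


def sequel_through_origin(original):
    """Model without intercept: 0.85 * original rating."""
    return 0.85 * original


for original in (4.0, 6.0, 8.0):
    print(original,
          round(sequel_with_intercept(original), 2),
          round(sequel_through_origin(original), 2))
```

The two lines cross where 0.9 + 0.72 * r = 0.85 * r, that is around an original rating of 6.9, so they give very similar predictions for typical ratings. Note that the intercept version predicts a sequel rating above the original only when the original is rated below roughly 3.2 (where 0.9 + 0.72 * r = r).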

But does it do significantly worse? Significance of the decrease can be assessed with a paired t-test. Quick stat reminder: you absolutely want a paired t-test here. If you were to do a naive t-test between two independent populations you would reach the counter-intuitive conclusion that sequels do just as well, because the between-movie spread of ratings is much greater than the within-series spread. In other words, suppose all movies have a uniform rating between 1 and 10, and all sequels systematically have a rating 0.3 less, making the sequel range go from 0.7 to 9.7. An unpaired t-test would be unable to pick up the systematic downward shift (except with a huge sample size). A paired test is necessary because the two measurements are dependent (not simply because we want to measure significance!).
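
The thought experiment above can be simulated with the Python standard library. This is a sketch under my own assumptions: 200 simulated series, and a small amount of random noise added to the 0.3 drop (without it, every difference would be identical and the paired statistic would be undefined). The unpaired statistic shown is Welch's version.

```python
import math
import random
from statistics import mean, stdev, variance


def paired_t(first, second):
    """t statistic of the per-series differences (second minus first)."""
    diffs = [b - a for a, b in zip(first, second)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


def welch_t(first, second):
    """Unpaired (Welch) t statistic, treating the samples as independent."""
    se = math.sqrt(variance(first) / len(first) + variance(second) / len(second))
    return (mean(second) - mean(first)) / se


rng = random.Random(42)
originals = [rng.uniform(1, 10) for _ in range(200)]
# Systematic 0.3 drop plus hypothetical noise (standard deviation 0.5)
sequels = [rating - 0.3 + rng.gauss(0, 0.5) for rating in originals]

print("paired t:  ", round(paired_t(originals, sequels), 2))
print("unpaired t:", round(welch_t(originals, sequels), 2))
```

Both statistics share the same numerator (the mean drop); only the standard error differs, which is why the paired version is so much larger in absolute value.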

In our case, the paired t-test returns that the observed decrease in IMDB ratings is significant and actually very highly so.

In the next post we will look at series with three or more installments and view how ratings evolve for these longer series.

Saturday, June 15, 2013

Hollywood's lack of originality: "Let's make a sequel!" (Part 1, yes I realize the irony)

Terminator 2, Rocky 3, Alien 4, Scary Movie 5, Fast and Furious 6... When does it stop?

When a first movie works well, we systematically expect a sequel to be released.

This can get quite annoying for the majority of movie watchers who are not part of the aficionados who will see the sixth installment of a series they did not care much about after the first installment. While making a follow-up movie sounds like easy money, coming up with a good movie is very hard when you think about it. Of course you potentially have good characters with whom the audience connected well in the first movie. But after that you no longer have the element of surprise, you don't have all the interesting scenes that introduce the characters, you have to go completely off the roof to surpass the intrigue, suspense, action of the first movie. And this often fails. Everybody "knows" that ratings for follow-ups are worse than for the original, but how often is that actually true? If they were always significantly worse, would producers continue to produce them? Are they significantly worse or just a tad?

In this and following posts I want to take a second look (!) at movie sequels.


A few words on the data. I pulled the list of movie series from Wikipedia, at the following links: [one, two, three...]

The lists were not 100% accurate, but I figured they did a rather decent job.

Then I merged all the series with my separately downloaded IMDB data. Again, not 100% perfect in the matching and merging procedure, and there were some discrepancies in title names and release dates, but overall I was able to keep drop-outs to a minimum.

Some clean-up was then executed, removing movies that came out straight to video or TV, as well as incomplete series. Sequels are not a recent phenomenon, as can be concluded from the four-installment series of the Wizard of Oz in 1910, or even the six-installment silent series of Sherlock Holmes from 1908 to 1910! However, I did not want to go too far back in time, given the many biases around old movies, and wanted to focus on the more recent "sequel effect". I thus looked only at series where the first installment was released after 1970.

This still left me with a little over 600 series and over 1500 different movies. Not too shabby to get a pretty decent idea of the sequel dynamics!

Series length

Before anything else, let us take a quick look at the current status of series lengths. More than the actual numbers themselves, it is primarily the distribution of series lengths that interests us. How many series stop after the second installment? How many go on to make an 8th? While pulling the data I arbitrarily cut off at 10, so super long series such as James Bond are not accounted for here.

As one might have expected, series are usually quite short, and series with exactly two installments account for over half of all series (53%) and series with exactly three installments account for another 25%.
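The tally behind these percentages is simple to reproduce. Below is a rough sketch, assuming each series has already been reduced to its number of installments. The list of lengths is made up for illustration, the function name is hypothetical, and treating the cut-off as "keep series with 2 to 10 installments" is my reading of the rule rather than something stated precisely.

```python
from collections import Counter


def length_distribution(series_lengths, max_length=10):
    """Share of series per installment count, ignoring series beyond the cut-off."""
    # Assumption: a "series" needs at least 2 installments and at most max_length.
    kept = [n for n in series_lengths if 2 <= n <= max_length]
    if not kept:
        return {}
    counts = Counter(kept)
    return {n: counts[n] / len(kept) for n in sorted(counts)}


# Hypothetical installment counts, one entry per series (not the real data).
lengths = [2, 2, 3, 2, 4, 2, 3, 12, 2, 5]

shares = length_distribution(lengths)
for n, share in shares.items():
    print(f"{n} installments: {share:.0%}")
```

In this toy input the 12-installment series is dropped by the cut-off, so the shares are computed over the remaining nine series.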

In the next post, we will look at sequel ratings, and whether there is indeed a drop-off compared to the original movie.
In the post following that we will look at series with three or more installments and view how ratings evolve for these longer series.

Wednesday, May 29, 2013

If you're the San Antonio Spurs...

I don't know about you, but I've found these NBA Playoffs to be more exciting than the last editions. Aside from Indiana-Atlanta (no offense), all first round match-ups had some backstory to them.
It got even better in the second round, with the Bulls shocking the basketball world by stealing Game 1 against Miami, and Stephen Curry catching fire NBA-Jam-style in San Antonio. And then we get to the conference finals, with buzzer-beaters to win games, crazy rallies and every other game going to overtime. The drawback of course is that I cut my life expectancy by three years due to the adrenaline rushes.

Okay, enough of the NBA advertising (I swear I'm not getting a dime from David Stern), what do the stats whisper as of today May 29th? Well, Spurs have some waiting to do until they figure out who their Eastern opponent is going to be...

Indiana or Miami in the Finals??

Tough choice if you've watched the last games, but the stats are oblivious to feelings, only hard facts matter, and the Heat remain a strong favorite, with almost 66% probability. Here's the probability breakdown:

Winner Number of Games Probability
Miami 6 32.5%
Indiana 6 17.3%
Miami 7 33.1%
Indiana 7 17.1%

Spurs' 5th trophy?

As of today, here is each team's probability of bringing the Larry O'Brien trophy home:

NBA team Champion Probability
SAS 47.5%
MIA 39.6%
IND 12.9%

If Miami wins the series against Indiana, San Antonio's probability of winning it all drops to 39.8%:

Winner Number of Games Probability
San Antonio 4 4.4%
Miami 4 7.8%
San Antonio 5 8.2%
Miami 5 17.6%
San Antonio 6 15.2%
Miami 6 15.5%
San Antonio 7 12.0%
Miami 7 19.3%

If Indiana wins the series against Miami, San Antonio's probability surges back to 62.5%:

Winner Number of Games Probability
San Antonio 4 7.9%
Indiana 4 4.0%
San Antonio 5 18.9%
Indiana 5 7.3%
San Antonio 6 15.2%
Indiana 6 15.2%
San Antonio 7 20.4%
Indiana 7 11.1%

The big probability swing is of course due to the historical performance of both teams over the regular season and playoffs but also to the change in home court advantage: San Antonio has homecourt against Indiana, not against Miami.
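As a sanity check, the overall championship probabilities can be rebuilt from the conditional tables with the law of total probability. The sketch below uses only numbers quoted in this post; the variable names are mine, and this is a reconstruction of the arithmetic rather than the code that produced the tables.

```python
# Eastern Conference Finals: sum the 6-game and 7-game rows for each team.
p_mia_wins_east = 0.325 + 0.331
p_ind_wins_east = 0.173 + 0.171

# San Antonio's title probability against each possible opponent (from the text).
p_sas_given_mia = 0.398
p_sas_given_ind = 0.625

# Law of total probability: weight each scenario by how likely it is.
p_sas = p_mia_wins_east * p_sas_given_mia + p_ind_wins_east * p_sas_given_ind
p_mia = p_mia_wins_east * (1 - p_sas_given_mia)
p_ind = p_ind_wins_east * (1 - p_sas_given_ind)

print(f"SAS {p_sas:.1%}  MIA {p_mia:.1%}  IND {p_ind:.1%}")
```

The results land within about a tenth of a percentage point of the 47.5% / 39.6% / 12.9% reported above; the small gaps come from the inputs already being rounded to one decimal.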

So while San Antonio will not officially voice any preferences (and might even suggest Miami 'cause you're only considered the best if you beat the best), I can't help but feel that deep down they have got to hope for the inexperienced Pacers to end up against them on basketball's greatest stage.

Sunday, May 19, 2013

Hollywood movie trends

One night in 1996, a twenty-three year-old Larry Page attempted to work out a way of downloading the Internet.

My objective the past few months was slightly less ambitious (and for that reason will probably not make me a multi-billionaire 20 years from now). I decided to download IMDB.

If you like numbers and stats (and movies!), IMDB is rather awesome and has the potential for endless statistical analyses. I already wrote a post a while back on the collaboration between Johnny Depp and Tim Burton and attempted to answer who had benefited the most from the collaboration. I have many other analyses in mind, but before looking at those, I thought I would first take a step back and look at the evolution of the movie industry almost from the very beginning.

I would first like to point out that as exhaustive as I would like my analyses to be, there are limitations. First of all, there most likely is a bias as to which movies make it into IMDB or not. There is also a bias as to who rates the movies and how. I watch quite a few movies and systematically notice that almost all French movies (even the really good ones) get pretty bad scores. Now that I think about it, this could be an analysis in itself!

The first thing I looked at was the distribution of ratings over time (gray boxplots) as well as the number of movies (red line) since the end of the 1800s.

From a size perspective, it is striking both how the number of movies per year has exploded hockey-stick style and how abruptly it saturated around 2005. There is a possibility that I was not able to pull all movies for the later years (2012 data seems particularly suspicious), but there still remains rather strong evidence that in a period of 15 to 20 years we have reached saturation in the number of movies coming out, at around 10K a year. You'd still have to watch 28 a day if you wanted to see them all!

As for the distribution of IMDB scores, the median (black dot in the middle of the gray boxes) has remained remarkably stable over time, hovering around 6. The spread has increased slightly, as have the extremes, but this is also a consequence of the increase in the number of movies produced: the more that come out, the more likely we are to get very bad ones as well as very good ones!

Let's now take a quick look at the proportion of movies of the different genres over time. Again, IMDB is a little tricky here as a movie can have multiple genres. Here I associated each movie with its primary genre. I also ran the analysis taking multiple genres into account (so that the original Star Wars was triple-counted as Action, Adventure and Fantasy). It turns out that the results were extremely similar. Out of the 28 unique genres, I have only represented those with sufficient fractions and with the most interesting trends.
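To make the two counting schemes concrete, here is a small sketch. It assumes each movie comes with an ordered list of genres and that the first one listed is the "primary" genre; both the data layout and the function name are assumptions for illustration. In the multi-genre variant, shares are taken over the total number of genre tags, which is one of several reasonable choices.

```python
from collections import Counter


def genre_shares(movies, primary_only=True):
    """Fraction of genre tags per genre, using either the first genre or all of them."""
    counts = Counter()
    for genres in movies:
        if not genres:
            continue  # skip movies with no genre information
        # Assumption: the first listed genre is the primary one.
        counts.update(genres[:1] if primary_only else genres)
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()} if total else {}


# Tiny made-up sample; only the Star Wars genre triple comes from the text.
movies = [
    ["Action", "Adventure", "Fantasy"],  # e.g. the original Star Wars
    ["Drama"],
    ["Action", "Drama"],
    ["Documentary"],
]

primary = genre_shares(movies)
multi = genre_shares(movies, primary_only=False)
print("primary only:", primary)
print("all genres:  ", multi)
```

Computing these shares year by year, rather than over the whole sample, would give the kind of genre-proportion curves discussed here.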

The most remarkable trend is that of short movies. While almost 50% of the 1910 movies were short features, that number stabilized around 5% for almost 40 years until the 1990s, before increasing suddenly to almost 1 out of 3 movies nowadays. The trend for Documentaries has been very similar, but stabilized at about 1 in 6 movies in recent years.

Westerns were popular back in the day, but have been almost unheard of since the 1970s. Instead, the 1970s mark the start of the simultaneous 30-year-long golden age of both Action and Adult movies. I have to admit that I was surprised by the decline in Adult movies since the late twentieth century, as I figured the oldest job in the world would continue to be the oldest inspiration in the world. Might be worth further investigation in another post (I do expect quite a few hits on that one...).

Here are some other interesting findings I made along the way:

War movies: Popular during war times

While most genres have trends similar to the one described for the first figure, with an explosion in the number of movies around 1990 that has since stabilized, War movies are a nice exception:

War movies come during and after wars: look at World War 2, Vietnam, a small spike around the first Gulf War, and a big surge right after 2001.

Horror movies: Stop making these!!!

In terms of IMDB ratings, most genres follow the general trend of a long-term stabilization around 6. Horror movies have quite a different story to tell:

Although the number of Horror movies being released follows the same S-curve we've seen for the general trend, ratings for Horror have essentially dropped continuously since the 1920s!

Film what?

While almost all genres still exist today (and are actually released at an increased pace), the Film Noir genre is quite particular:
Its golden age lasted less than 20 years around the 1950s, and no Film Noirs have been made since then. Quite unfortunate when we see that certain gems like "The Third Man" with Orson Welles fall in this category!

Rated SM R for super mature

We observed earlier some surprising tendencies in the Adult movie category, here's a deep dive... I mean a closer look:

I have some doubts about the reliability of these numbers, and don't think that the number of Adult movies released per year is less than 500 (especially given that the revenue from videos is estimated to be in the $0.5 billion to $1.8 billion range)! It really depends on whether IMDB performs some filtering as to which movies get added to the database.
That being said, even if the absolute number of movies released is biased, the distribution of the ratings is more trustworthy, and the "n" shape is quite interesting: ratings seem to have saturated at around 6.5 but then suddenly plummeted over the past few years.

So that was a quick overview of what can be pulled from IMDB. As mentioned at the beginning of the post, expect many more analyses to follow!