Sunday, December 9, 2012

Genetics and Regression toward the mean

"Regression toward the mean" is a term used in many different fields (sometimes quite abusively), but I wanted to take a quick look at it from a genetics perspective, given that this is where it originally comes from.

Sir Francis Galton first observed this phenomenon when comparing parents' heights to their children's heights. He noticed that while there was a strong correlation between the two, extreme parent heights tended to lead to less extreme heights in their offspring. As heights from one generation to another became less and less extreme and trended towards some constant, later identified as the mean, the expression "regression toward the mean" appeared.

(from http://www.animatedsoftware.com/statglos/sgregmea.htm)

Now an obvious paradox comes out of this: if all heights become less and less extreme and converge, wouldn't we all be of the same height today? How is diversity maintained?

As mentioned I will look at this from a genetics perspective under some simple assumptions.

First model

To kick things off, I will consider a certain trait such as height, and assume that each child will have the average height of his two parents. For simulation purposes, I will consider a population of 100 males, 100 females, and each couple having exactly two children, one boy one girl in order to maintain population size and composition. The physical trait of the original population is generated via a discrete 0-100 uniform distribution.

Let's see how this evolves over 20 generations:
The green line represents the population average for the trait, the blue region the interquartile range (25th to 75th percentile, meaning that 50% of the population has a value in this range), and the gray region the min-to-max range of the population, so that 100% of the population is within the colored regions.

The population average remains constant, which makes sense: if the children's height is the average of the parents' height, the average height of the new generation is going to be the same as for the previous generation. We clearly see the regression toward the mean in effect, and the speed of convergence is rather fascinating. While we have a min and max of 0 and 100 in the first generation, the entire population has a value between 49 and 55 in the 10th generation.
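The first model can be sketched in a few lines. This is only an illustration under the assumptions stated above (100 couples, random pairing, one boy and one girl per couple, both with the exact parental average); the function names and the random seed are made up for the example, and the numbers it prints will not match the graph exactly.

```python
import random
import statistics

def next_generation_blend(males, females, rng):
    """Pair males and females at random; each couple has one boy and one girl,
    both with the exact average of the two parents' trait."""
    shuffled = females[:]
    rng.shuffle(shuffled)
    children = [(m + f) / 2 for m, f in zip(males, shuffled)]
    # The boy and the girl of a couple share the same value in this model.
    return children[:], children[:]

def simulate_blend(n_couples=100, generations=20, seed=0):
    """Return (min, mean, max) of the trait for each generation."""
    rng = random.Random(seed)
    males = [rng.randint(0, 100) for _ in range(n_couples)]
    females = [rng.randint(0, 100) for _ in range(n_couples)]
    history = []
    for _ in range(generations):
        population = males + females
        history.append((min(population), statistics.mean(population), max(population)))
        males, females = next_generation_blend(males, females, rng)
    return history

history = simulate_blend()
for gen in (0, 5, 10, 19):
    low, mean, high = history[gen]
    print(f"generation {gen:2d}: min={low:.1f} mean={mean:.1f} max={high:.1f}")
```

The mean stays put while the min-max band shrinks at every generation, which is the convergence described above.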

Our assumption that the physical trait is the exact average of both parents is a stretch and few traits are passed on this way. Let us look at a more biologically-accurate model.


Second model

How is genetic information actually passed on from generation to generation? I will not go into the full details, but we know that we inherit two copies of each gene, one from mom and one from dad. Assuming there is a gene that directly controls height, dad will send one version of this gene (called an allele), and so will mom. Since they themselves have two copies of each, there is a 50/50 chance as to which one my dad will give me, and the same goes for mom.

(from http://cikgurozaini.blogspot.com/2010/06/genetic-1.html)

So let's take another model which replicates this. Each parent has two alleles (again we draw from a discrete 0-100 uniform distribution) and randomly passes one to the offspring.
We also assume that the value we observe for the individual (phenotype) is the average of these two values.
So if dad has 10 and 50, we will see him as a 30, and if mom is 30 and 90, we will see her as a 60.
What about the offspring? Well, here are the possible outcomes:

  • 25% chance the child will be 10 and 30 (phenotype 20)
  • 25% chance of being 10 and 90 (phenotype 50)
  • 25% chance of being 50 and 30 (phenotype 40)
  • 25% chance of being 50 and 90 (phenotype 70)

Now this is where we see the regression to the mean effect: the expected value of the child is 45 (0.25 * 20 + 0.25 * 50 + 0.25 * 40 + 0.25 * 70), which happens to be the parents' average (0.5 * 30 + 0.5 * 60). But this is only an expectation, and diversity is maintained with the child's value possibly ranging from 20 to 70.
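The four outcomes and their expectation can be checked by brute-force enumeration. A minimal sketch, assuming (as above) that each parent passes either allele with probability 1/2 and that the phenotype is the average of the two alleles; the function name is hypothetical.

```python
from itertools import product

def offspring_distribution(dad, mom):
    """Return (allele pair, phenotype, probability) for each equally likely child."""
    outcomes = list(product(dad, mom))  # one allele from dad, one from mom
    prob = 1 / len(outcomes)
    return [((a, b), (a + b) / 2, prob) for a, b in outcomes]

dad, mom = (10, 50), (30, 90)
dist = offspring_distribution(dad, mom)
expected = sum(phenotype * p for _, phenotype, p in dist)
parents_average = (sum(dad) / 2 + sum(mom) / 2) / 2

for alleles, phenotype, p in dist:
    print(alleles, phenotype, p)
print("expected child phenotype:", expected)
print("parents' average phenotype:", parents_average)
```

The expectation matches the parents' average, but each individual child lands on one of the four phenotypes, which is where the spread comes from.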

So what happens when we repeat the process over and over?
(For a reason I will explain later, I have increased population to 20,000 instead of 200 from the first model)



A thousand generations down the road not much has changed! Looking at the count of unique phenotypes and alleles confirms this population heterogeneity:



The expectation is for children to have their parents' average height, but random mix-and-matching ensures diversity. Also note that mutations can occur, thus creating newer alleles!

This model, much more accurate both from a genetics standpoint and from what we observe in our everyday life, explains the paradox behind the original regression toward the mean: it's not because the expectation is the mean that you will converge towards it.

So there we have it, regression toward the mean does not imply we will all be clones of each other down the road!

Side note on population increase:
At the start of the second modeling phase, I mentioned that I had increased the size of my population from 200 to 20,000. This is because alleles can disappear in small populations: if dad has alleles a1 and a2 and two kids, the probability that a1 never gets passed on is 25% (0.5 * 0.5). So at each new generation, each allele copy carried by a parent has a 25% probability of not being passed on. Of course allele a1 could be present in another parent who might pass it on, but the fact remains that all alleles are at constant risk of not being passed on. When simulating over many generations, this is actually what happens, and once a given allele disappears from the population, it can never be generated again (forgetting mutations for an instant).

Looking at a population of only 200 individuals:

After 1000 generations, the entire population has converged to a unique height. The loss of alleles (and thus phenotypes) can be viewed here:
The number of unique alleles is decreasing, as alleles can disappear but never re-appear. As for the number of phenotypes, it can increase. Indeed, suppose mom and dad are both 10-30 (same phenotype of 20): their offspring can be 10-10 (phenotype 10), 10-30 (phenotype 20) or 30-30 (phenotype 30), so potentially three different phenotypes. The number of phenotypes is nonetheless going to be driven by the number of alleles. The fewer distinct alleles there are in the population, the fewer distinct combinations you can create.

If we increase the sample size, we decrease the probability of alleles disappearing. It would be quite unlikely for all parents with allele a1 to not pass it down to their children! However, theory around random walk indicates that this is inevitable after sufficient generations, no matter the size of the population (this is what we started to observe in our 20,000 population with the number of alleles starting to decrease around generation 400).
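The loss of alleles in a small population can be reproduced with a short simulation of the second model. This is a sketch under the same assumptions as the text (two alleles per individual drawn from a discrete 0-100 uniform distribution, random pairing, one boy and one girl per couple, no mutations); function names and the seed are assumptions for the example, not the code behind the graphs.

```python
import random

def next_generation_alleles(males, females, rng):
    """Each couple has one boy and one girl; each child receives one
    randomly chosen allele from dad and one from mom."""
    shuffled = females[:]
    rng.shuffle(shuffled)
    boys, girls = [], []
    for dad, mom in zip(males, shuffled):
        boys.append((rng.choice(dad), rng.choice(mom)))
        girls.append((rng.choice(dad), rng.choice(mom)))
    return boys, girls

def count_unique_alleles(males, females):
    return len({allele for person in males + females for allele in person})

def simulate_alleles(n_couples, generations, seed=0):
    """Return the number of distinct alleles in the population at each generation."""
    rng = random.Random(seed)
    males = [(rng.randint(0, 100), rng.randint(0, 100)) for _ in range(n_couples)]
    females = [(rng.randint(0, 100), rng.randint(0, 100)) for _ in range(n_couples)]
    counts = [count_unique_alleles(males, females)]
    for _ in range(generations):
        males, females = next_generation_alleles(males, females, rng)
        counts.append(count_unique_alleles(males, females))
    return counts

counts = simulate_alleles(n_couples=100, generations=200)
print("distinct alleles at generations 0, 50, 100, 200:",
      counts[0], counts[50], counts[100], counts[200])
```

Because children only ever carry alleles that their parents had, the count can only go down, and with 200 individuals it goes down quickly. Raising `n_couples` slows the decline without stopping it.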


Saturday, September 22, 2012

Evolution of movie ratings




We've all checked a movie's IMDB rating to decide whether to go see it at the movies, but ever notice how the rating seems to drop when you check a few days later?

Here we look at the evolution of 18 movies' ratings before and after their release to see whether this holds on this very small subset.

The hypothesis would be that there is self-selection bias in a movie's ratings: die-hard fans are the first to go see it (most often before its official release) and rate it, and, being die-hard fans, they rate it higher than the average audience member would.

For the 10 movies of interest I extracted (almost) daily ratings. Then, for each movie I compare the daily ratings to the rating on day 0 as baseline. This is defined as "IMDB rating delta" in the following graph:

Despite the small sample size, we can clearly see a decreasing trend starting a few days before release and continuing downwards until two weeks after the official release. On average, movies lose 0.4 out of 10 in IMDB rating over that period.

After the second week, things start to stabilize somewhat, as the number of ratings increases and we tend towards the movie's final long-term rating. Based on the plot it seems as if there is a slight increase from week 2 onwards, but this is most likely due to the small sample size, which gets even smaller further out (while I have day 0 data for all movies, I do not necessarily have day 30 data for all).
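The "IMDB rating delta" computation can be sketched as follows. The ratings below are made up purely for illustration (they are not the data behind the graph), and the function names are hypothetical. Each day's average only uses the movies that have data for that day, which is why later days rest on fewer movies.

```python
def rating_deltas(ratings_by_day):
    """ratings_by_day maps a day offset (0 = release day) to the rating seen that day.
    Returns the difference to the day-0 rating for every available day."""
    baseline = ratings_by_day[0]
    return {day: round(rating - baseline, 2)
            for day, rating in sorted(ratings_by_day.items())}

def average_deltas(movies):
    """Average the deltas per day over the movies that have data for that day."""
    per_day = {}
    for ratings_by_day in movies.values():
        for day, delta in rating_deltas(ratings_by_day).items():
            per_day.setdefault(day, []).append(delta)
    return {day: sum(values) / len(values) for day, values in sorted(per_day.items())}

# Hypothetical ratings, invented for this example only.
movies = {
    "movie_a": {-3: 7.4, 0: 7.2, 7: 7.0, 14: 6.9},
    "movie_b": {0: 8.0, 7: 7.8},
}
for day, delta in average_deltas(movies).items():
    print(f"day {day:+d}: average delta {delta:+.2f}")
```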

But what about the metascore (aggregated score from www.metacritic.com based on "official" critics)? The same type of plot can be generated:



The pattern here is somewhat the same, with a sharp decrease right around the release date, but stabilization is much quicker. This makes sense since, as noted in the earlier post on individual movie evolutions, critics tend to publish their opinions right around release dates, not weeks after.

Hopefully I will be able to extend the analysis to more movies over greater time periods, but data extraction is in real-time unfortunately, no "back to the future" to speed things up ;-)

Friday, August 17, 2012

The Dark Night Released



Today I wanted to take a closer look at the evolution of a movie's metrics around its release time. Do ratings go up, down? How long does it take for the ratings to stabilize? What about a movie's metascore (aggregated score from www.metacritic.com based on "official" critics)?

To kick this off, I looked at three movies for which I had sufficient data: Ice Age 4, Savages and The Dark Knight Rises. I was also interested in The Dark Knight Rises to see if the events surrounding the Aurora shooting had any impact.

Ice Age 4: Continental Drift

For this movie the number of voters really jumped the week after the release (dashed vertical line).

As for the rating it dropped from a 7.1 pre-release "hype" to stabilize itself at 6.9. Interesting to see how quickly it stabilized (aside from the mini bump the second week, the rating of 6.9 was achieved the same weekend the movie was released).

Metascore followed a similar evolution with a sharp decrease right around release date. The stabilization in rating makes more sense as not many magazines and newspapers review the movie after its release.



Savages

Unfortunately I was not able to gather any pre-release data for this movie, but interesting to see again that the biggest jump in IMDB voters occurred two weeks after release.

Again, the rating very quickly stabilized a few days after release to 6.8.

 The metascore dropped fairly late although only by one point.


The Dark Knight Rises

 Despite trying to pull IMDB data up to two weeks in advance I was not able to collect any pre-release data for the third installment of the Dark Knight series.

Ratings dropped steadily from 9.2 to 8.9 over two weeks, again suggesting a fairly quick stabilization, especially as the number of voters increases.

The metascore had a sharp dropoff right around the release date, and the shooting, although it is unclear if there is any type of relation. Based on our very limited sample size it seems that metascore and rating always tend to drop around the release.



I am currently pulling more data for these movies and 15 others, so come back to check updated and new results. I will also try a meta-analysis to see if there is some common pattern in the evolution of the metrics across all movies.

Saturday, July 28, 2012

Improving odometer granularity

I used to have a very short commute by car but never gave it that much thought. However, when my insurance company asked me to estimate my yearly commute I was backed into a corner. It was only a few miles, but was it 2 or 3 miles? Both numbers are small, but relative to each other that's a 50% difference!

And when multiplying by two trips each day and about 250 commuting days, that's quite a difference in total miles. I doubt that I would have benefited from a much larger rebate travelling 1000 or 1500 miles per year for commuting purposes, but it was too late, I needed to know.

The problem was not as simple as checking the odometer at the beginning and end of the trip, end of story. The issue was the odometer's granularity: it only showed whole miles. Looking at the odometer more carefully over a few days showed that on some days the mile wheel would turn twice, on others three times. So what was my commute?



Intuitively it seemed like if I kept track of how many times the mile wheel turned each time and then averaged that out I could get an estimate with as much precision as I wanted. But was this intuition correct?

Well, let's say my odometer is at O + x miles, where O is the truncated mileage, and x is the fraction of mile started (x in [0, 1[). Looking at my odometer I only see O, but after (1-x) miles I will see O+1.
Let us also say that my commute is C + y where C is the truncated mileage and y the extra fraction mileage (y in [0, 1[).



So I originally see O miles on my odometer, and after my commute I will be at O + x + C + y miles, and the odometer will show:
  • O + C          if (x + y) < 1
  • O + C + 1    if (x + y) >= 1
We can assume x to be a random uniform variable on [0, 1[. y however is fixed and part of the commute.

So my odometer will indicate a commute of C + 1 when x >= 1 - y, which happens with probability y, and C otherwise, with probability 1 - y.

Therefore, by averaging my estimated daily commutes I will get an expected value of:
y * (C + 1) + (1 - y) * C = C + y which is the true value I wanted to get to.

So averaging the bi-modal values of C and C + 1 will provide me an unbiased estimate of my true daily commute.
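A quick simulation makes the argument concrete. This is a sketch under the same assumption as above, namely that the starting fraction x is uniform on [0, 1[ and independent from trip to trip; the 2.3-mile commute is an arbitrary example value and the function names are made up.

```python
import math
import random

def observed_commute(start, commute):
    """Number of times the mile wheel turns during a trip of length `commute`
    when the odometer starts at `start` and only shows whole miles."""
    return math.floor(start + commute) - math.floor(start)

def estimate_commute(commute, n_trips, seed=0):
    """Average the whole-mile readings over many trips."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_trips):
        x = rng.random()  # fraction of the current mile already driven, uniform on [0, 1)
        total += observed_commute(x, commute)
    return total / n_trips

for n in (10, 100, 10_000):
    print(n, "trips ->", estimate_commute(2.3, n))
```

Each single reading is either 2 or 3, but the average settles closer and closer to the true value as the number of trips grows.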

Of course I could also have gone back and forth an entire weekend dividing total trip commute by number of back-and-forths...

But it's not like I have time to lose ;-)

Friday, July 20, 2012

Can the lottery code be cracked?



I just came across this article about a Canadian who broke the Scratch Lottery Code.

The article is about Mohan Srivastava, who came across two unscratched Tic-Tac-Toe tickets near his desk. Not a big lottery fan but with nothing else to do, he scratched them. Lost on the first, won $3 on the second. He went to cash in his winnings at the nearby gas station, but all the while he started thinking about how these tickets are created. Because the lottery corporation needs to keep careful track of how many winning tickets get printed, the computers can't just generate random numbers onto each card: winning and losing tickets need to be carefully created while giving the illusion of being completely random.

The more he thought about it, the more he became convinced there could be a way to determine whether a ticket was a winning ticket or not without having to scratch it. Here's what a Tic-Tac-Toe ticket looks like:



You have 8 3-by-3 grids with visible numbers. You then scratch the 24 numbers on the left. If any 3 of the 24 numbers are lined up in any direction in any of the 8 grids, you've won!

Well Mohan bought many tickets, scratched them all, and ultimately found a relatively simple rule to determine ahead of time whether a lottery ticket is a winning ticket or not.

The trick? In your lottery ticket you have 72 numbers visible in the grids (8 * 3 * 3). Because the numbers go only up to 39, some will have to occur multiple times. Mohan kept track, for each number, of how many times it occurred on the ticket:



He was especially interested in singletons that appeared only once. His finding was that if three singletons lined up, then you had a winning ticket!
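The singleton rule is easy to express in code. The sketch below is my own reading of the rule as summarized here, not Mohan's actual procedure: count how often each visible number appears across all grids, then look for a row, column or diagonal made only of numbers that appear once. The toy tickets use two grids instead of eight to keep the example short, and all names are hypothetical.

```python
from collections import Counter

def grid_lines(grid):
    """Yield the 8 lines (3 rows, 3 columns, 2 diagonals) of a 3x3 grid."""
    for i in range(3):
        yield [grid[i][0], grid[i][1], grid[i][2]]  # row i
        yield [grid[0][i], grid[1][i], grid[2][i]]  # column i
    yield [grid[0][0], grid[1][1], grid[2][2]]      # main diagonal
    yield [grid[0][2], grid[1][1], grid[2][0]]      # anti-diagonal

def has_singleton_line(grids):
    """True if some line in some grid contains only numbers that occur once on the ticket."""
    counts = Counter(n for grid in grids for row in grid for n in row)
    return any(all(counts[n] == 1 for n in line)
               for grid in grids for line in grid_lines(grid))

# Toy tickets (made up): in the first one 1, 2 and 3 are singletons lined up in a row.
flagged = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[4, 5, 6], [7, 8, 9], [10, 11, 12]],
]
not_flagged = [
    [[1, 4, 5], [6, 7, 2], [8, 9, 4]],
    [[4, 5, 6], [7, 8, 9], [5, 6, 7]],
]
print(has_singleton_line(flagged), has_singleton_line(not_flagged))
```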



I am not 100% sure why that is or how Mohan came to this conclusion, but intuitively I think the algorithm works something like this (at a very high level):

  • randomly separate numbers from 1 to 39 into two sets, 24 singletons and the 15 duplicates
  • fill in the grids with duplicate numbers, and with at most 2 singleton numbers
  • list the 24 singleton numbers on the left hand side

This will guarantee that the ticket is a losing one. You can make it a winning one by adding three singletons in a row, column or diagonal in any grid. This will guarantee that the ticket is a winning one with only one gain.

This is overly simplistic but it does allow a simple way to generate the tickets. Creating random grids and then testing whether they are winners or losers before printing them would take too much time, whereas the above process creates them in one shot without testing.

On the example given in the article we can see that the 24 numbers given on the left come out a total of 35 times in the grids (average 1.46), whereas the other 15 numbers not listed come out 37 times (average 2.47). This hints that the duplicates are indeed something like background noise, and the 24 numbers selected on the left are actually much less likely to occur in the grids. And this is where we are fooled: if we reveal 24 numbers, we should find a lot more of them in grids made out of 39 different numbers, shouldn't we?

The article also talks about the lottery industry in general and how the mob uses the lottery for money laundering.

And as to why Mohan revealed the trick instead of making a fortune? He simply calculated that if he were to spend all his time identifying winning tickets in each store across town, scratching them, and redeeming them, he would make less than at his current full-time job!

All in all, a rather interesting read, although I already gave out the end...

Tuesday, July 17, 2012

Media Noise Unit

There was a recent French article in Les Echos on a new concept aimed at measuring media intensity on news events: Media Noise Unit (unité de bruit médiatique in French).

The article is in French so I'll only summarize the gist of it. Three doctorate students derived this new metric to measure the intensity of an event across all media types (TV, print, radio) and aggregated into a single value based on the audience reach. So in a given day the French Elections might have an overall intensity of 412 for instance. The number in itself isn't important, it's how it compares to other events that same day and over past periods that is.

The main learning from a wider analysis based on this new metric is easily summarized: modern media tend to spend more and more time on fewer and fewer topics, what the researchers labeled "media craze". In a given day, only a handful of events will have very disproportionate intensity compared to the other topics, and the trend has sharpened over the years. The top "noise-makers" today make twice as much noise as the top ones from 5 years ago.

Noise units went through the roof in 2011

The article goes on to describe the paradox behind this: in a world where access to information is getting wider and richer, we are being saturated with just a few events. However, the article notes, humans have been shown to remember only up to three news events in a day, two of which they will forget within the next 24 hours...

I found the article very interesting from an analytical point of view: I am especially fond of techniques that bridge the qualitative-quantitative gap. Being able to measure all the activity around a given topic seems like a very difficult task, but it enables many interesting insights to be derived afterwards. Comparing the intensity of the hottest topics over time is pioneering work!

But I am also a little skeptical about this technique and metric, which seem almost too good to be true. Even a human would have a hard time labeling an article as being part of a certain topic or not, so how well can this task be automated? During the French elections, were all articles in which one of the candidates' names appeared labeled as part of "French elections"? But what about routine actions the government took at that time, or visits abroad by the president? And certain events will trigger other, more general analyses and essays that apparently have little to do with the original event. Are these counted or not?

I also saw no mention of online media, which is a growing source of information. Restricting to TV, print and radio might introduce some selection bias into the analysis.

Anyway, even if not perfect there are some interesting insights. I would actually be very curious if these insights hold outside of France...


Thursday, June 21, 2012

Homecourt and rest time advantage

In a previous post, I looked at the true impact of homecourt advantage in the NBA, for the league in general and for each individual team. The model was simple, only considering whether the game was at home or away.

The main take-away was that playing at home bumped your probability of winning by almost 20 percentage points, from 40% to 60%. Quite a significant jump, although not every team observed the same jump.

I did however feel that the model was a little over-simplistic in ignoring another phenomenon which could impact a team's performance: rest time between games. Especially over the 2011-2012 condensed season with certain teams playing back-to-back-to-back games, one can definitely wonder how rest days come into play. If a team is playing on the road, can the fact that they have had three days of rest as opposed to their opponent's back-to-back games mitigate the opponent's home court advantage?

The data and methodology are almost identical to the post I mentioned earlier: I looked at all 2009-2010 and 2010-2011 games, and for each match-up looked at which team played at home and how many days of rest each team had.

Since we are now looking at multiple variables instead of just the homecourt impact, I will only provide the breakdown of results for the league in general; providing them for each team would just take up too much space.


Impact of rest days

The following table provides the victory probability based on where the game is played and the number of rest days for both teams.

Team A at home   Rest days (Team A)   Rest days (Team B)   Win probability
Yes              1                    1                    59.1%
No               1                    1                    40.9%
Yes              2                    1                    65.2%
No               2                    1                    47.4%
Yes              3+                   1                    62.6%
No               3+                   1                    44.6%
Yes              1                    2                    52.6%
No               1                    2                    34.8%
Yes              2                    2                    59.1%
No               2                    2                    40.9%
Yes              3+                   2                    56.4%
No               3+                   2                    38.3%
Yes              1                    3+                   55.4%
No               1                    3+                   37.4%
Yes              2                    3+                   61.7%
No               2                    3+                   43.6%
Yes              3+                   3+                   59.1%
No               3+                   3+                   40.9%


Some interesting highlights are that:
  • independently of the number of rest days each team has had, the homecourt advantage is always around 17-18 percentage points
  • the homecourt effect is much stronger than the effect of rest days: even in the best case scenario, the win probability on the road is 47.4%, so essentially a +7 percentage point uplift due to rest days (up from 40.9%), as opposed to the +20 points we saw in the previous post for the homecourt advantage.
  • resting two days improves the probability of victory compared to one day only, and so does resting three or more days, but two days is actually preferable to three or more. This is also a debate that comes around often, especially during playoff time, when one team comes out of a game 7 to meet a team that finished a sweep over a week before. Is too much rest a bad thing? From this data it does appear that 2 days provides the optimal balance between hitting your stride while you're hot and resting your sore legs.
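The first two highlights can be re-derived directly from the table. The snippet below only re-uses the published percentages (it does not refit any model); the dictionary layout and function names are my own choices for the example.

```python
# Win probability (%) for team A, keyed by (A plays at home, A rest days, B rest days),
# copied from the table above.
WIN_PROB = {
    (True, "1", "1"): 59.1,   (False, "1", "1"): 40.9,
    (True, "2", "1"): 65.2,   (False, "2", "1"): 47.4,
    (True, "3+", "1"): 62.6,  (False, "3+", "1"): 44.6,
    (True, "1", "2"): 52.6,   (False, "1", "2"): 34.8,
    (True, "2", "2"): 59.1,   (False, "2", "2"): 40.9,
    (True, "3+", "2"): 56.4,  (False, "3+", "2"): 38.3,
    (True, "1", "3+"): 55.4,  (False, "1", "3+"): 37.4,
    (True, "2", "3+"): 61.7,  (False, "2", "3+"): 43.6,
    (True, "3+", "3+"): 59.1, (False, "3+", "3+"): 40.9,
}
REST_LEVELS = ("1", "2", "3+")

def home_gap(rest_a, rest_b):
    """Percentage-point gap between playing at home and on the road, rest days held fixed."""
    return round(WIN_PROB[(True, rest_a, rest_b)] - WIN_PROB[(False, rest_a, rest_b)], 1)

def best_road_scenario():
    """Rest-day combination giving the road team its highest win probability."""
    road = {key[1:]: prob for key, prob in WIN_PROB.items() if not key[0]}
    return max(road.items(), key=lambda item: item[1])

gaps = {(a, b): home_gap(a, b) for a in REST_LEVELS for b in REST_LEVELS}
print("home/road gap ranges from", min(gaps.values()), "to", max(gaps.values()))

(rest_a, rest_b), prob = best_road_scenario()
uplift = round(prob - WIN_PROB[(False, "1", "1")], 1)
print(f"best road case: {rest_a} rest days vs {rest_b}, {prob}% (uplift {uplift} points)")
```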

Teams' optimal rest days

What is true for the league isn't necessarily true for individual teams. I wanted to check if all teams preferred to rest 2 days instead of 1 or 3+ days. Were younger teams eager to have back-to-back games? Were older teams dreading tight schedules?

Team   Significant   Optimal rest days
NBA    Yes           2
ATL    Yes           2
BOS    No            3
CHA    Yes           2
CHI    Yes           2
CLE    No            2
DAL    No            3
DEN    Yes           3
DET    Yes           3
GSW    Yes           1
HOU    No            2
IND    Yes           2
LAC    Yes           2
LAL    Yes           1
MEM    Yes           2
MIA    No            2
MIL    Yes           1
MIN    Yes           1
NJN    Yes           2
NOH    Yes           3
NYK    No            3
OKC    No            3
ORL    No            2
PHI    No            2
PHO    Yes           3
POR    Yes           1
SAC    No            1
SAS    No            2
TOR    Yes           3
UTA    No            3
WAS    Yes           3


Upon close inspection there does not seem to be any strong correlation between a team's age and its preferred number of rest days. Sure, Boston is an old team preferring three or more days and Golden State is one of the youngest teams, performing best on back-to-back games, but the Lakers are an old team also preferring back-to-back games and the Wizards are a young team with their best odds after 3+ days of rest.

To conclude, while rest days do influence performance in different ways for different teams, homecourt advantage remains the most impactful variable for outcome prediction of a game.



Monday, June 11, 2012

Thunder VS Heat: Stormy match-up

Now that the two final contenders are known, it's time for the final predictions of the 2012 NBA season!

On Sekou Smith's Hang Time Blog the experts favor Oklahoma City 5 votes to 1, but what do the stats say?

The same model that was used to correctly predict the Thunder in 5 against the Lakers, and had slightly favored the Spurs in 7 over the Thunder in 6, gives a small advantage to OKC given its track record and homecourt advantage, but the margin is extremely close:

Winner   Num games   Probability
OKC      4           6.4%
MIA      4           6.0%
OKC      5           13.8%
MIA      5           11.3%
OKC      6           15.0%
MIA      6           16.3%
OKC      7           17.0%
MIA      7           14.4%

So if I had to put my money down, it would be for the Thunder in 7 as 3 NBA.com experts claimed.
But be careful, Miami in 6 is a very close possibility!
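To see how thin the margin is, the scenario probabilities can be summed per team. The snippet only aggregates the percentages from the table above; the names are made up for the example, and the totals add up to slightly more than 100% because the published values are rounded.

```python
from collections import defaultdict

# (winner, number of games) -> probability in %, copied from the table above.
SCENARIOS = {
    ("OKC", 4): 6.4,  ("MIA", 4): 6.0,
    ("OKC", 5): 13.8, ("MIA", 5): 11.3,
    ("OKC", 6): 15.0, ("MIA", 6): 16.3,
    ("OKC", 7): 17.0, ("MIA", 7): 14.4,
}

def series_totals(scenarios):
    """Sum the scenario probabilities per winner."""
    totals = defaultdict(float)
    for (winner, _games), prob in scenarios.items():
        totals[winner] += prob
    return dict(totals)

totals = series_totals(SCENARIOS)
most_likely = max(SCENARIOS, key=SCENARIOS.get)
for team, prob in totals.items():
    print(f"{team}: {prob:.1f}%")
print("most likely single scenario:", most_likely)
```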

Saturday, June 9, 2012

Is Dexter getting better?

My latest entries have mostly been basketball-focused but the highly anticipated playoff matchups are to be blamed for that!

So after my post on the Johnny Depp - Tim Burton collaboration, I would like to take a stab at tracking the evolution of TV series. I think there is a broad range of questions that can be considered, such as:

How does the rating of individual episodes evolve throughout the course of a TV series' lifetime? Are there really "good" and "bad" seasons? Do TV series get cancelled when the ratings go down by too much? Is there a common threshold? Do all seasons have high-rated cliffhangers at the end of the season?

Data

Similarly to the Johnny Depp analysis, I will be extracting my data from IMDB. To start off, I will focus on one particular TV series (and a personal favorite): Dexter.

Plot

Let us plot the evolution of the individual episode ratings by "time":
Two main insights stand out:
  • there appears to be a "seasonal" pattern within each season:
    - the ratings either stay flat or go down in the first few episodes
    - the ratings then shoot upwards during the second half of the season
  • after 5 seasons of overall similar quality, it appears that the last season has not performed as well. For the first time in over 5 years ratings dropped below 8.0, and even the strong season finale was the lowest-rated finale of the series.
If we compare the distribution of season 6's ratings with those of all prior seasons, the difference jumps out:
The best season 6 episode has a rating barely greater than the median rating of all past seasons!

I will start looking at other TV series and see if an overall low-rated season is the beginning of the end (hopefully not!). Do networks quickly panic and cancel shows as soon as they start dropping in overall quality?


Wednesday, May 30, 2012

Home-court advantage in the NBA

We have heard the concept countless times and almost take it for granted.

Home-court advantage. "They should win the next since they're playing at home." "They managed to steal one on the road."

But doesn't it all come down to Team A versus Team B? If A is a better team it should win no matter the location. The court has the same dimensions, the baskets are identical, the shots have the same likelihood of falling in. It's not like being server or receiver in a tennis game.

Or is it? There have actually been quite a few studies around home-court advantage in an attempt to tease out the external factors that could cause it. Many potential causes have been brought forward: the home crowd of course, cheering when the home team gains momentum. The fact that the players on the home team can sleep at home instead of being in a hotel. Familiarity with the locker rooms and the facilities in general. Mostly psychological explanations that are difficult to measure accurately.

Or just consider the distractions when shooting free-throws:


I don't have a degree in psychology so I will tackle this from the data point of view, and try to see how playing at home can impact the game's outcome.


Data

I looked at all NBA games (regular season only) for the past two years. I didn't go further back as other factors could have come into play such as the differences in team composition.


Methodology

For each team, and for the league in general, I computed the empirical probabilities of winning at home and on the road. The values can be directly computed from a two-by-two table of win/loss VS home/away. I chose to approach the problem via a logistic model, which provides exactly the same point estimates but in addition provides confidence intervals, which in turn allow me to determine whether the homecourt advantage is significant or not.
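For a model with a single home/away indicator, the logistic coefficient is simply the log odds ratio of the two-by-two table, and a Wald confidence interval can be computed by hand. The sketch below is an assumption-laden illustration, not the code used for this post: the 95% level is my choice, and the win/loss counts are back-computed from the published percentages assuming 82 home and 82 away games over the two seasons.

```python
import math

def home_advantage_test(home_wins, home_losses, away_wins, away_losses, z=1.96):
    """Empirical win probabilities plus a Wald interval for the log odds ratio.

    With a single home/away indicator, the fitted logistic coefficient equals
    log(odds of winning at home / odds of winning away).
    """
    p_home = home_wins / (home_wins + home_losses)
    p_away = away_wins / (away_wins + away_losses)
    log_odds_ratio = math.log((home_wins * away_losses) / (home_losses * away_wins))
    # Standard error of the log odds ratio from the four cell counts.
    se = math.sqrt(1 / home_wins + 1 / home_losses + 1 / away_wins + 1 / away_losses)
    ci = (log_odds_ratio - z * se, log_odds_ratio + z * se)
    significant = ci[0] > 0 or ci[1] < 0  # the interval excludes "no effect"
    return p_home, p_away, ci, significant

# Counts inferred from the percentages (an assumption, not the raw data):
# Denver 67-15 at home and 36-46 away, Dallas 57-25 at home and 55-27 away.
for team, counts in {"DEN": (67, 15, 36, 46), "DAL": (57, 25, 55, 27)}.items():
    p_home, p_away, ci, significant = home_advantage_test(*counts)
    print(f"{team}: home {p_home:.1%}, away {p_away:.1%}, "
          f"log-odds CI ({ci[0]:.2f}, {ci[1]:.2f}), significant: {significant}")
```

A full logistic regression package would return the same coefficient and a very similar interval; the closed form is used here only to keep the example dependency-free.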


Results

If an NBA team plays against another NBA team, how does the probability of victory change depending on where the game is played?

With a simple model where the only variable is where the game is played, I obtained that for the NBA in general, playing at home offers a +19.8% winning probability (59.9% VS 40.1%). This was a very significant uplift. There we have it, homecourt advantage is very present in the NBA.

Does this +20% hold for all NBA teams?

Here the variability is much greater, ranging from +37.8% for the Denver Nuggets (81.7% at home vs 43.9% away) to only 2.4% for the Dallas Mavericks (69.5% at home VS 67.1% away).
Full team-by-team table is at the end of the post.

Does any team play better away than at home?

No, all teams play better at home, even if by only a little such as the Dallas Mavericks where the uplift is only 2.4%.

Is homecourt advantage statistically significant for all teams?

Homecourt advantage was significant for all teams except 8: Dallas Mavericks (+2.4%), Miami Heat (+3.7%), Boston Celtics (+9.8%), Philadelphia 76ers (+9.8%), Oklahoma City Thunder (+11.0%), Sacramento Kings (+11.0%), Houston Rockets (+13.4%), New York Knicks (+13.4%). These are not necessarily all good or bad teams, just teams that are just as good (or bad) away as at home.


Closing thoughts

Going back to Denver and its overwhelming homecourt advantage (+37.8%) compared to the second-best advantage of +32.9% for the Los Angeles Clippers, one might wonder if the altitude isn't the Nuggets' biggest fan...


In a later post I will explore how the model can be tweaked to take rest time between games into account. Does more rest improve your winning probability?


Appendix: Full table

Team   Home win %   Away win %   Delta %   Significance
DAL    69.5%        67.1%        2.4%      No
MIA    65.9%        62.2%        3.7%      No
BOS    69.5%        59.8%        9.8%      No
PHI    46.3%        36.6%        9.8%      No
OKC    69.5%        58.5%        11.0%     No
SAC    35.4%        24.4%        11.0%     No
HOU    58.5%        45.1%        13.4%     No
NYK    50.0%        36.6%        13.4%     No
MIN    26.8%        12.2%        14.6%     Yes
CLE    57.3%        40.2%        17.1%     Yes
LAL    78.0%        61.0%        17.1%     Yes
POR    68.3%        51.2%        17.1%     Yes
UTA    64.6%        47.6%        17.1%     Yes
ORL    76.8%        58.5%        18.3%     Yes
PHO    67.1%        47.6%        19.5%     Yes
NBA    59.9%        40.1%        19.8%     Yes
CHI    73.2%        52.4%        20.7%     Yes
NJN    32.9%        11.0%        22.0%     Yes
ATL    70.7%        47.6%        23.2%     Yes
DET    46.3%        23.2%        23.2%     Yes
MIL    61.0%        37.8%        23.2%     Yes
SAS    79.3%        56.1%        23.2%     Yes
MEM    64.6%        40.2%        24.4%     Yes
TOR    50.0%        25.6%        24.4%     Yes
NOH    63.4%        37.8%        25.6%     Yes
WAS    42.7%        17.1%        25.6%     Yes
IND    57.3%        26.8%        30.5%     Yes
CHA    63.4%        31.7%        31.7%     Yes
GSW    53.7%        22.0%        31.7%     Yes
LAC    53.7%        20.7%        32.9%     Yes
DEN    81.7%        43.9%        37.8%     Yes



Wednesday, May 23, 2012

NBA: Spurs VS Thunder

Time for a playoff update now that the two contenders for the Western Finals are known.

Everybody wants to see Spurs VS Heat, but before then comes the Thunder hurdle, and despite the Spurs' impressive performance up until now, this is definitely going to be a tough challenge.

The method is exactly the same as in my previous post (which correctly identified Thunder beating the Lakers in 5 as the most likely scenario!). Entering the latest numbers in the model, here's what came out:

Probability of Spurs winning the series: 55.6%

Series breakdown:

Winner Number of games Probability
Spurs 4 6.6%
Thunder 4 5.3%
Spurs 5 15.8%
Thunder 5 9.5%
Spurs 6 14.3%
Thunder 6 16.9%
Spurs 7 19.0%
Thunder 7 12.7%

The three most likely scenarios are:
Spurs in 7 (19.0%), Thunder in 6 (16.9%) and Spurs in 5 (15.8%).
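
For readers who want to see how a breakdown like this can be derived, here is a minimal sketch. It assumes games are independent, a 2-2-1-1-1 home/away pattern, and two per-game win probabilities (one at home, one on the road) for the team with homecourt advantage. The function name and the probabilities used are hypothetical; they are not the values fitted by the model in this post.

```python
# Venue pattern of a 2-2-1-1-1 best-of-seven series, seen from the team
# holding homecourt advantage (True = that team plays at home).
HOME_PATTERN = (True, True, False, False, True, False, True)

def series_breakdown(p_home, p_away, wins_a=0, wins_b=0):
    """Exact outcome probabilities of a best-of-seven series.

    p_home / p_away: assumed probability that team A (the team with
    homecourt advantage) wins a single game at home / on the road.
    wins_a / wins_b: games already won, to start from a mid-series score.
    Returns {(winner, total_games): probability}.
    """
    outcomes = {}

    def play(a, b, prob):
        if a == 4 or b == 4:
            key = ("A" if a == 4 else "B", a + b)
            outcomes[key] = outcomes.get(key, 0.0) + prob
            return
        p = p_home if HOME_PATTERN[a + b] else p_away
        play(a + 1, b, prob * p)        # team A wins the next game
        play(a, b + 1, prob * (1 - p))  # team B wins the next game

    play(wins_a, wins_b, 1.0)
    return outcomes

# Hypothetical per-game probabilities, chosen only for illustration.
breakdown = series_breakdown(p_home=0.60, p_away=0.45)
for (winner, games), prob in sorted(breakdown.items(), key=lambda kv: kv[0][1]):
    print(f"Team {winner} in {games}: {prob:.1%}")
p_series = sum(p for (w, _), p in breakdown.items() if w == "A")
print(f"Team A wins the series: {p_series:.1%}")
```

Passing wins_a and wins_b lets the same function start from a series already in progress, for example a 2-0 lead.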

For the overall playoffs, the latest numbers suggest the West is the championship favorite for now:

NBA team Champion Probability
SAS 34.4%
OKC 25.7%
MIA 21%
BOS 12%
IND 4.6%
PHI 2.4%

Verdict in the upcoming weeks!

Thursday, May 17, 2012

Lakers - Thunder Series

This post is actually an expanded comment to Sekou Smith's Hang Time Blog on nba.com concerning the Lakers - Thunder series.

This series is one everybody has been waiting for since the start of the season.

Experience VS youth.
Kobe VS Kevin.





A blowout in game 1.
An incredible comeback in game 2.

What's in store for the next 2 + X games?

Five nba.com experts on Sekou's blog give their predictions after 2 games: one says 4 games, two say 5, and two say 6.

But what do the stats say?

I recently updated my model from the last two posts (here and here) in two ways. First, homecourt advantage is now incorporated (another post is coming soon on this topic: how we can quantify it, whether all teams have a significantly higher probability of winning at home than on the road, and which teams have the greatest delta between home games and away games). Second, the model now provides more detail on each series: not only the probability of one team winning it, but also the breakdown by the number of games the series will last.

Which is exactly what I did here for the Lakers - Thunder series.

And now for the results:

Winner Number of games Probability
Thunder 4 22.2%
Thunder 5 30.4%
Lakers 6 5.8%
Thunder 6 17.2%
Lakers 7 9.5%
Thunder 7 14.9%

So Thunder in 5 is actually the most likely scenario, followed by Thunder in 4 and Thunder in 6. Overall, if you're a Lakers fan you should feel depressed, with the Lakers having only a 15.3% probability of facing the Spurs. But I have to admit that I haven't factored the Kobe-back-against-the-wall variable into my model :-)

Let me know your thoughts!

Tuesday, May 15, 2012

2012 NBA Playoffs: Updated forecasts


What a first round this has been!

Things were wrapped up rather quickly in the East, including the surprising elimination of the #1 team Chicago Bulls, surprising at least until we saw the following video:




Meanwhile, the West was really the wild wild west and gave us some thrilling comebacks and two stressful game sevens.

Chicago was the favorite to win the Championship after the first two games of the playoffs with an estimated probability of victory of 17.9%. Its elimination has freed up some room but for whom?

Oklahoma City, San Antonio and Miami were the runners-up, and while the names of the next three teams haven't changed, their order has:

NBA team Champion Probability
MIA 21.5%
OKC 21.2%
SAS 19.3%
BOS 10.2%
LAC 7.9%
IND 7.3%
PHI 6.6%
LAL 6%


However, the results are slightly biased for now, in the sense that Miami and Oklahoma City won their round 2 openers whereas San Antonio still hasn't played Game 1 against the Clippers. If San Antonio were to win that game, it would jump right back to the top spot with a 24.6% probability of clinching the Larry O'Brien trophy, more than 3 percentage points ahead of Miami and Oklahoma City.
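
The post does not describe how series probabilities are turned into title probabilities. One common approach, sketched here purely as an assumption, is to walk up the bracket: a team's chance of winning a round is its chance of getting there times its chance of beating each possible opponent, weighted by how likely that opponent is to show up. The team names, strengths and the ratio-based series model below are all hypothetical placeholders.

```python
def title_probabilities(bracket, win_prob):
    """Probability that each team emerges as winner of a knockout bracket.

    bracket: a team name, or a pair (left, right) of sub-brackets.
    win_prob(a, b): assumed probability that team a wins a series against b.
    Returns {team: probability of winning this bracket}.
    """
    if isinstance(bracket, str):
        return {bracket: 1.0}
    left = title_probabilities(bracket[0], win_prob)
    right = title_probabilities(bracket[1], win_prob)
    result = {}
    for side, other in ((left, right), (right, left)):
        for team, p_reach in side.items():
            # Weight each possible matchup by the opponent's chance to get there.
            p_win = sum(p_opp * win_prob(team, opp) for opp, p_opp in other.items())
            result[team] = p_reach * p_win
    return result

# Hypothetical strengths; the ratio model stands in for a real series model.
strength = {"T1": 4.0, "T2": 3.0, "T3": 2.0, "T4": 1.0}

def ratio_model(a, b):
    return strength[a] / (strength[a] + strength[b])

bracket = (("T1", "T4"), ("T2", "T3"))
for team, prob in title_probabilities(bracket, ratio_model).items():
    print(f"{team}: {prob:.1%}")
```

In a live forecast, win_prob would come from the series model itself and would depend on the current score of each series, which is why a single Game 1 result can reshuffle the ranking.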

More updates at the end of round 2!

Monday, May 14, 2012

The Johnny Depp / Tim Burton collaboration

I don't think anybody could have remained oblivious to the new Dark Shadows movie coming out:



Yet another Johnny Depp / Tim Burton collaboration; it seems those two have been in the movie business forever! Edward Scissorhands, Sleepy Hollow, Charlie and the Chocolate Factory, Alice in Wonderland, now this!

So this raises the question: why? What do I mean by "why"? Well, do the two just really like working together, or have they both determined that their partnership is mutually beneficial in terms of the quality of the movies they create together?

I pulled IMDB data for Johnny Depp and Tim Burton separately, focusing only on Johnny Depp as an actor and Tim Burton as a director (did you know he was in the list of actors for M.I.B. 3 ???), and labelled the movies as either "Common movies", "Johnny Depp only" or "Tim Burton only".

Collaboration VS Solo

Here is a graph summarizing the IMDB ratings of the movies in each of the three categories:



The above plot is called a boxplot and is a quick way to compare sets of data. The bold horizontal line is the median, and the gray rectangles represent the interquartile range (25th to 75th percentile), meaning that only 25% of movies have a rating greater than the top of the rectangle, and only 25% have a rating less than the bottom of the rectangle. The dashed lines (called "whiskers") give an idea of the spread of the most extreme values.
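
To make the box itself concrete, here is a small sketch of the three numbers that define it. The ratings are made up for illustration and are not the IMDB data used in the plot; the "inclusive" interpolation method is one of several conventions, and plotting tools may use a slightly different one.

```python
import statistics

# Made-up ratings, purely to illustrate the computation (not IMDB data).
ratings = [5.0, 6.0, 6.5, 7.0, 7.5, 8.0, 9.0]

# Cut the sorted data into four equal parts: Q1, median, Q3.
q1, median, q3 = statistics.quantiles(ratings, n=4, method="inclusive")
print(f"Q1={q1}, median={median}, Q3={q3}, IQR={q3 - q1}")
# prints: Q1=6.25, median=7.0, Q3=7.75, IQR=1.5
```

The bottom and top of the gray rectangle correspond to Q1 and Q3, and the bold line to the median.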

For instance, we see that the median value for "Common movies" is around 7.5, and the data is rather concentrated (no movie had a rating better than 8, and none worse than 6.5).

Now comparing to the movies Johnny and Tim did solo, we see that while their joint work did not produce their best-rated movies (8.2 with Platoon for Johnny, and 8.4 with Vincent for Tim), it definitely limited risks with no movies rated below 6.5, whereas at least 25% of the movies Johnny or Tim did by themselves were rated below 6.5.

What about gross revenue?

Another comment about the boxplot: the round circles represent 'outliers' in the sense that they are values way beyond the spread observed in the data.
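
The post does not say which rule flags a point as an outlier. A widespread convention, assumed here, is to flag anything more than 1.5 times the interquartile range beyond the box; the whiskers then stop at the most extreme values that are not flagged. The helper name and the revenue figures below are made up for illustration.

```python
import statistics

def boxplot_outliers(values, k=1.5):
    """Split values into (inside, outliers) using the k * IQR fence rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    inside = [v for v in values if low <= v <= high]
    outliers = [v for v in values if v < low or v > high]
    return inside, outliers

# Made-up revenues in millions of dollars (not the actual box-office data).
revenues = [10, 30, 50, 70, 90, 110, 130, 150, 1000]
inside, outliers = boxplot_outliers(revenues)
print("whiskers reach:", min(inside), "to", max(inside))
print("outliers:", outliers)
```

With these numbers the single very large value is drawn as a circle, while the whiskers only span the remaining values.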

For "Common Movies", the outlier is Alice in Wonderland which generated just over a billion dollars, despite being, ironically, their worst-rated movie together at 6.5!

For Johnny Depp, the data is quite interesting: all his solo movies seem to have generated less than 150 million dollars, except four which made 4 to 7 times that amount. No surprises here, all four are Pirates of the Caribbean movies. I'm sure Johnny Depp's bank account is looking forward to the fifth installment!

As for Tim Burton, the movies he did with and without Johnny Depp have very similar profiles.

Ratings and box office revenue can be combined in a scatterplot:


On a side note, it is interesting to see from the above graph the relationship between IMDB rating and revenue for the Johnny-Tim collaboration (blue dots): the greater the revenue, the lower the rating!
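
A relationship like this can be summarized with a correlation coefficient. The sketch below computes Pearson's r by hand on made-up revenue and rating pairs (not the actual data behind the plot); a value close to -1 would mean that higher revenue goes with lower ratings almost linearly.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up (revenue in millions, rating) pairs, for illustration only.
revenue = [50, 100, 200, 400, 1000]
rating = [8.0, 7.8, 7.5, 7.2, 6.5]
print(f"r = {pearson_r(revenue, rating):.2f}")
```

With only a handful of movies in the collaboration, such a coefficient is descriptive at best and says nothing about causation.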

Evolution over time

The natural follow-up question is how this collaboration fits with the historical trends for both Johnny Depp and Tim Burton.

In terms of ratings, the collaboration had a significant impact on movie quality for Johnny Depp at the beginning of his career but very little in recent years (which is what the boxplots hinted at earlier), whereas the impact was essentially insignificant for Tim Burton (though it's interesting to notice that five of Tim Burton's last six movies were with Johnny Depp).


From a revenue perspective, the collaborative Alice in Wonderland generated as much as the Pirates of the Caribbean series for Johnny, whereas that same movie and Charlie and the Chocolate Factory were Tim Burton's two biggest revenue-generating movies.



Closing conclusions


If I were to summarize the previous findings in one sentence, it would be that Johnny Depp is currently repaying Tim Burton, who made him known in Hollywood early in his career with highly rated movies (Edward Scissorhands and Ed Wood), by starring in two huge hits (dollar-wise).

All this being said, it will be very interesting to see how well Dark Shadows performs and how it fits in with the current trends...