Showing posts with label correlation. Show all posts

Friday, February 14, 2014

Easy tips for Hollywood Producers 101: Cast Leonardo DiCaprio! (but cast him quick!)

My wife and I were thinking of going out to see "The Wolf of Wall Street" the other day. Why did this movie catch our eye more than the other twelve or so showing at our local movie theatre?


Had we heard great reviews about it? Nope. Had word-of-mouth finally reached us? Nope. Had we fallen prey to a cleverly engineered marketing campaign? Well yes and no.

We don't own a TV, so our exposure to TV ads was minimal, to say the least. And as far as I can remember (although one could argue that this is exactly the purpose of sophisticated inception-style marketing) we didn't see many out-of-home ads or hear any radio ones. Marketing was involved, but the genius of the marketers behind "The Wolf of Wall Street" came down to the poster, and not because of the monkey in a business suit or the naked women, but simply because Leonardo DiCaprio was right there on it. My wife's and my reasoning was simply that any movie with Leonardo in it had to be good.

Now don't go and write us off as Titanic groupies/junkies. I won't deny we both enjoyed that movie, but we don't have posters of him plastered all over our house. But here's the question we found ourselves asking: Can you name a bad movie with Leonardo in it? And harder yet: a bad recent movie with him in it?

Made you pause for a second there didn't it? Few people will argue against the fact that Leonardo is a very good actor and that his movies are generally pretty darn good. But are we being totally objective here? How does Leonardo's filmography compare to that of other big stars? The Marlon Brandos, Al Pacinos, De Niros, Brad Pitts...?

In order to compare actors' filmographies, I turned to my favorite database, IMDB. IMDB has a rather peculiar way of indexing actors, directors and producers, and I was unable to find any logic linking an individual to their index in the database. But I did notice that all the big actors I wanted to compare Leonardo to had a low index (never above 400), so I decided to pull data for all indices below 1000. In the process I picked up some directors and some actors with very few movies, so I excluded from the analysis anybody who had acted in fewer than 10 movies. The advantage of pulling the data this way is that it provided wide diversity in gender, geography and time. So we have Fred Astaire, Marlene Dietrich, Louis de Funès, Elvis Presley... And to get an even broader picture, I added 30 young rising stars to the mix. All in all, 826 actors to compare Leonardo to.

Going back to our original question of how good Leonardo is, I looked at two simple metrics: the ratio of movies with an IMDB rating greater than 7, and the ratio of movies with an IMDB rating greater than 8. So how well did Leonardo do? The mean ratio across the actors was 22% for the 7+ rating (median 20%). Leonardo had... 55%! That's 16 out of his 29 movies! Only 15 actors have a higher score. Top of the list? Bette Davis, with 76 of her 91 movies (83.5%) having a 7+ rating. The recently deceased Philip Seymour Hoffman also beat Leonardo with 31 out of 52 (59.6%). Fun fact: which male actor has the best ratio of all time here? You have to think outside the box for this one, as he's more famous for directing than acting, yet makes an appearance in almost every one of his movies. That's right, Sir Alfred Hitchcock has 28 of 36 movies (77.8%) rated higher than 7.

| Name | Number of movies | Number of 7+ movies | Ratio of 7+ movies |
| --- | --- | --- | --- |
| Bette Davis | 91 | 76 | 83.5% |
| Alfred Hitchcock | 36 | 28 | 77.8% |
| François Truffaut | 14 | 10 | 71.4% |
| Emma Watson | 14 | 10 | 71.4% |
| Bruce Lee | 25 | 16 | 64.0% |
| Terry Gilliam | 16 | 10 | 62.5% |
| Andrew Garfield | 13 | 8 | 61.5% |
| Alan Rickman | 44 | 27 | 61.4% |
| Frank Oz | 31 | 19 | 61.3% |
| Daniel Day-Lewis | 20 | 12 | 60.0% |
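For anyone who wants to play along at home, the two metrics are easy to sketch. This is only an illustration, not the code behind the tables: the function name `share_above` and the ratings list are made up, and I am assuming "greater than 7" means strictly greater. The last two lines just re-derive percentages from counts quoted in the post.

```python
def share_above(ratings, threshold):
    """Fraction of ratings strictly greater than `threshold`."""
    if not ratings:
        raise ValueError("need at least one rating")
    return sum(r > threshold for r in ratings) / len(ratings)


# Made-up ratings for a hypothetical actor (NOT real IMDB data).
example_ratings = [6.1, 7.4, 8.3, 5.9, 7.8, 6.6, 8.1, 7.2, 6.9, 7.5]

share_7 = share_above(example_ratings, 7.0)
share_8 = share_above(example_ratings, 8.0)
print(f"7+ share: {share_7:.1%}, 8+ share: {share_8:.1%}")

# Re-deriving two percentages from the counts quoted in the post.
leo_share_7 = 16 / 29    # Leonardo DiCaprio: 16 of 29 movies rated 7+
bette_share_7 = 76 / 91  # Bette Davis: 76 of 91 movies rated 7+
print(f"Leonardo: {leo_share_7:.1%}, Bette Davis: {bette_share_7:.1%}")
```

Swapping in a real list of ratings for any actor gives both ratios in one go.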

What about movies rated higher than 8? Leonardo does even better on this metric! The average actor has only 2.7% (median 1.7%) of movies with such a high rating. Leonardo has 5 out of 29, or 17.2%! Only 8 actors do better on this metric. Bette Davis is no longer at the top (she plummets to 3.2%); Grace Kelly (3 out of 11, 27.3%) now tops the chart. Sir Alfred is impressive once again with 9 out of 36 (25%).

| Name | Number of movies | Number of 8+ movies | Ratio of 8+ movies |
| --- | --- | --- | --- |
| Grace Kelly | 11 | 3 | 27.3% |
| Alfred Hitchcock | 36 | 9 | 25.0% |
| Anthony Daniels | 12 | 3 | 25.0% |
| Chris Hemsworth | 12 | 3 | 25.0% |
| Terry Gilliam | 16 | 3 | 18.8% |
| Elizabeth Berridge | 11 | 2 | 18.2% |
| Elijah Wood | 56 | 10 | 17.9% |
| Groucho Marx | 23 | 4 | 17.4% |
| Leonardo DiCaprio | 29 | 5 | 17.2% |
| Quentin Tarantino | 24 | 4 | 16.7% |

Now the big stars we mentioned earlier do pretty well, just not as well as Leonardo:

| Name | Number of movies | Number of 7+ movies | Number of 8+ movies | Ratio of 7+ movies | Ratio of 8+ movies |
| --- | --- | --- | --- | --- | --- |
| Marlon Brando | 40 | 18 | 4 | 45.0% | 10.0% |
| Brad Pitt | 48 | 23 | 6 | 47.9% | 12.5% |
| Robert De Niro | 91 | 32 | 8 | 35.2% | 8.8% |
| Leonardo DiCaprio | 29 | 16 | 5 | 55.2% | 17.2% |
| Clint Eastwood | 59 | 19 | 6 | 32.2% | 10.2% |
| Morgan Freeman | 69 | 24 | 7 | 34.8% | 10.1% |
| Robert Downey Jr. | 68 | 17 | 1 | 25.0% | 1.5% |

Another thing worth repeating to put these numbers in perspective: we are not comparing Leonardo to your "average" Hollywood actor. Because of the way IMDB has matched actors with indices, we are comparing Leonardo to some of the greatest of all time here!

Remember how earlier on we mentioned that it was even harder to find a recent bad movie with Leonardo? Let's look at his movie ratings over time to confirm this impression:


Wow. With the exception of J. Edgar in 2011, every single one of his movies since 2002 (that's over the last decade!) has had a rating greater than 7! 12 movies!

Now one might argue that there is a virtuous circle here: the bigger a star you become, the easier it is to get scripts and parts for great movies, and so the easier it becomes to remain a superstar. For each actor in my dataset, I ran a quick linear regression of movie rating against time to measure improvement. Leonardo stands out here quite a bit too, for he is among the rare actors to show positive improvement. The "average" actor's movies lose 0.01 IMDB rating points per year, while Leonardo's gain 0.1 per year, putting him in the top 15 of the data set:

| Name | Number of movies | Number of 8+ movies | Number of 7+ movies | Improvement |
| --- | --- | --- | --- | --- |
| Taylor Kitsch | 11 | 1 | 2 | 0.26 |
| Rooney Mara | 11 | 1 | 4 | 0.26 |
| Justin Timberlake | 18 | 0 | 3 | 0.23 |
| Chloe Moretz | 24 | 1 | 6 | 0.19 |
| Bradley Cooper | 26 | 0 | 7 | 0.18 |
| Juliet Anderson | 50 | 1 | 12 | 0.17 |
| Mila Kunis | 23 | 1 | 3 | 0.16 |
| Chris Hemsworth | 12 | 3 | 7 | 0.15 |
| Tom Hardy | 27 | 3 | 12 | 0.14 |
| Andrew Garfield | 13 | 0 | 8 | 0.12 |
| Mia Wasikowska | 21 | 0 | 9 | 0.12 |
| Barbara Bain | 14 | 1 | 2 | 0.11 |
| George Clooney | 38 | 1 | 15 | 0.11 |
| Jason Bateman | 32 | 2 | 9 | 0.11 |
| Leonardo DiCaprio | 29 | 5 | 16 | 0.10 |

What's quite surprising in the last table is that those topping the list in terms of year over year improvement are not the old well-established actors having great choice in scripts, but the new hot generation in Hollywood!
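The "improvement" number is the slope of a straight line fitted through an actor's ratings over time. Here is a minimal sketch of that idea using ordinary least squares, assuming one (year, rating) pair per movie. The helper `ols_slope` and the filmography are hypothetical, and the regression behind the table may have been set up differently.

```python
def ols_slope(years, ratings):
    """Least-squares slope of rating against year (rating points per year)."""
    n = len(years)
    if n < 2 or n != len(ratings):
        raise ValueError("need two or more (year, rating) pairs")
    mean_x = sum(years) / n
    mean_y = sum(ratings) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    if sxx == 0:
        raise ValueError("all movies share the same year")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, ratings))
    return sxy / sxx


# Made-up filmography for a hypothetical actor (NOT real IMDB data).
films = [(2000, 6.0), (2002, 6.4), (2004, 6.3), (2006, 6.9), (2008, 7.0)]
years = [year for year, _ in films]
ratings = [rating for _, rating in films]

improvement = ols_slope(years, ratings)
print(f"Improvement: {improvement:.3f} rating points per year")
```

A positive slope means later movies tend to be rated higher; a negative one, like the average actor's -0.01, means ratings drift down over a career.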

What happens to the megastars? Well let us look at the rating evolution of some of these stars:

Fred Astaire:

Marlon Brando:

Bette Davis:

It appears that they all go through some glory days. Remember Leonardo and his 12 movies rated above 7 in 12 years, J. Edgar aside? Well, Bette Davis had 46 such movies, without any exceptions, over a span of 29 years! But it was not a great way to end a career... The same goes for Fred Astaire and Marlon Brando: they started off doing well, but ends of careers are tough even for big stars, or might I say especially for big stars. Naturally, the hidden question is whether ratings of later movies go down because the actors aren't as good as they were, or because good roles don't come as often and they only get cast as grumpy grandparents in bad comedies. Correlation vs causation...

So back to Leonardo. He's still young, so the first impulse my wife and I had of "Leonardo's in it so it's got to be good" was not completely irrational, but it might be 5 or 10 years from now. The same goes for all the rising top stars. Cast them while they're hot, 'cause nothing is eternal in Hollywood.



Tuesday, January 15, 2013

Does chocolate consumption increase the number of articles on correlation VS causation?




On October 18th 2012, the New England Journal of Medicine published a paper linking chocolate consumption and Nobel prizes by country. I’m sure you’ve heard about the study: it pretty much went viral after that, and people found it cool to post and re-post on Facebook, Twitter…

There have also been countless articles on the topic, and while the methodology and approach of the original paper are highly debatable (and actually there is quite some talk about the paper being a joke from the New England Journal of Medicine), I still found something very troubling in how the results of the paper were paraphrased by journalists and friends.

To recall the setup, Franz Messerli (who happens to be Swiss) looked at chocolate consumption per capita and number of Nobel Laureates per 10 million people across 23 countries, and found the following results:



All 23 data points, with the slight exception perhaps of Sweden and Germany, seem to fall on a straight line! And Pearson’s correlation coefficient of 0.791 is statistically significant!
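To see why 0.791 across 23 countries counts as statistically significant, here is a rough sketch: Pearson's coefficient computed by hand, and the usual t-statistic for testing whether a correlation differs from zero. The function names are mine, the five data points are made up, and 2.080 is the standard two-sided 5% critical value for 21 degrees of freedom. This checks the arithmetic only, and does not reproduce the paper's analysis.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant sequence has no defined correlation")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t-statistic for H0 'true correlation is zero', with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))


# Toy data (made up): a perfectly linear relationship gives r = 1.
toy_r = pearson_r([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])

# Figures reported in the post: r = 0.791 across 23 countries.
t_paper = t_statistic(0.791, 23)
T_CRITICAL_DF21 = 2.080  # two-sided 5% critical value, 21 degrees of freedom
print(f"toy r = {toy_r:.3f}, t for the paper = {t_paper:.2f}")
print("significant at 5%:", t_paper > T_CRITICAL_DF21)
```

Significance only says the straight line is unlikely to be a fluke of these 23 points. It says nothing about what causes what.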

I haven’t read the paper myself but the last part seems to attempt to explain the phenomenon, namely by the increased cognitive abilities derived from higher chocolate intake.

But as I mentioned, the headlines of articles reporting the findings hurt me the most, ranging from the conservative interpretation (Does chocolate make you clever?, and Eating chocolate may help you win Nobel Prize) to the absurd leapfrog conclusions (Why Do the Swiss Eat So Much Chocolate?, Secret to Winning a Nobel Prize? Eat More Chocolate).


Of course, journalists need to shoot for the sensational, but the confusion between correlation and causation is a very real and widespread one, and so I thought I would use the Chocolate / Nobel Prize example to illustrate how the two concepts differ.


  • The first hypothesis could be that this was just a random coincidence and that countries not satisfying the “model” were dropped out (think of all the top IQ countries).

  • An alternative (third variable) could be that countries in northern Europe tend to eat more chocolate and tend to have more laureates per capita. Do they have longer and colder winters? Because of the long winters, do you tend to eat more chocolate to improve morale? Because of the long winters, do you tend to spend more time studying than if you had a warm sunny beach outside? And haven’t sociologists linked long cold winters to increased suicide rates and crimes? (this is what led me to take a closer look at crime-related variables)






To do so, I will first replicate the original analysis as closely as possible, but also look at some additional metrics.

I was not able to pull the original dataset, so I searched the web as best I could to find chocolate consumption numbers and Nobel prizes per capita.


As boasted in the paper, I also got a great correlation, albeit not as strong (0.658 instead of 0.791). First little red flag: it appears that the number is quite volatile and rather data-sensitive. While it is hard to lie about Nobel Prize laureates, one might wonder whether there is even such a thing as an official chocolate consumption database.

Step 2 consisted of looking at another metric which, according to the paper, is theoretically tied to winning Nobel Prizes: cognitive ability. Of course, the first proxy that comes to mind for this metric is IQ. I therefore replicated the analysis looking at the correlation between chocolate consumption and IQ.


Wow, the value dropped to 0.279! Even more troubling: only one of the top IQ countries (Japan) was part of the original analysis, as there was no chocolate consumption data for the others. Second red flag: why were only 23 countries included in the original paper?
So chocolate makes you smart enough to win a Nobel, but not enough to increase IQ?

But why settle on Nobel prize laureates and IQ, why not look at a whole set of metrics and test their correlations with chocolate consumption? With feminine intuition from my wife, I started focusing on crime-related metrics, and here are the results:

Kidnappings, correlation = 0.11


Drug offenses, correlation = 0.42


Rape victims, correlation = 0.45


What about total crimes per capita?



Correlation of… 0.72 !!! Even higher than for Nobel Prize! Stop eating that chocolate right now, or you will end up in jail before you know it!

So taking a step back, what have we shown? Well, the first point is that if you compare enough metrics to each other, you are bound, out of pure chance, to come up with a high correlation that does not mean anything. Just like if you flip a coin enough times (in the sense of: quit your job right now) you are bound to get 20, 30, even 100 consecutive heads. Does it mean anything? No, you just wasted your whole life for nothing.
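The "compare enough metrics" point is easy to check with a small simulation. The sketch below draws one random "chocolate" variable for 23 imaginary countries, correlates it with 1,000 other random metrics that have nothing to do with it, and counts how many look significant anyway. Everything here is made up: the seed, the number of metrics and the helper names are arbitrary choices. The 0.413 cut-off is the approximate |r| needed for p < 0.05 (two-sided) with 23 points.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(2013)  # fixed seed so the run is repeatable
N_COUNTRIES = 23
N_METRICS = 1000           # arbitrary number of unrelated metrics to try
R_CRITICAL = 0.413         # approx. |r| for p < 0.05 (two-sided) when n = 23

# One random "chocolate consumption" value per imaginary country.
chocolate = [rng.gauss(0, 1) for _ in range(N_COUNTRIES)]

# Correlate it with metrics that are pure noise by construction.
correlations = [
    pearson_r(chocolate, [rng.gauss(0, 1) for _ in range(N_COUNTRIES)])
    for _ in range(N_METRICS)
]

strongest = max(abs(r) for r in correlations)
n_significant = sum(abs(r) > R_CRITICAL for r in correlations)
print(f"strongest spurious |r|: {strongest:.2f}")
print(f"'significant' metrics: {n_significant} out of {N_METRICS}")
```

Roughly one metric in twenty should clear the bar by luck alone, so the more metrics you try, the more impressive your best "finding" will look.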

The second important point is that there could be some link between the two metrics in question. But it does not mean that metric 1 impacts metric 2! Nor does it mean that metric 2 impacts metric 1! (The article never wonders whether Nobel prize laureates could have enough appetite for chocolate to raise their country's average consumption.) It could very well be that there is a third metric impacting both metric 1 and metric 2. The example I like best from my stats class is: does carrying a lighter increase your risk of getting lung cancer? Of course not, why would it? Even if I told you there was a strong correlation between the two? Hmmmm. But here’s the thing: without knowing it, you are actually comparing smokers to non-smokers. Smokers are more likely to carry lighters around, and are at greater risk for lung cancer. But carrying a lighter does not cause lung cancer, and having lung cancer certainly doesn’t make you more prone to carrying a lighter!
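The lighter example can be simulated too. In the sketch below, smoking drives both lighter-carrying and lung cancer, while the lighter itself has no effect on cancer at all. All the probabilities are invented for illustration and are not real epidemiological figures.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable
N_PEOPLE = 100_000

# Invented probabilities, for illustration only.
P_SMOKER = 0.25
P_LIGHTER = {True: 0.80, False: 0.10}  # chance of carrying a lighter, by smoker status
P_CANCER = {True: 0.15, False: 0.01}   # chance of lung cancer, by smoker status

people = []
for _ in range(N_PEOPLE):
    smoker = rng.random() < P_SMOKER
    lighter = rng.random() < P_LIGHTER[smoker]
    cancer = rng.random() < P_CANCER[smoker]  # depends on smoking only
    people.append((smoker, lighter, cancer))


def cancer_rate(group):
    """Share of people in `group` who have lung cancer."""
    group = list(group)
    return sum(cancer for _, _, cancer in group) / len(group)


# Naive comparison: lighter carriers versus everyone else.
rate_with = cancer_rate(p for p in people if p[1])
rate_without = cancer_rate(p for p in people if not p[1])
print(f"all people: lighter {rate_with:.3f} vs no lighter {rate_without:.3f}")

# Same comparison within smokers and within non-smokers.
within = {}
for smoker in (True, False):
    with_lighter = cancer_rate(p for p in people if p[0] == smoker and p[1])
    without_lighter = cancer_rate(p for p in people if p[0] == smoker and not p[1])
    within[smoker] = (with_lighter, without_lighter)
    label = "smokers" if smoker else "non-smokers"
    print(f"{label}: lighter {with_lighter:.3f} vs no lighter {without_lighter:.3f}")
```

Across everyone, lighter carriers look far more at risk. Split the population into smokers and non-smokers and the gap all but disappears, because the lighter was only ever a stand-in for smoking.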



OK, back to chocolate. What’s going on behind the scenes and behind the nice straight line? As discussed previously, one of two phenomena could be at work: pure chance, or a third variable driving both chocolate consumption and Nobel prizes.

Go ahead and eat all the chocolate you want, but hold off on booking that flight to Oslo…

I would also like to point out this article by James Winters and Sean Roberts, which features a surprisingly similar analysis to mine (looking at IQ, then at serial killers). Nobody copied anybody; I was just relieved to see I wasn’t the only one fighting for correlation VS causation!