The Statisticator: October 2015

"Cleveland with a 10-0 in the last 2:05...."
"Warriors answering with an 8-0 run over the end of the first and beginning of the second quarter..."

I watched a number of games during the 2014-2015 season, Playoffs included, and couldn't help but notice the number of runs being announced on screen. This was particularly true when Chicago went on long scoring droughts against Cleveland in the Eastern semifinals.

A question that has nagged me all this time, and that I wanted to investigate further, is whether these droughts (or runs, depending on whose side you're on) are natural and expected or, on the contrary, influenced by external factors.

In a previous post I took a closer look at overtimes in the NBA and showed that, because they are an equilibrium caught between two highly unstable states (team A losing by a handful of points on one side, team B losing by a handful of points on the other), overtimes are about three times more likely to occur than one would naively expect. Could the same be said for runs? Once a team has gone on an 8-0 run, is it more likely to push it to 10-0? The 8-0 run could be the result of one team having a much better lineup on the floor, or a player with a particularly hot hand (although the notion of a hot hand is debatable, and one I will probably look into in a future post). Or is the team on the bad end of the run more likely to score, perhaps by calling a timeout to stop the first team's 'mojo' or to set up a specific play with a higher scoring probability?

Data
The first step was to collect as much data as possible. I pulled from nba.com all available games (regular season and playoffs but excluding pre-season), all the way back to the 2009-2010 season.
I split each game into a succession of 'runs' and computed the number of possessions and points for each.
Consider for instance the first few minutes of Game 1 of this year's Western Conference Finals between the Houston Rockets and Golden State Warriors:

After parsing the information, we would get something like:

Rockets had a 2-0 run over 45s
Warriors then had a 2-0 run over 15s
Rockets then had a 7-0 run over 1m51s
...

Although points are what is always reported and what everyone ultimately cares about, I decided to focus on the number of scoring possessions instead. A 7-0 run that results from seven consecutive trips to the free-throw line, with 1 of 2 free throws made each time, is very different from a three-pointer followed by another three-pointer on which the shooter is fouled and completes the four-point play. In the former case the scoring team needs to get (at least) 6 defensive stops; in the latter it needs only one.

So the data would actually look like this:

Rockets score 2 points on 1 scoring possession
Warriors score 2 points on 1 scoring possession
Rockets score 7 points on 4 scoring possessions
...
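The grouping step can be sketched in a few lines of Python. This is only an illustration, not the code used for this post: the function name `split_into_runs` and the input format (one `(team, points)` pair per scoring possession) are assumptions, and the point-by-point breakdown of the Rockets' 7-0 run is made up so that the totals match the example above.

```python
from itertools import groupby

def split_into_runs(scoring_possessions):
    """Group consecutive scoring possessions by the same team into runs.

    `scoring_possessions` is a chronological list of (team, points) pairs,
    one pair per scoring possession (hypothetical input format).
    Returns a list of (team, total_points, n_scoring_possessions) tuples.
    """
    runs = []
    for team, group in groupby(scoring_possessions, key=lambda event: event[0]):
        points = [pts for _, pts in group]
        runs.append((team, sum(points), len(points)))
    return runs

# Hypothetical breakdown: only the run totals come from the example above.
events = [("Rockets", 2), ("Warriors", 2),
          ("Rockets", 2), ("Rockets", 2), ("Rockets", 1), ("Rockets", 2)]

for team, points, possessions in split_into_runs(events):
    print(f"{team} score {points} points on {possessions} scoring possession(s)")
```

`itertools.groupby` only merges adjacent items with the same key, which is exactly the definition of a run: it ends the moment the other team scores.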

So the question we are interested in is: how many consecutive times can a team score uninterrupted?

Preliminary Graphs
Before we jump into any modeling, let us first look at the frequency of uninterrupted scoring possessions:

That's quite a nice shape! It seems that runs of each length occur a little under half as often as runs one scoring possession shorter. We can verify this ratio visually:

Indeed, the frequency of each run length is a remarkably stable ~45% of the frequency of the previous run length.

A natural question is whether there is a difference between regular season games and Playoff games. Defense is supposed to be cranked up a notch, so are longer runs less frequent? In the following graph, the proportion of runs in Playoff games is represented by the red histogram, that of regular season games by the blue histogram, and the overlap is purple.

The little blue tip would suggest slightly more runs of 1 scoring possession in the regular season (and hence slightly more runs of 2+ scoring possessions in the Playoffs), but a Chi-Square test reveals no statistically significant difference between the two histograms.
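For readers who want to run this kind of comparison themselves, here is a minimal sketch of the Pearson Chi-Square statistic on a table of run counts. The exact test setup used for this post is not described, so the function name `chi_square_statistic`, the binning (runs of 1, 2, 3 and 4+ scoring possessions) and the counts are all made up for illustration.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table.

    `table` is a list of rows (e.g. one row per game type), each row holding
    the number of runs observed for each run length.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for row, row_total in zip(table, row_totals):
        for observed, col_total in zip(row, col_totals):
            # Count expected in this cell if run length were independent of game type.
            expected = row_total * col_total / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return statistic, dof

# Made-up counts for runs of 1, 2, 3 and 4+ scoring possessions.
regular_season = [5500, 2500, 1100, 900]
playoffs = [540, 255, 115, 90]

statistic, dof = chi_square_statistic([regular_season, playoffs])
print(f"chi-square = {statistic:.2f} with {dof} degrees of freedom")
```

The statistic is then compared with the Chi-Square distribution for the given degrees of freedom; with 3 degrees of freedom, the 5% critical value is about 7.81. The same function works for tables with more rows, such as a home/road split or one row per quarter.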

We can also split runs by home and road team. Can homecourt provide an additional boost and extend runs?

It appears as if the home team is slightly more likely than the road team to have longer runs, and this time the difference (as small as it appears visually) is significant. Once the home team gets it going and gets the crowd involved, good things happen!

Yet another splitting option is by quarter. Perhaps a team hasn't got its rhythm in the first quarter and is more likely to suffer a run, whereas the defense is a little tougher in the fourth quarter thus limiting scoring opportunities. To avoid overlaying 5 histograms over each other (the fifth being for overtimes), I used lines instead:

Although the lines appear nearly identical on the graph, the Chi-Square test did pick up a significant difference across the associated table, even when dropping runs in overtime (it is harder to get long runs in a 5-minute overtime than in a 12-minute quarter).
It turns out that longer runs are more likely in the fourth quarter than in the first. Either the defense gets a little tired, or the losing team realizes that it needs to step things up quickly to avoid picking up the L.

Model
Having a better sense of how the runs behave, we can apply a little bit of modeling.
Let us assume that the two teams are of similar strength, and that when they have possession of the ball they both have the same probability p of scoring.
Team A just scored, team B now has possession of the ball. What is team B's probability of interrupting A's run? There are theoretically an infinite number of ways for that to happen, the pattern being quite obvious:

B scores (probability = p)
B misses, A misses, B scores (probability = (1-p)(1-p)p)
B misses, A misses, B misses, A misses, B scores (probability = (1-p)(1-p)(1-p)(1-p)p)
...
(B misses, A misses) n times, B scores (probability = (1-p)^(2n) * p)

Reminder: p is the probability of scoring on a team's possession, so it incorporates missing a shot but getting the offensive rebound and shooting again for instance.

Adding all the pieces yields the probability of team A's run to be interrupted:
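Written out as a geometric series (a reconstruction from the cases listed above, with n counting the number of "B misses, A misses" pairs before B scores), the sum would be:

\[
P(\text{interrupted}) = \sum_{n=0}^{\infty} (1-p)^{2n}\, p = \frac{p}{1-(1-p)^2} = \frac{1}{2-p}
\]

For p = 25%, for instance, this gives 1/1.75, or about 57%.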

Let's now look at the probability of extending the run by exactly n more possessions, which we will denote P(n). We will break up this probability as the probability of scoring one more time and then exactly (n-1) times to get the recurrence with P(n-1):

B misses, A scores, A scores exactly (n-1) more times (probability = (1-p)p * P(n-1))
B misses, A misses, B misses, A scores, A scores exactly (n-1) more times (probability = (1-p)(1-p)(1-p)p * P(n-1))
...
(B misses, A misses) k times, B misses, A scores, A scores exactly (n-1) more times (probability = (1-p)^(2k) * (1-p)p * P(n-1))

Adding everything up yields:
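Summing the same kind of geometric series (again a reconstruction from the cases listed above, with k counting the number of "B misses, A misses" pairs before A scores again), the recurrence would read:

\[
P(n) = (1-p)\,p\,P(n-1) \sum_{k=0}^{\infty} (1-p)^{2k} = \frac{(1-p)\,p}{1-(1-p)^2}\, P(n-1) = \frac{1-p}{2-p}\, P(n-1)
\]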

And so, realizing that P(interrupted) is actually P(0), we get the general formula:
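Assuming P(0) = 1/(2-p) and a constant factor of (1-p)/(2-p) between successive terms, which is what summing the cases listed above suggests, the closed form would be:

\[
P(n) = \left(\frac{1-p}{2-p}\right)^{n} P(0) = \frac{(1-p)^{n}}{(2-p)^{n+1}}
\]

As a sanity check, these probabilities sum to 1 over n = 0, 1, 2, ..., and each additional scoring possession multiplies the probability by (1-p)/(2-p), which is about 0.43 for p = 25%.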

Let's get a few curves for various values of p:

I've overlaid the empirical curve (blue curve in bold). It's a little difficult to spot as it is extremely close to the curve with p = 25%.

Here's the plot with just those two curves:

The similarity in the two curves is really impressive!
We can even refine the true value by fitting a model to identify the value of p which best fits our empirical data. The result is 24.2%.
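The fitting method is not spelled out here, so the following is only one plausible way to do it: a least-squares grid search over p. The helper names `model_share` and `fit_p` are hypothetical, the model formula is the closed form P(n) = (1-p)^n / (2-p)^(n+1) with a run of m scoring possessions mapped to n = m - 1, and the "observed" shares are synthetic (generated from the model itself), not the empirical data.

```python
def model_share(p, length):
    """Model share of runs with exactly `length` scoring possessions (length >= 1)."""
    return (1 - p) ** (length - 1) / (2 - p) ** length

def fit_p(observed_shares, grid_size=1000):
    """Return the p on a regular grid that minimizes the squared error
    between the observed shares and the model.

    observed_shares[i] is the share of runs with i + 1 scoring possessions.
    """
    best_p, best_error = None, float("inf")
    for step in range(1, grid_size):
        p = step / grid_size
        error = sum((share - model_share(p, i + 1)) ** 2
                    for i, share in enumerate(observed_shares))
        if error < best_error:
            best_p, best_error = p, error
    return best_p

# Synthetic check: shares generated with p = 0.3 should be fitted back to about 0.3.
synthetic = [model_share(0.3, length) for length in range(1, 9)]
print(fit_p(synthetic))
```

A grid search is crude but easy to verify; with real frequencies one might prefer a maximum-likelihood fit, which weights each run length by how often it was observed.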

But back to our initial problem. Recall that we were trying to determine whether runs occur as frequently as one would expect, or whether there are external factors that make them more or less likely. Our model assumes no such external effects: the probability of any team scoring when it has possession of the ball is a constant p and does not depend on the past (whether team A has scored 0, 1 or 10 consecutive times already), the same way heads will come up 50% of the time with a fair coin even if we've just had a run of 10 heads or 10 tails right before. The fact that theoretical and empirical values match so well under this assumption would suggest that there are no external effects (or that they perfectly compensate each other!), and that when we observe 8-0 or 11-0 runs, we were simply bound to see them occur.
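The memoryless assumption is easy to play with in a small simulation. The sketch below is an illustration under stated assumptions, not the analysis behind this post: possession strictly alternates, every possession scores with the same probability p, and the function names and parameter values (p = 0.25, 400,000 possessions) are arbitrary choices.

```python
import random
from collections import Counter

def simulate_run_lengths(p, n_possessions, seed=0):
    """Simulate alternating possessions with a constant scoring probability p.

    Returns a Counter mapping run length (in scoring possessions) to the
    number of completed runs of that length.
    """
    rng = random.Random(seed)
    counts = Counter()
    team = 0           # team currently holding the ball
    run_team = None    # team that owns the current run
    run_length = 0
    for _ in range(n_possessions):
        if rng.random() < p:              # this possession ends in a score
            if team == run_team:
                run_length += 1           # same team scores again: run extended
            else:
                if run_length:
                    counts[run_length] += 1   # the previous run is now closed
                run_team, run_length = team, 1
        team = 1 - team                   # the ball changes hands either way
    return counts                         # the final, still-open run is dropped

def model_share(p, length):
    """Model share of runs with exactly `length` scoring possessions."""
    return (1 - p) ** (length - 1) / (2 - p) ** length

counts = simulate_run_lengths(p=0.25, n_possessions=400_000)
total = sum(counts.values())
for length in range(1, 6):
    print(length, round(counts[length] / total, 4), round(model_share(0.25, length), 4))
```

Each printed row shows a run length, its simulated share and the model share. No team ever "gets hot" in this simulation, yet long runs still show up at the rate the formula predicts.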

Let's end all the modeling with a fun fact: any idea what the greatest run from these past years has been (with one team remaining scoreless)? 15-0? 19-0? Turns out it was 29-0, by the Cleveland Cavaliers led by LeBron James... before his Miami days. Spanning the first two quarters of the game and almost 9 minutes, the Cavs scored 29 consecutive points on 19 possessions against the Milwaukee Bucks on Dec 6th 2009 (a day short of the Pearl Harbor anniversary!).

Friday, October 16, 2015

A brief history of NBA runs: Do teams really 'get hot'?