It all started in baseball, when Bill James found a very elegant formula linking a baseball team's winning percentage to the number of runs it scored and allowed:

expected win percentage =

runs_scored ^ 2 / (runs_scored ^ 2 + runs_allowed ^ 2)

Because the variables are raised to the second power, the formula became known as the "Pythagorean expectation formula".

In 1994, Daryl Morey, one of the biggest proponents of analytics in basketball and now GM for the Houston Rockets, adapted the formula for basketball teams. The overall structure remains the same, but the power of 2 was replaced by 13.91. Here's the formula as it appeared in STATS Basketball Scoreboard:

Essentially the same formula as for baseball but with 13.91 as the power:

expected win percentage =

pts_scored ^ 13.91 / (pts_scored ^ 13.91 + pts_allowed ^ 13.91)
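In code, the formula is a one-liner. Here's a minimal sketch (the 105/100 points-per-game figures below are made up purely for illustration):

```python
def pythagorean_expectation(scored, allowed, exponent=13.91):
    """Morey's expected win percentage from average points scored and allowed."""
    return scored**exponent / (scored**exponent + allowed**exponent)

# Hypothetical team scoring 105 points per game while allowing 100:
print(round(pythagorean_expectation(105, 100), 3))  # 0.663
```

Note that the formula only depends on the ratio of points scored to points allowed, and that a team scoring exactly as much as it allows is predicted to win exactly half its games.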

In this post, I wanted to further explore this formula and answer questions such as: How accurate is it? It was based on data up until 1993-1994; is it still accurate with today's data? Are there other, more accurate formulas out there?

To start off, I extracted all relevant statistics by team and by year for the past 15 complete seasons, going from 1999-2000 to 2013-2014.

Let's start by looking at how accurate Daryl's formula is when applied to these last 15 seasons:

Well, the formula still applies quite well, to say the least! Of course, the exact coefficient might be slightly off, so I fit the same model using the more recent data. The fitted value for the exponent turned out to be 13.86. Despite all the rule changes over the past twenty-plus years (three free throws on three-point fouls, hand-checking, clear path...) and the fact that the early nineties are regarded as a completely different era of basketball from today's (somewhat linked to the rule changes), the value is almost identical: less than a 0.4% difference!
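Since the team-season data isn't reproduced here, the fitting procedure can be sketched on synthetic data with the same structure: generate (scored, allowed, win percentage) triples from a known exponent, then recover it with a crude grid search over candidate exponents.

```python
import numpy as np

# Synthetic stand-in for the 450 team-season rows (15 seasons x 30 teams);
# the noise term stands in for luck and schedule effects.
rng = np.random.default_rng(0)
scored = rng.uniform(92, 110, 450)
allowed = rng.uniform(92, 110, 450)
true_k = 13.86
win_pct = scored**true_k / (scored**true_k + allowed**true_k)
win_pct += rng.normal(0, 0.02, 450)

# Grid search over exponents in steps of 0.05, minimizing residual sum of squares.
ks = np.linspace(2, 30, 561)
rss = [np.sum((win_pct - scored**k / (scored**k + allowed**k))**2) for k in ks]
k_hat = ks[np.argmin(rss)]
print(round(k_hat, 2))  # recovers a value close to 13.86
```

A proper nonlinear least-squares routine would be more efficient, but with a single parameter the grid search is hard to beat for transparency.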

But back to the formula. It seems to perform remarkably well at fitting the data, but can we do better? There is room for additional flexibility: in Morey's formula, all three terms are raised to the same power. What if points scored and points allowed could be raised to different powers?

expected win percentage =

pts_scored ^ **a** / (pts_scored ^ **a** + pts_allowed ^ **b**)

Or if all three terms could have different powers?

expected win percentage =

pts_scored ^ **a** / (pts_scored ^ **c** + pts_allowed ^ **b**)

When fitting these new, more flexible models, it turns out that the fitted coefficients remain very close to 14. Naturally, with the additional flexibility we observe a decrease in residual sum of squares, but nothing extravagant either. We'll revisit this point later in the post.
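Here's a sketch of the three fits side by side, again on synthetic data standing in for the real team-season rows, using scipy's nonlinear least-squares routine:

```python
import numpy as np
from scipy.optimize import curve_fit

def single_power(X, a):
    s, al = X
    return s**a / (s**a + al**a)

def double_power(X, a, b):
    s, al = X
    return s**a / (s**a + al**b)

def triple_power(X, a, b, c):
    s, al = X
    return s**a / (s**c + al**b)

# Synthetic stand-in for the team-season data, generated from the
# single-power model with exponent 14 plus noise.
rng = np.random.default_rng(0)
s = rng.uniform(92, 110, 450)
al = rng.uniform(92, 110, 450)
y = single_power((s, al), 14.0) + rng.normal(0, 0.01, 450)

results = {}
for model, p0 in [(single_power, [14.0]),
                  (double_power, [14.0, 14.0]),
                  (triple_power, [14.0, 14.0, 14.0])]:
    params, _ = curve_fit(model, (s, al), y, p0=p0)
    rss = float(np.sum((y - model((s, al), *params))**2))
    results[model.__name__] = (params, rss)
    print(model.__name__, np.round(params, 2), round(rss, 4))
```

Since the models are nested (setting a = b = c recovers the single-power formula), the more flexible fits can only lower the residual sum of squares, which is exactly why raw RSS is a poor basis for choosing between them.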

But let's step back for a minute: what exactly does the exponent value correspond to? For points scored and allowed ranging from 90 to 110, I generated three charts displaying expected win percentage for exponent values of 2, 14 and 50.

We notice that the exponent controls how slowly/quickly the surface goes to 0 and 1 as the difference between points scored and allowed increases. When points scored is 100 and points allowed is 95, the win percentages are 52% (exponent of 2), 67% (exponent of 14) and 93% (exponent of 50).
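Those three numbers are easy to reproduce directly from the formula:

```python
def win_pct(scored, allowed, k):
    """Pythagorean expected win percentage with exponent k."""
    return scored**k / (scored**k + allowed**k)

# Same point: 100 scored, 95 allowed, three exponents.
for k in (2, 14, 50):
    print(k, f"{win_pct(100, 95, k):.3f}")  # prints 2 0.526, 14 0.672, 50 0.929
```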

But something else stands out in all three graphs: they appear invariant in one direction, the first diagonal (going through the points (90, 90) and (110, 110)). In other terms, a team allowing 90 points and scoring 93 has an expected winning percentage only very slightly different from that of another team scoring 110 and allowing 107.

**What truly matters is the delta between points allowed and scored, not the absolute value of these two numbers.**
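Checking the two example teams with the exponent of 14:

```python
def win_pct(scored, allowed, k=14):
    """Pythagorean expected win percentage with exponent k."""
    return scored**k / (scored**k + allowed**k)

print(f"{win_pct(93, 90):.3f}")    # scores 93, allows 90  -> 0.613
print(f"{win_pct(110, 107):.3f}")  # scores 110, allows 107 -> 0.596
```

(Strictly speaking, the formula depends on the ratio of points scored to points allowed, which is why the two values differ slightly; over the 90-110 range, equal deltas give nearly equal ratios, hence the apparent diagonal invariance.)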

Let's do some very simple exploratory data analysis looking at actual win percentages against points scored and allowed:

As one would have expected, there is definitely some correlation there, especially regarding points allowed (which could be an additional argument for promoting a strong defense over a strong offense, but that is a topic for another time).

But things get really interesting when we look at the difference between points scored and allowed:

You don't often come across a correlation of 0.97 just by plotting two random variables from your data set against each other! It looks like someone cut a rectangular stencil out of cardboard, placed it over an empty plot, and asked their 3-year-old to go polka-dot-crazy within the rectangular region. Can this strong relationship be leveraged into an alternative formula to Morey's?

A simple linear model begs to be fit, but would it also make sense to add a quadratic or cubic term? A quadratic term (or any even power, for that matter) does not seem reasonable: deltas of 5 and -5 suggest VERY different types of performances, so only odd powers should be considered. Here's the plot with the fits from a single linear term and with an additional cubic term:
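Both fits are ordinary least squares on a hand-built design matrix (restricted to odd powers of the delta). A sketch on synthetic deltas, since the actual team data isn't reproduced here:

```python
import numpy as np

# Synthetic point differentials and noisy win percentages (illustrative values).
rng = np.random.default_rng(0)
delta = rng.uniform(-10, 10, 450)  # points scored minus points allowed
win_pct = 0.5 + 0.03 * delta + rng.normal(0, 0.03, 450)

# Design matrices: intercept + delta, and intercept + delta + delta^3.
X_lin = np.column_stack([np.ones_like(delta), delta])
X_cub = np.column_stack([np.ones_like(delta), delta, delta**3])
beta_lin, *_ = np.linalg.lstsq(X_lin, win_pct, rcond=None)
beta_cub, *_ = np.linalg.lstsq(X_cub, win_pct, rcond=None)

rss_lin = np.sum((win_pct - X_lin @ beta_lin)**2)
rss_cub = np.sum((win_pct - X_cub @ beta_cub)**2)
print(round(rss_lin, 3), round(rss_cub, 3))  # the cubic RSS can only be lower or equal
```

Keeping only odd powers guarantees the fitted curve passes symmetrically through the 50% mark: a team outscored by exactly as much as it outscores is treated the same in both directions.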

We've now come to the point where we have five models: the three "Pythagorean" models with various degrees of flexibility (which I'll refer to as the single/double/triple power models based on how many coefficients are fit) and two linear models (with and without the cubic term). Can one be established as significantly superior to the others? Will Morey's formula hold?

Of course the easiest way to compare them would be to look at the fits and compare residual sums of squares, but this will always favor the more complex models and lead to the overfitting problems we constantly hear about. So how do we go about it? Simply the way overfitting is dealt with in the abundant literature: cross-validation. The data is randomly split into training and testing datasets; the model is fit on the training data but evaluated on test data it has never seen before.
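The procedure is straightforward to sketch for the single-power model (synthetic data again, one random split; repeating over many splits gives the RSS distributions compared below):

```python
import numpy as np

# Synthetic stand-in for the team-season data.
rng = np.random.default_rng(0)
n = 450
scored = rng.uniform(92, 110, n)
allowed = rng.uniform(92, 110, n)
y = scored**14 / (scored**14 + allowed**14) + rng.normal(0, 0.02, n)

# One random train/test split.
idx = rng.permutation(n)
train, test = idx[:350], idx[350:]

def rss(k, i):
    """Residual sum of squares of the single-power model on rows i."""
    pred = scored[i]**k / (scored[i]**k + allowed[i]**k)
    return np.sum((y[i] - pred)**2)

# Fit the exponent on the training rows only, then score the held-out rows.
ks = np.linspace(2, 30, 561)
k_hat = ks[np.argmin([rss(k, train) for k in ks])]
print(round(k_hat, 2), round(rss(k_hat, test), 4))
```

The key point is that the test rows play no part in choosing the exponent, so extra model flexibility no longer gets a free pass.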

And the results are in!

Based on my random splits, it seems that while all models perform very similarly, Morey's formula (single power) has a slight advantage. It achieved neither the minimal nor the maximal RSS, but its median RSS was lower than that of all the other models, though not significantly.

So after all this work, we weren't able to come up with a more reliable and robust way to compute expected win percentages than a formula that is over 20 years old!

In the next post we'll dig a little deeper into the data and try to understand the largest discrepancies. In his original paper, what did Daryl Morey mean when he referred to the Chicago Bulls as a lucky team in 1993-1994?