So in addition to boardgames, I am also a big NBA fan, and luckily NBA and stats mix really well.
A topic that has often been covered is how to rank players. Which player has the greatest impact? Who is the greatest player of all time (well, that one's easy ;-) )?
The debates rage because the questions are so vague and open to interpretation. What does it mean for a player to be a better basketball player than another? Does it mean having better stats? If I score more, rebound more, assist more, steal more and turn the ball over less, am I clearly better than you? Another approach is to compare win/loss records when the player plays or sits out, although for most players it will be difficult to have a good sample size for the "sitting out" observations.
A newer metric that addresses this "sitting out" problem is the plus/minus statistic, which tracks how the score margin changes while a player is on the floor. So say a player enters the game with the game tied at 10, and leaves it with his team ahead by 5. That's +5 for him. He re-enters the game with his team ahead by 10, and leaves it (without coming back) with his team only ahead by 2. That's -8. With the earlier +5 that's an overall -3 for that player in that game.
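The plus/minus arithmetic from that example can be written down in a few lines. This is only a sketch of the bookkeeping: the function name and the representation of a stint as a (margin at entry, margin at exit) pair are my own choices for illustration, not an official definition.

```python
def plus_minus(stints):
    """Sum the change in team margin over every stint a player is on the floor.

    Each stint is a (margin_at_entry, margin_at_exit) pair, where the margin is
    the player's team score minus the opponent's score.
    """
    return sum(exit_margin - entry_margin for entry_margin, exit_margin in stints)


# The example from the text: enters with the game tied (margin 0) and leaves
# up 5, then re-enters up 10 and leaves up 2.
stints = [(0, 5), (10, 2)]
print(plus_minus(stints))  # -3
```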
Today I wanted to look at a different approach, a more statistical approach. I have no clue where it will lead me, but after different trials and errors and tweaking here and there I hope to come up with a new interesting way to rank players.
Ultimately, what we care most about is wins. Sure it's great to score 100 in a game, but if you lose that game that's just wasted effort. So the idea is to find a relationship between a player's efforts and the impact they have on the game. In other words, how do a player's stats in a game change the probability of winning the game?
In terms of the data, I looked at the past six seasons (not including 2011-2012), and for each player looked at his stats with the game outcome for all games played. I only considered players with at least 50 wins and 50 losses, playoffs not included.
As our variable of interest is a probability (of winning the game), we naturally turn towards a logistic regression, fitted separately for each player. We are not directly modeling the probability as a linear combination of the covariates but rather the log odds: log(P(win) / (1 - P(win))). The interpretation of the coefficients will not be entirely straightforward but will still allow us to rank players. Which player has the greatest coefficient, and therefore the greatest impact on the log odds and thus on the probability of winning the game?
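To make the model concrete, here is a minimal sketch of a one-covariate logistic regression for a single player, fitted with Newton-Raphson in plain Python. The function name and the toy data (ten made-up games) are assumptions for illustration only; this is not the code or the data behind the rankings in this post, and in practice a statistics package would do the fitting.

```python
import math


def fit_logistic(x, y, iterations=25):
    """Fit log(p / (1 - p)) = b0 + b1 * x by Newton-Raphson.

    x: one covariate value per game (e.g. points scored)
    y: 1 if the game was won, 0 if it was lost
    Returns the pair (b0, b1).
    """
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        # Gradient of the log-likelihood (g0, g1) and entries of X'WX (h..).
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            z = max(-30.0, min(30.0, b0 + b1 * xi))  # clip to avoid overflow
            p = 1.0 / (1.0 + math.exp(-z))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det == 0.0:
            break
        # Newton step: beta += (X'WX)^(-1) * gradient, using the 2x2 inverse.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Toy data, invented for illustration: points scored and game outcome.
points = [8, 10, 12, 15, 18, 20, 22, 25, 28, 30]
wins = [0, 0, 0, 1, 1, 0, 1, 1, 0, 1]

b0, b1 = fit_logistic(points, wins)
print("intercept:", b0, "slope per extra point:", b1)
```

The slope b1 is the number used for ranking: it is the estimated change in the log odds of winning for each additional unit of the covariate.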
Well it depends on our covariates. Since we do want to find an easy way to rank, it's best to only consider one covariate.
Let's naively only consider points scored. How does scoring an extra point improve the log odds?
The top 5 impactful players are (in order): Calvin Booth, Greg Ostertag, Antonio Davis, Anderson Varejao and Bruce Bowen.
Points might be too restrictive, since a player can have an impact without scoring. So let's consider (points + rebounds + steals + assists + blocks - turnovers), referred to from here onwards as "all metrics", as the covariate.
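As a quick illustration of the "all metrics" covariate, the sketch below computes it for one game. The dictionary keys and the stat line are hypothetical, made up only to show the arithmetic.

```python
def all_metrics(game):
    """Points + rebounds + steals + assists + blocks - turnovers for one game."""
    return (
        game["points"]
        + game["rebounds"]
        + game["steals"]
        + game["assists"]
        + game["blocks"]
        - game["turnovers"]
    )


# Hypothetical stat line, invented for illustration.
game = {
    "points": 20,
    "rebounds": 8,
    "steals": 2,
    "assists": 5,
    "blocks": 1,
    "turnovers": 3,
}
print(all_metrics(game))  # 33
```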
The top 5 impactful players are (in order): Bruce Bowen, Calvin Booth, Eddie Griffin, Kevin Durant, Antonio Davis.
If we suspect that a player's impact is difficult to capture with simple metrics alone, we can take minutes played as a proxy for everything observed and unobserved (good defense, good picks...).
In this last case, the top 5 impactful players are (in order): Kevin Durant, Othella Harrington, Gerald Wallace, Eddie Griffin, Zach Randolph.
Where are the superstars?
It's interesting to see the same names come up, and to notice that aside from Kevin Durant, none of the players have superstar status.
Talking of superstars, where are they in the rankings?
Out of the 479 players considered, here is how some superstars ranked respectively for points, all metrics, and minutes played:
Kobe Bryant: 363, 333, 479
LeBron James: 247, 101, 475
Kevin Garnett: 396, 416, 478
As I mentioned, I am discovering these results in almost real time with you, and I am still a little unclear on how to interpret them myself. There are a lot of things that could hurt the analysis, namely the fact that the coefficients are here interpreted as "change in log odds for an additional unit increase in the covariate". But an additional point for Kobe isn't exactly the same thing as an additional point for Eddie Griffin.
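One way to see why the coefficients are not entirely straightforward to read: the same change in log odds translates into different changes in win probability depending on where you start. The sketch below uses a hypothetical coefficient of 0.1 (not a value estimated in this post) and two made-up baseline probabilities.

```python
import math


def sigmoid(z):
    """Map log odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Log odds of a probability p."""
    return math.log(p / (1.0 - p))


def prob_after_one_more_unit(baseline_p, coef):
    """Win probability after adding `coef` to the log odds of `baseline_p`."""
    return sigmoid(logit(baseline_p) + coef)


coef = 0.1  # hypothetical coefficient, for illustration only
for baseline_p in (0.5, 0.9):
    new_p = prob_after_one_more_unit(baseline_p, coef)
    print(baseline_p, "->", round(new_p, 4))
```

In both cases the odds are multiplied by the same factor, exp(0.1), but the gain in probability is larger at a 50% baseline than at a 90% baseline.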
A similar issue is pointed out in "SuperFreakonomics" about ranking good surgeons. Looking at patient death rates, for instance, can be misleading because of selection bias. People with more critical conditions will go see the better surgeons, and because of their condition they are more likely to die, which pushes those surgeons' death rates up. Bad doctors who only see healthy patients will have impeccable track records. Similarly in basketball, it could be argued that when the game is on the line a team goes to its superstars, who then have to play exceptionally well to win the game, whereas the whole bench might be put in when the game has already been won for a while.
There is definitely room for improvement, but I will continue to explore this approach to try to identify lesser known players that have strong yet unnoticeable impacts on the game.