Wednesday, April 15, 2015

Predicting Playoff Success

From Alan Ryder's Ten Laws of Hockey Analytics:
One important warning - do not confuse correlation with causation.  The former is easy to prove, the latter is quite challenging.  For example, carry-in zone entries yield more scoring chances than do dump-in zone entries.  But this could mean that a carry-in is evidence of better neutral zone puck control rather than a cause of better offensive zone puck control.
Which of these variables do you think is the best predictor of playoff series winners in the NHL between 1984 and 1990?  In other words, if you were betting on matchups back then and could only look up one stat for each team to influence your decision, which is the one that would most frequently point to the eventual victor?
  1. Goals For
  2. Goals Against
  3. Shot Differential
  4. Team Shooting Percentage
  5. Ratio of Shorthanded Goals For vs. Against
It's gotta be #3, right, based on what we know about the importance of possession?  Or maybe #1 or #4, since offence had to be important in a league that was wide open and high-scoring?  Or perhaps that old saw about defence winning championships held true, and it was really #2?  The one that seems most out of place is #5, a variable measuring rare events that doesn't take into account anything that happens during the game's most frequent and important situation (even strength).

But if we look at the numbers, we get some surprising results:

Predicting Playoff Series Winners, 1984 to 1990:
  1. Shorthanded Goal Ratio:  70-34-1, 67.3%
  2. Goals For:  68-35-2, 66.0%
  3. Shooting Percentage:  66-39-0, 62.9%
  4. Shot Differential:  62-43-0, 59.0%
  5. Goals Against:  43-61-1, 41.3%
The best predictor of the five was in fact a team's shorthanded goal ratio.  Obviously I cherry-picked the results a bit: there were other variables that were even better predictors, including some that are well known to be associated with winning teams.  Yet even they weren't all that much better than a variable that most would probably dismiss as almost completely irrelevant:

Overall Goal Ratio:  77-28-0, 73.3%
Winning Percentage:  75-29-1, 72.1%
Win % in Games Decided by 3+ Goals:  74-31-0, 70.5%
Team PDO:  71-34-0, 67.6%
Shorthanded Goal Ratio:  70-34-1, 67.3%
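For anyone who wants to check the arithmetic, the percentages in both lists are consistent with dividing correct picks by decided picks and leaving the third number out of the denominator.  The sketch below assumes the third number counts series where the two teams were level on the variable, so no pick was made; that reading, and the helper name pick_rate, are assumptions for illustration only.

```python
def pick_rate(wins, losses, ties=0):
    """Share of decided picks that were correct.

    Assumption: `ties` counts series where the variable favoured neither
    team, so they are left out of the denominator.
    """
    decided = wins + losses
    return wins / decided


# Records as listed in the post: (correct picks, wrong picks, no pick).
records = {
    "Overall Goal Ratio": (77, 28, 0),
    "Winning Percentage": (75, 29, 1),
    "Win % in Games Decided by 3+ Goals": (74, 31, 0),
    "Team PDO": (71, 34, 0),
    "Shorthanded Goal Ratio": (70, 34, 1),
    "Goals For": (68, 35, 2),
    "Shooting Percentage": (66, 39, 0),
    "Shot Differential": (62, 43, 0),
    "Goals Against": (43, 61, 1),
}

# Print the variables from best to worst predictor.
ranked = sorted(records.items(), key=lambda kv: pick_rate(*kv[1]), reverse=True)
for name, (w, l, t) in ranked:
    print(f"{name}: {w}-{l}-{t}, {pick_rate(w, l, t):.1%}")
```

Every record sums to 105 series, which is a quick sanity check that each variable was tested on the same set of matchups.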

Obviously teams in the 1980s were not winning playoff series solely on the basis of the number of shorthanded goals they scored, and being good on the PK has never been even close to as important as being good at even strength.  What we're looking at here is an example of a variable that appears to be a great predictor not because of its own impact, but because it is a proxy for other factors that actually lead to success.  During the period, top teams like Wayne Gretzky's Edmonton Oilers (who won a ton of playoff series in the mid- to late-'80s) often used their best players as penalty killers, and those players were good enough to be able to frequently score even when facing a manpower disadvantage.  There was also much less parity, meaning that there were large differences in scoring ability across rosters.  Overall, in an offence-driven league with a huge gap between the top and bottom of the standings, it turns out that good teams could be reliably identified just by looking at shorthanded goals for and against.

It should be noted that there are more advanced statistical methods that might give a better perspective on whether a variable is merely correlated with success or whether it might actually cause success.  However, as Ryder pointed out, it remains difficult to say with certainty not only what causes winning today, but also whether it will keep causing winning in the future.

All this is to say that I am in full agreement with other hockey analysts that the NHL's SAP analysis is almost assuredly the result of junk science, where somebody dumped every possible variable into a regression model until they got the best possible result.  A success rate of 85% for something as random as the NHL playoffs indicates a pretty clear case of overfitting.  I'm of course interested to see what exactly they included in the model if they ever release the full details, but I'd bet it won't be much more successful in predicting future winners than much simpler analyses based on a few key factors.  Those factors not only have a decent track record of predictive success, but are also ones that we intuitively know are likely to be very important to a hockey team's overall results, because they have a more direct relationship with goals scored and allowed.  I'm sure some of the 37 components selected by the SAP data scientists are little more than today's version of shorthanded goal ratio: variables that by chance happened to correlate with success for a relatively brief period of time but were never the underlying drivers of anything significant.

After all, if somebody had run my analysis above in 1990 and convinced themselves that shorthanded goal ratio was their secret predictive weapon for NHL playoff wagering, they would have lost their shirt over the next three years.  From 1991 to 1993, the team with the better shorthanded goal ratio went just 18-27.
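To see how easily this happens, here is a minimal simulation sketch.  It is not the SAP model or anything like it.  It assumes 37 candidate variables that are pure coin flips, picks whichever one looked best over 105 past series (the size of the 1984 to 1990 sample), and then checks that same variable on 45 later series (the size of the 1991 to 1993 sample).  The function names, the number of trials and the seed are arbitrary choices for illustration.

```python
import random


def best_of_noise(n_vars=37, n_past=105, n_future=45, rng=random):
    """Pick the best-looking of n_vars coin-flip 'predictors'.

    Each variable picks every series correctly with probability 0.5,
    so none of them carries any real information.
    Returns (in-sample hit rate, out-of-sample hit rate) of the winner.
    """
    past = [[rng.random() < 0.5 for _ in range(n_past)] for _ in range(n_vars)]
    future = [[rng.random() < 0.5 for _ in range(n_future)] for _ in range(n_vars)]
    past_rates = [sum(picks) / n_past for picks in past]
    # Choose the variable with the best record on the past series only.
    best = max(range(n_vars), key=lambda i: past_rates[i])
    return past_rates[best], sum(future[best]) / n_future


def average_over_trials(n_trials=200, seed=0):
    """Repeat the experiment and average both hit rates."""
    rng = random.Random(seed)
    results = [best_of_noise(rng=rng) for _ in range(n_trials)]
    avg_in = sum(r[0] for r in results) / n_trials
    avg_out = sum(r[1] for r in results) / n_trials
    return avg_in, avg_out


avg_in, avg_out = average_over_trials()
print(f"Best variable on past series:   {avg_in:.1%}")
print(f"Same variable on future series: {avg_out:.1%}")
```

The winning variable looks clearly better than a coin flip on the series used to select it, and then falls back to roughly 50% on the series it has never seen.  Real hockey stats are not pure noise, so this only isolates the selection effect, but it is the same trap.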

In conclusion, while some sense of history is useful in developing a model, just be aware of the risks of running correlations on dozens of different variables.  Odds are you are going to find something that looks much more important than it actually is, or seems to add better information when it really adds nothing at all.


  1. A model with 37 predictors is inevitably riddled with cross-correlations. It would be rare for a good model to rely on more than 5 parameters. got to a 4 factor model (actually 3 factors) And has developed a similar 5 factor model.
