Alex Diaz


Linear regression between win percentage and shot metrics

One way to gauge shot metrics is to measure their relationship to wins. We’ll use data from the last four full seasons, 2009-10 to 2013-14, excluding the lockout-shortened 2012-13 season. Including the shortened season does not change the overall conclusions, but the relationships between all variables come out slightly weaker. Two games are missing from the dataset (an OTT-BUF game from 2009-10 and a WSH-CAR game from 2010-11).

Variables used

The data cover regulation time only. Accordingly, we measure regulation win percentage instead of the usual win percentage: a team that wins in overtime or the shootout is not credited with a regulation win. Corsi, Fenwick, and Shot Percentage are defined as usual, each as a team’s share of the relevant events: (Events for) / (Events for + Events against), where the events are all shot attempts for Corsi, unblocked shot attempts for Fenwick, and shots on goal for Shot Percentage. When using Home/Away as a factor (dummy/indicator variable), each team’s season is split into home and away games, and the win percentages refer to home and away regulation win percentages. Because of the two missing games, statistics for the teams involved will differ slightly from calculations using the NHL’s official results.
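The definitions above can be sketched in a few lines of Python. The `Attempts` container, the `share` helper, and the counts are hypothetical, made up purely for illustration; the breakdown of events follows the conventional definitions rather than anything specific to this dataset.

```python
from dataclasses import dataclass


@dataclass
class Attempts:
    """Regulation-time shot attempts by one side (hypothetical container)."""
    goals: int
    saved: int    # shots on goal stopped by the goalie
    missed: int   # attempts that missed the net
    blocked: int  # attempts blocked before reaching the net

    @property
    def shots(self) -> int:
        # Shots on goal: goals plus saves.
        return self.goals + self.saved

    @property
    def fenwick(self) -> int:
        # Unblocked attempts: shots on goal plus misses.
        return self.shots + self.missed

    @property
    def corsi(self) -> int:
        # All attempts: unblocked attempts plus blocked ones.
        return self.fenwick + self.blocked


def share(events_for: int, events_against: int) -> float:
    """Return for / (for + against), the form used for every metric here."""
    total = events_for + events_against
    if total == 0:
        raise ValueError("no events recorded")
    return events_for / total


# Made-up counts for one team and its opponents.
team = Attempts(goals=3, saved=27, missed=12, blocked=14)
opponents = Attempts(goals=2, saved=23, missed=10, blocked=15)

corsi_pct = share(team.corsi, opponents.corsi)
fenwick_pct = share(team.fenwick, opponents.fenwick)
shot_pct = share(team.shots, opponents.shots)
goal_pct = share(team.goals, opponents.goals)
```

In practice the counts would be summed over a team’s whole season (or its home and away halves) before taking the share.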

In all cases, the best predictor by far was the percentage of goals scored. This should not be surprising as winning is defined by outscoring your opponent. However, since goals are fairly rare, we would like to use more common events in analysis; goals are included for the sake of comparison, but we won’t dwell on their predictive value.

First case: All situations

We use all game data and don’t differentiate between home and away. The model being proposed is REGULATION WIN % = β0 + β1 × SHOT METRIC % + ε.

Metric(s)    R²        Adjusted R²
Corsi %      0.2652    0.259
Fenwick %    0.2818    0.2757
Shot %       0.2897    0.2836
Goal %       0.8253    0.8238
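For readers who want to see how an R² like those in the table is obtained, here is a minimal sketch of a one-predictor least-squares fit in plain Python. The function names and the five data points are hypothetical and made up for illustration; they are not taken from the dataset, and the original analysis may well have been run in a statistics package instead.

```python
def fit_simple_ols(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(x, y, intercept, slope):
    """Share of the variance in y explained by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Made-up team-season values: shot metric share and regulation win share.
metric_pct = [0.46, 0.48, 0.50, 0.52, 0.54]
reg_win_pct = [0.30, 0.42, 0.38, 0.47, 0.50]

b0, b1 = fit_simple_ols(metric_pct, reg_win_pct)
r2 = r_squared(metric_pct, reg_win_pct, b0, b1)
```

Each row of the table corresponds to one such fit with a different metric as the predictor.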

Second case: All situations, split by home and away

We use all game data, but split each season into home and away games. The model being proposed is REGULATION WIN % = β0 + β1 × SHOT METRIC % + β2 × HOME/AWAY STATUS + ε.

Metric(s)                R²        Adjusted R²
Corsi % + Home/Away      0.298     0.2921
Fenwick % + Home/Away    0.3124    0.3066
Shot % + Home/Away       0.31      0.3042
Goal % + Home/Away       0.8056    0.804
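Adding home/away status means fitting two predictors plus an intercept, with the status coded as a 0/1 dummy. The sketch below does this in plain Python by solving the normal equations; the helper names, the coding (1 = home, 0 = away), and the eight observations are all assumptions made up for illustration, not values from the dataset.

```python
def solve(a, b):
    """Solve the small linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(predictors, y):
    """Least-squares coefficients [intercept, b1, b2, ...] via X'X b = X'y."""
    design = [[1.0] + list(row) for row in predictors]
    k = len(design[0])
    xtx = [[sum(x[i] * x[j] for x in design) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * yi for x, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


# Made-up rows: (shot metric share, home dummy), one per team half-season.
rows = [(0.52, 1), (0.48, 1), (0.55, 1), (0.45, 1),
        (0.50, 0), (0.47, 0), (0.53, 0), (0.44, 0)]
reg_win_pct = [0.48, 0.40, 0.52, 0.37, 0.36, 0.33, 0.41, 0.30]

intercept, metric_coef, home_coef = fit_ols(rows, reg_win_pct)
residuals = [yi - (intercept + metric_coef * m + home_coef * h)
             for (m, h), yi in zip(rows, reg_win_pct)]
```

The dummy coefficient is the estimated shift in regulation win percentage for home games at a fixed value of the shot metric.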

Third case: Even-strength 5v5 only, split by home and away

We only use game data where the score is tied and both teams are playing at full strength (5v5). The model being proposed is REGULATION WIN % = β0 + β1 × SHOT METRIC % + β2 × HOME/AWAY STATUS + ε.

Metric(s)                R²        Adjusted R²
Corsi % + Home/Away      0.3493    0.3438
Fenwick % + Home/Away    0.3369    0.3313
Shot % + Home/Away       0.3345    0.3289
Goal % + Home/Away       0.5       0.4958


Without considering a team’s score differential or strength on the ice, the best shot metric is the actual shot percentage, explaining just under 29% of the variance in regulation win percentage. Splitting the dataset by home and away improves the fit, with Fenwick percentage slightly more predictive than shot percentage. Restricting the data to tied, even-strength 5v5 play, Corsi percentage becomes the strongest shot-based predictor of regulation win percentage, with an R² of 0.3493 (adjusted R² of 0.3438).
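As a sanity check on the tables, adjusted R² can be recomputed from R² with the standard formula 1 − (1 − R²)(n − 1)/(n − p − 1). The sample sizes below are an assumption, not something stated in the post: n = 120 for the unsplit case (30 teams × 4 seasons) and n = 240 once each season is split into home and away. Under that assumption, the reported adjusted values are reproduced to within rounding.

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R² for n observations and p predictors (excluding intercept)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# (label, reported R², reported adjusted R², assumed n, predictors p)
reported = [
    ("Corsi %, all situations", 0.2652, 0.259, 120, 1),
    ("Fenwick %, all situations", 0.2818, 0.2757, 120, 1),
    ("Shot %, all situations", 0.2897, 0.2836, 120, 1),
    ("Goal %, all situations", 0.8253, 0.8238, 120, 1),
    ("Corsi % + Home/Away", 0.298, 0.2921, 240, 2),
    ("Fenwick % + Home/Away", 0.3124, 0.3066, 240, 2),
    ("Shot % + Home/Away", 0.31, 0.3042, 240, 2),
    ("Goal % + Home/Away", 0.8056, 0.804, 240, 2),
    ("Corsi % + Home/Away, tied 5v5", 0.3493, 0.3438, 240, 2),
    ("Fenwick % + Home/Away, tied 5v5", 0.3369, 0.3313, 240, 2),
    ("Shot % + Home/Away, tied 5v5", 0.3345, 0.3289, 240, 2),
    ("Goal % + Home/Away, tied 5v5", 0.5, 0.4958, 240, 2),
]

# Difference between the recomputed and the reported adjusted R² per row.
gaps = {label: adjusted_r_squared(r2, n, p) - adj
        for label, r2, adj, n, p in reported}

for label, gap in gaps.items():
    print(f"{label}: difference {gap:+.5f}")
```

Because the adjustment is so small at these sample sizes, comparing raw or adjusted R² leads to the same ranking of metrics within each case.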

This model doesn’t consider save percentage, special teams, or the myriad other aspects that compose a winning team. Considering how much happens in a game, Corsi percentage and home/away status alone explain roughly a third of the variance in regulation win percentage, which lends evidence that shot attempts at even strength with the score tied make a good metric for analysis.

Regression diagnostics

These are the diagnostics for the full-strength tied regression. The residuals appear to be approximately normal and homoscedastic, centred on zero.
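A rough numerical companion to such diagnostic plots is to check that the residuals average out to zero and that their spread is similar at low and high fitted values. The sketch below is only that: the function name and the six fitted/residual pairs are hypothetical, and it does not test normality, which would still need a Q-Q plot or a formal test.

```python
from statistics import mean, pvariance


def residual_summary(fitted, residuals):
    """Mean residual, plus the ratio of residual variance in the upper half
    of fitted values to that in the lower half (near 1 suggests even spread)."""
    pairs = sorted(zip(fitted, residuals))  # order observations by fitted value
    half = len(pairs) // 2
    low = [r for _, r in pairs[:half]]
    high = [r for _, r in pairs[half:]]
    return {
        "mean": mean(residuals),
        "variance_ratio": pvariance(high) / pvariance(low),
    }


# Made-up fitted values and residuals for illustration only.
fitted = [0.30, 0.34, 0.38, 0.42, 0.46, 0.50]
residuals = [0.02, -0.03, 0.01, -0.01, 0.03, -0.02]

summary = residual_summary(fitted, residuals)
```

A variance ratio far from 1, or a mean far from zero, would be a reason to look more closely at the plots.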


