Alex Diaz


Data Mining the Play-By-Play

At long last, my thesis is complete:

Data Mining the Play-By-Play

Overall summary:

  • Linear regression models demonstrating:
    • Full-strength tied Corsi% as a useful predictor for win percentage and goal percentage (“Full-strength tied Corsi” refers to a team’s shot attempts when the game is tied and with five skaters vs five skaters)
    • The surprisingly distinct contributions of powerplay and the penalty kill
  • Pretty graphs! I mean, “data visualization”.
  • A demonstration of how association rule learning and interest measures can be used to evaluate players and player combinations

Less technical summary:

  • If your team regularly has more shot attempts than opponents while the game is tied and while at full strength, that’s good
  • The powerplay and penalty kill don’t seem to overlap much and have roughly equal value in explaining win percentage
  • Pretty graphs! I mean, “data visualization”.
  • Ever wondered who really carries the second line? A machine learning technique called “association rule learning” can help tease out the details to see which players tend to contribute more (and which ones should be bumped from the second line to the first).

Measuring Corsi For and Against with Association Rules

I’ve previously discussed potential applications of association rule learning in the second half of a post from last year. The arulesViz package for R is broken for me right now, so I won’t be able to recreate the graph at the bottom without considerable effort. For now, we’ll use graphics I’ve generated on my own. While I’ve created some interesting results, I would like to strongly emphasize that this is essentially a prototype, so going forward one must remember the assumptions being made and the contexts of the statistics being analyzed.

Last weekend I presented some of my findings at Ottawa Hockey Analytics, and what I’m writing about today is largely the same. My data comes entirely from the NHL’s Play By Play (PBP) and Player TOI tables. For this work it suffices to use PBP data on its own; I find that Corsi events reliably record the players present on the ice. However, I originally intended this dataset to have other purposes, so I stripped the PBP of its player information and added it back using TOI tables. There may be some errors as a consequence (e.g., if shifts were not properly recorded). I organized the data into binary columns for every event and player. If a player is on the ice during an event, he is marked as a 1; those off the ice are marked as 0s. Events are recorded similarly. This is a sample of what I end up with:

Transaction dataset example (201314 LA)
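
A minimal sketch of how a binary table like this could be built, assuming each event is already available as an event label plus the set of players on the ice. The sample rows and the helper name `to_binary_rows` are hypothetical; this is not the code used for the thesis.

```python
# Hypothetical sample: (event label, players on the ice for that event).
events = [
    ("CORSI_FOR", {"ANZE_KOPITAR", "DWIGHT_KING"}),
    ("CORSI_AGAINST", {"ANZE_KOPITAR"}),
    ("CORSI_FOR", {"DWIGHT_KING"}),
]


def to_binary_rows(events):
    """Return (columns, rows) with one 0/1 column per player and per event type."""
    players = sorted({p for _, on_ice in events for p in on_ice})
    labels = sorted({label for label, _ in events})
    columns = players + labels
    rows = []
    for label, on_ice in events:
        present = on_ice | {label}  # items that appear in this row
        rows.append([1 if col in present else 0 for col in columns])
    return columns, rows


columns, rows = to_binary_rows(events)
print(columns)
for row in rows:
    print(row)
```

Each printed row is one itemset in 0/1 form, with player columns first and event columns last.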


In association rule learning, a binary dataset like this one can be thought of as a big list of itemsets. For our purposes, our items are players and events, and we’d like to measure how often they appear together. Our motivation is to highlight players’ presences during “good” or “bad” events; in this example, we’ll use Corsi For and Corsi Against, respectively. This technique is not limited to Corsi events — in fact, I’d like to expand it to many other events — but for now they’re the easiest to record and they have predictive value.

Before showing the results, I’ll provide a quick primer:

  • An itemset X is a collection of items. It’s basically a row in our dataset. In the screenshot, our first itemset is {ANZE_KOPITAR, JONATHAN_QUICK, DWIGHT_KING, CORSI_FOR}. Eventually I removed the goaltenders – I believe they have analytic value, but for now I’m keeping it simple. Well, simple-ish.
  • A rule is an implication between itemsets X and Y, written as X => Y.
    • Rules are split into the left-hand side (LHS, or “antecedent”) and right-hand side (RHS, or “consequent”)
    • Here, we only consider players on the LHS and events on the RHS
    • For example, our rules will appear as {ANZE_KOPITAR, DWIGHT_KING} => {CORSI_FOR}

The following metrics are interest measures. As the name implies, they are meant to highlight interesting relationships between variables.

  • The support of an itemset X is defined as the proportion of the database in which X appears
    • For us, a higher support means that the combination of players and events happens more often. Players with more ice time will necessarily have higher support simply because they’re on the ice when more things happen
  • The confidence of a rule X => Y is the ratio of the support for X and Y to the support of X alone.
    • CONF(X => Y) = SUPP(X, Y)/SUPP(X)
  • The lift of a rule X => Y is the ratio of the support for X and Y to the product of the supports of X and Y individually
    • LIFT(X => Y) = SUPP(X, Y)/[SUPP(X)*SUPP(Y)]
  • The difference of confidence of a rule X => Y is the difference of confidence between X => Y and ¬X => Y
    • DOC(X => Y) = CONF(X => Y) – CONF(¬X => Y)
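
To make the first three definitions concrete, here is a small sketch on a toy list of itemsets. The five transactions are made up for illustration, and the function names are my own rather than anything from an association rule library.

```python
# Hypothetical transactions: players on the ice plus the event that occurred.
transactions = [
    {"KOPITAR", "KING", "CORSI_FOR"},
    {"KOPITAR", "KING", "CORSI_FOR"},
    {"KOPITAR", "CORSI_AGAINST"},
    {"KING", "CORSI_FOR"},
    {"CORSI_AGAINST"},
]


def support(itemset, transactions):
    """Proportion of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(lhs, rhs, transactions):
    """CONF(X => Y) = SUPP(X, Y) / SUPP(X)."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)


def lift(lhs, rhs, transactions):
    """LIFT(X => Y) = SUPP(X, Y) / [SUPP(X) * SUPP(Y)]."""
    joint = support(lhs | rhs, transactions)
    return joint / (support(lhs, transactions) * support(rhs, transactions))


X = {"KOPITAR", "KING"}
Y = {"CORSI_FOR"}
print("support:", support(X | Y, transactions))
print("confidence:", confidence(X, Y, transactions))
print("lift:", lift(X, Y, transactions))
```

In this toy data the pair appears together in two of five rows, and both of those rows are Corsi For events, so the confidence is 1 and the lift is above 1.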

Interest measures can be interpreted in a probability context. If we restrict our sample space to all Corsi events over an entire season, we are measuring the probability that Corsi events occur with respect to the players on the ice. Letting X = {Players on ice} and Y = {Event}, we get:

  • SUPP(X, Y) = P(X, Y) = Probability that a Corsi event and a particular combination of players occurs
  • CONF(X => Y) = P(X, Y)/P(X) = P(Y|X) = Probability that Y occurs, given that player combination X is on the ice
  • LIFT(X => Y) = P(X, Y)/[P(X)P(Y)]. Statistical independence between events X and Y is defined as P(X)P(Y) = P(X, Y), so the lift’s closeness to 1 could be used to indicate independence.
  • DOC(X => Y) = P(Y|X) – P(Y|¬X) = Probability that an event occurs given player combination X is on the ice – Probability that an event occurs given player combination X is not on the ice
    • Note that X is the entire combination of players. If X = {KOPITAR, KING}, then ¬X = {All sets without both of KOPITAR, KING}. Thus, ¬X includes all events where: (1) Kopitar is on the ice but King is not; (2) King is on the ice but Kopitar is not; and (3) Neither Kopitar nor King is on the ice
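
The difference of confidence and the meaning of ¬X can be sketched as follows. The rows are hypothetical and chosen so that all three kinds of ¬X rows appear; the function name is my own.

```python
# Hypothetical rows: players on the ice plus the event.
doc_rows = [
    {"KOPITAR", "KING", "CORSI_FOR"},
    {"KOPITAR", "KING", "CORSI_FOR"},
    {"KOPITAR", "KING", "CORSI_AGAINST"},
    {"KOPITAR", "CORSI_FOR"},       # Kopitar without King: part of not-X
    {"KING", "CORSI_AGAINST"},      # King without Kopitar: part of not-X
    {"CORSI_AGAINST"},              # neither player: part of not-X
]


def difference_of_confidence(lhs, rhs, rows):
    """CONF(X => Y) - CONF(not X => Y).

    not-X is every row that is missing at least one item of X.
    Assumes both groups are non-empty (otherwise this divides by zero).
    """
    with_x = [r for r in rows if lhs <= r]
    without_x = [r for r in rows if not (lhs <= r)]
    conf_x = sum(1 for r in with_x if rhs <= r) / len(with_x)
    conf_not_x = sum(1 for r in without_x if rhs <= r) / len(without_x)
    return conf_x - conf_not_x


doc = difference_of_confidence({"KOPITAR", "KING"}, {"CORSI_FOR"}, doc_rows)
print(doc)
```

Here the pair sees a Corsi For in two of its three rows, versus one of three for everything else, so the difference of confidence is positive.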

With all that in mind, here is a description of what I’m working with:

  • The dataset contains all even-strength 5v5 tied Corsi events from 2013-14
  • All Corsi events are treated equally. I haven’t made adjustments for quality of competition.
  • We are limited to how often player-event combinations occur within a team. There are two sides to this: the first is how often a situation exists; the second is how often a player is in the situation to begin with
  • All analysis was done within teams. Comparing between teams may not be useful.

I believe the best application of this analysis is to compare how players are faring in their current roles on their teams. We can highlight player chemistry, their performances relative to teammates in the same position, and (hopefully) hidden potential. Over time, we may also be able to use multiple seasons to see how a player has performed with different teammates and team strengths.

I’ve generated six graphs for each team:

  1. All individual performances
  2. Defensemen, as individuals
  3. Defensemen, as pairs
  4. Forwards, as individuals
  5. Forwards, as pairs
  6. Forwards, as trios

Support for Corsi Against is on the left; support for Corsi For is on the right. More green means that the Difference of Confidence is higher for that statistic, implying that the event is more likely given that the specific player combination is on the ice. More purple is the opposite (an event is relatively less likely given that player combination). Rules with fewer than 25 occurrences were dropped, so if a column is missing on one side it’s probably because it missed that threshold.

Important note: The initial batch of images I uploaded on Feb 13th had some errors based on when players were on the ice. I have since fixed the error in my code but I won’t be immediately redoing the analysis for each team. Here is a direct link to the album.

Examples and discussion

Toronto Maple Leafs, 2013-14:

  • Possession numbers are terrible across the board
  • Phaneuf is on the ice for about a fifth of even-strength tied shot attempts against with a very strong difference of confidence against him
  • Jake Gardiner and Morgan Rielly have the strongest indication in favour of possession among defensive pairings
  • Forward possession is driven by Kessel and Van Riemsdyk. Looking at forward pairs, it suggests that Kessel and JVR have stronger chemistry with each other than either has with Bozak
    • Kadri’s support for Corsi For/Against is better when paired with either winger, and with stronger difference of confidence, suggesting he could be a better first line centre
  • Lupul and McClement had awful years

Los Angeles Kings, 2013-14:

  • The top pairing of Doughty-Muzzin is strong. The difference of confidence at an individual level is much higher in Muzzin’s favour, though he and Doughty don’t seem to have shared defence partners at any point
  • Anze Kopitar drives play strongly
  • Tyler Toffoli had a strong year with a variety of linemates

Linear regression between win percentage and shot metrics

One way to gauge shot metrics is to measure their relationship to wins. We’ll use data from the last four full seasons: 2009-2010 to 2013-14, excluding the lockout-shortened season. Running the analysis and including the shortened season does not change the overall conclusions, but the relationships between all variables come out slightly weaker. Two games are missing from the dataset (an OTT-BUF game from 2009-10 and a WSH-CAR game from 2010-11).

Variables used

The data considered is during regulation time only. Additionally, we will be measuring regulation win percentage instead of the usual win percentage; teams that win in overtime or the shootout are not considered to have a regulation win. Corsi, Fenwick, and Shot Percentage are defined as usual: (Shot attempts for) / (Shot attempts for + Shot attempts against). When using Home/Away as a factor (dummy/indicator variable), the dataset is split and the win percentages refer to home and away regulation win percentages. Because of the two missing games, statistics for the teams involved will be slightly different from calculations using the NHL’s official results.
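
As a quick sketch of the three percentages, assuming the usual breakdown of shot attempts into shots on goal, misses, and blocked attempts (Corsi counts all attempts, Fenwick drops the blocked ones). The counts below are invented for illustration.

```python
def share(for_count, against_count):
    """(For) / (For + Against)."""
    return for_count / (for_count + against_count)


# Hypothetical season totals. "blocked" means this side's attempts that were blocked.
team = {"shots_on_goal": 2400, "missed": 900, "blocked": 1100}
opponents = {"shots_on_goal": 2500, "missed": 950, "blocked": 1000}


def corsi(counts):
    """All shot attempts: on goal, missed, and blocked."""
    return counts["shots_on_goal"] + counts["missed"] + counts["blocked"]


def fenwick(counts):
    """Unblocked shot attempts: on goal and missed."""
    return counts["shots_on_goal"] + counts["missed"]


corsi_pct = share(corsi(team), corsi(opponents))
fenwick_pct = share(fenwick(team), fenwick(opponents))
shot_pct = share(team["shots_on_goal"], opponents["shots_on_goal"])
print(f"Corsi % {corsi_pct:.4f}, Fenwick % {fenwick_pct:.4f}, Shot % {shot_pct:.4f}")
```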

In all cases, the best predictor by far was the percentage of goals scored. This should not be surprising as winning is defined by outscoring your opponent. However, since goals are fairly rare, we would like to use more common events in analysis; goals are included for the sake of comparison, but we won’t dwell on their predictive value.

First case: All situations

We use all game data and don’t differentiate between home and away. The model being proposed is REGULATION WIN % = SHOT METRIC % + ε

Metric(s)    R²        Adjusted R²
Corsi %      0.2652    0.259
Fenwick %    0.2818    0.2757
Shot %       0.2897    0.2836
Goal %       0.8253    0.8238
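
For readers who want to see where an R² comes from, here is a minimal sketch of a one-predictor least-squares fit. I am reading the model above as the usual intercept-plus-slope regression; the five data points are hypothetical and are not taken from the dataset.

```python
def fit_simple_ols(x, y):
    """Least-squares fit of y = intercept + slope * x; returns (intercept, slope, r2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return intercept, slope, 1 - ss_res / ss_tot


# Hypothetical team-seasons: shot metric % and regulation win %.
metric_pct = [0.46, 0.48, 0.50, 0.52, 0.54]
reg_win_pct = [0.35, 0.45, 0.40, 0.50, 0.55]

intercept, slope, r2 = fit_simple_ols(metric_pct, reg_win_pct)
print(f"intercept={intercept:.3f} slope={slope:.3f} R^2={r2:.3f}")
```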

Second case: All situations, split by home and away

We use all game data, but split the season into home and away games. The model being proposed is REGULATION WIN % = SHOT METRIC% + HOME AWAY STATUS + ε.

Metric(s)                R²        Adjusted R²
Corsi % + Home/Away      0.298     0.2921
Fenwick % + Home/Away    0.3124    0.3066
Shot % + Home/Away       0.31      0.3042
Goal % + Home/Away       0.8056    0.804

Third case: Even-strength 5v5 with the score tied only, split by home and away

We only use game data where the score is tied and both teams are playing at full-strength. The model being proposed is REGULATION WIN % = SHOT METRIC % + HOME AWAY STATUS + ε.

Metric(s)                R²        Adjusted R²
Corsi % + Home/Away      0.3493    0.3438
Fenwick % + Home/Away    0.3369    0.3313
Shot % + Home/Away       0.3345    0.3289
Goal % + Home/Away       0.5       0.4958


Without considering a team’s score differential or strength on the ice, the best shot metric is the actual shot percentage, which explains just under 29% of the variance in regulation win percentage. Splitting the dataset by home and away improves the fit, with Fenwick percentage slightly more predictive than shot percentage. Reducing our data to only even-strength 5v5 play with the score tied, we find that Corsi percentage becomes the strongest shot-based predictor of regulation win percentage with an adjusted R² of 0.3438.
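
The adjusted R² values can be recovered from the plain R² values with the standard formula. The sample sizes below are my assumption (30 teams over 4 seasons gives 120 team-seasons, and splitting by home and away doubles that to 240); they are not stated in the text, but they reproduce the reported figures to within rounding.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n observations and p predictors (intercept not counted)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Assumed sample sizes: 120 team-seasons, or 240 when split by home/away.
all_situations_corsi = adjusted_r2(0.2652, n=120, p=1)   # Corsi % only
tied_5v5_corsi = adjusted_r2(0.3493, n=240, p=2)         # Corsi % + Home/Away dummy

print(round(all_situations_corsi, 4))
print(round(tied_5v5_corsi, 4))
```

The penalty is small here because there are many observations relative to the one or two predictors, which is why the two columns in each table are so close.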

This model doesn’t consider save percentage, special teams, or the myriad other aspects that compose a winning team. Considering how much happens in a game, Corsi percentage and home/away status alone act as very useful predictors, lending evidence that even-strength, tied shot attempts make a good metric for analysis.

Regression diagnostics

These are the diagnostics for the full-strength tied regression. The residuals appear to satisfy conditions for normality and homoscedasticity centred about mean zero.

Illustrating score effect by period in the NHL

It’s well-established that a team’s shooting numbers vary based on the period and the state of the game. As a team falls behind, it plays more aggressively, and vice versa. You can see it clearly in these graphs from the last five seasons. Each graph measures the league average even-strength Corsi percentage for the entire season, by score difference (down by two or more, down by one, tied, up by one, up by two or more).

Two consistent patterns appear in every season:

  1. Possession declines with a lead and increases with a deficit.
  2. The first and second periods are very similar, but in the third period there is a remarkable change in the magnitude of the differences in possession.
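
A rough sketch of how shot attempts could be grouped by period and score state to produce numbers like these. The event tuples are hypothetical, recorded from one team’s point of view, and the function names are my own.

```python
from collections import defaultdict


def score_state(goal_diff):
    """Bucket a goal differential into the five score states used in the graphs."""
    if goal_diff <= -2:
        return "down 2+"
    if goal_diff == -1:
        return "down 1"
    if goal_diff == 0:
        return "tied"
    if goal_diff == 1:
        return "up 1"
    return "up 2+"


def corsi_pct_by_state(attempts):
    """Map (period, score state) to Corsi %, given (period, goal_diff, is_for) tuples."""
    counts = defaultdict(lambda: [0, 0])  # key -> [attempts for, attempts against]
    for period, goal_diff, is_for in attempts:
        counts[(period, score_state(goal_diff))][0 if is_for else 1] += 1
    return {key: f / (f + a) for key, (f, a) in counts.items()}


# Hypothetical attempts: (period, team's goal differential, attempt by this team?)
attempts = [
    (1, 0, True), (1, 0, False),
    (3, 1, False), (3, 1, False), (3, 1, True),
    (3, -1, True),
]
by_state = corsi_pct_by_state(attempts)
for key in sorted(by_state):
    print(key, round(by_state[key], 3))
```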

Note: The 2012-2013 season was shortened by a lockout; while the overall conclusions are the same, the differences in the third period are much starker.

Who pays the iron price? Goalpost and crossbar hits since 2009-2010

PING! “AWWWwwwwww…”, the crowd rumbles. Depending on which side took the shot, the arena is either flooded with relief or tearing their hair out at a missed opportunity.

Is your team cursed? Do they dance with Lady Luck? Are you suspicious that the nets are ever-so-slightly smaller in some arenas? I have no idea, but luckily the NHL tracks posts and crossbars as missed shots in their play by play pages. As usual, there is some grumbling about the accuracy of their numbers, but for our purposes we’ll take them at face value. The graphs below measure the ratio of goal posts and crossbars hit to shots on goal – that is, shots that hit the net. This is a rough measurement of which NHL teams hit the post or crossbar most often.
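
The ratio itself is simple arithmetic. A tiny sketch, with made-up counts and a function name of my own choosing:

```python
def iron_per_1000(posts_and_crossbars, shots_on_goal):
    """Posts and crossbars hit for every 1000 shots on goal."""
    return 1000 * posts_and_crossbars / shots_on_goal


# Hypothetical totals: 186 posts or crossbars against 10,000 shots on goal.
rate = iron_per_1000(186, 10_000)
print(round(rate, 1))
```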

Offensive posts

IronFor2009-10 to 2013-14

Over the last five full seasons, the Sabres and the Avalanche were relatively lucky, respectively hitting iron just 14.7 and 14.8 times for every 1000 shots on goal. The Senators fared the worst in the league with 22.3 bouncing away from the net. The league ratio (highlighted in red) was 18.6 post/crossbar hits per thousand shots.

Below we have each season individually.

The worst ratios come from the lockout season, which could be a result of sample size from a shortened season. In 2012-2013, there were 19.7 shots off the iron for every 1000 on goal; the Canucks managed to ring a whopping 31.1 per 1000. Conversely, the 2009-2010 Sabres escaped with just 10.2 for every 1000.

Defensive posts

IronAgainst2009-10 to 2013-14

As before, the league saw 18.6 posts or crossbars for every 1000 shots. The Bruins got the least help from their nets, which rang only 14.7 times per 1000 shots, while the Jets (with their somewhat truncated sample) heard 21.8 per 1000.

Again, each season presented individually:

The lockout season once more gives us the highest number: the 2012-2013 Winnipeg Jets saw 26.7 shots hit iron for every 1000 that hit the net. The 2012-2013 Penguins got no help from their nets, seeing a paltry 8.6 posts or crossbars for every 1000 shots.


It’s possible that measuring defensive goalposts has different interpretations from offensive goalposts. When we look at offensive statistics, players are facing different goaltenders and teams regularly, whereas defensive statistics are more or less the same goalies. It could be that certain goalies are more likely to catch pucks that would normally hit iron, or that they (or their defensemen) force players to miss the net entirely because of their positioning. Whether this has a measurable outcome on a game is anyone’s guess. If the numbers are accurate and hitting iron turns out to be mostly luck, then some teams face a swing of a dozen or more goals each year, balancing their fates on a half-inch bounce in the right direction. Hey, sometimes that’s all you need.

The worst superstition in the modern NHL

Hockey is full of superstitions: Don’t step on your dressing room logo; Don’t shave in the playoffs; And most of all, your captain must never touch the trophy after winning the Conference Final. That last one is easily the most meaningless, and yet this marks the fourth consecutive Stanley Cup Final where neither team captain touched their respective trophy.* The logic behind it is that the only trophy that matters is the Stanley Cup, and you don’t want to jinx yourself. Don’t touch the trophy. Don’t even acknowledge its existence. As far as you know, the Campbell Bowl and Prince of Wales Trophy are Sirens of hockey, tempting innocent men to doom in the next round.

The NHL website states that since 2001, teams that touch their trophies are 4-5 in the Final. Two of those matches (New Jersey vs Colorado, 2001; Carolina vs Detroit, 2002) were between teams where both captains made the agonizing decision to make physical contact with a prize they’d earned. I decided to go a bit further, back to 1997. Since then, there have been seven matches between teams where one touched the trophy and the other did not. They occurred in 1997, 1999, 2003, 2004, 2007, 2009, and 2010, and the team that touched the trophy has a 4-3 record. Not exactly damning. There have also been four years where both teams touched — 1998, 2000, 2001, and 2002 — so obviously this superstition doesn’t hold much sway.

In fact, not touching the trophies is a pretty recent phenomenon. Since 1997, there have been five Finals (excluding the current one) where neither team touched the trophy: 2006, 2008, 2011, 2012, and 2013. Some players aren’t even consistent in what they do. In the 2008 match between Pittsburgh and Detroit, neither Crosby nor Lidstrom made contact with their trophies, and Pittsburgh lost. The next year the teams met again, but this time Crosby grabbed the Prince of Wales Trophy, took a photo with Malkin and Gonchar, and then skated around with it. Pittsburgh went on to win. Other teams skate around the issue, figuratively and literally: In 2011 Boston won the trophy and didn’t touch it. Instead, they did this:

The next round, they beat the Canucks in seven games. While it doesn’t count as touching the trophy, it’s certainly not ignoring it. It all boils down to this: If the Hockey Gods truly exist, they’re pretty lenient.

*The New York Rangers have the distinction of not having a captain this year. Regardless, they avoided the Prince of Wales Trophy.

Correlations between hockey variables

Lots of people ask the basic question: are the Corsi and Fenwick statistics any good? The logic behind them (that teams that control the puck more are more successful) is sound, but often unexamined. One way to judge their usefulness is to check how they relate to other variables, like goals for/against, or how they correlate to winningness relative to another statistic like save percentage.

Pairwise correlations

Pairwise correlations compare two variables at a time. The diagonal (with the histograms) shows the distribution of a particular variable as well as its name. Each panel below the diagonal has a scatterplot of the two variables above it and to its right; each panel above the diagonal prints the correlation coefficient for the corresponding pair of variables, with the font size increasing for larger correlations. An obvious example is the comparison between the Fenwick% and Corsi%: the correlation coefficient is 0.97, and the scatterplot forms a nearly perfect line, telling us that the Corsi% and Fenwick% are very, very closely related.
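
For reference, each printed coefficient is a Pearson correlation. A minimal sketch of the calculation, using hypothetical values rather than the real team data:

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical team-seasons: possession % and share of possible points earned.
possession_pct = [0.46, 0.48, 0.50, 0.52, 0.54]
points_pct = [0.35, 0.45, 0.40, 0.50, 0.55]
r = pearson(possession_pct, points_pct)
print(round(r, 2))
```

A value near 1 or -1 corresponds to the tight, line-like scatterplots in the matrix; a value near 0 corresponds to a shapeless cloud.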

Looking at the rightmost column tells you how each of the variables relates to the percentage of possible points a team could earn. The Corsi% has a coefficient of 0.51, and the Fenwick 0.54. Save percentage is also at 0.51, meaning that a high save percentage correlates to winningness about as strongly as puck possession does. The top row relates to outhitting your opponent. I’ve discussed that a bit in my first post, so it suffices to say that if it’s a useful metric at all, it tends to be negatively related to winning.

The two remaining variables are Goals Against Per Game (GAPG) and Goals For Per Game (GFPG). The obvious conclusions are evident: GAPG is very negatively correlated with earning more points (-0.74), whereas GFPG is very positively correlated (0.62). The less obvious conclusions are there too, telling us that puck possession is positively correlated with GFPG (0.32 and 0.31 for Fenwick% and Corsi%, respectively) and negatively correlated with GAPG (-0.50 and -0.49 for Fenwick% and -0.5 for Corsi%, respectively). All of this demonstrates that puck possession stats look to be good predictors of success.

Controlling for save percentage

In my first post I uploaded graphs that showed a strong link between puck possession and success in the regular season and playoffs. The issue with that, though, is that some teams had very good possession numbers, yet didn’t qualify for the playoffs, while others achieved the opposite. The 2010-2011 Bruins had an even-strength Corsi of 50.73% — fairly average — and yet they won the Stanley Cup. Last year’s Toronto Maple Leafs, however, managed to have possession numbers among the worst in the last five years, but still qualified for the playoffs. One explanation that gets thrown around is save percentage — and it turns out it’s a decent one.

Controlled Save Pct

Blue dots qualified for the playoffs, red dots are Stanley Cup champions.

This graph plots the percentage of possible points that teams earned in a regular season against the Corsi%. The graph is split into six panels based on a team’s save percentage. The teams with the lowest save percentages (roughly below 0.910) are in the bottom left, and the teams with the highest (roughly above 0.928) are at the top right. The main things to notice here:

  • The teams with the lowest save percentages need a good Corsi to qualify for the playoffs. As the save percentages get lower, teams with lower Corsi percentages have a harder time making the playoffs; you can see this by the positions of the black dots (non-qualifying teams) in the bottom row.
  • Stanley Cup champions have had middling goaltending in the regular season (Chicago, 2009-2010), but they need damn good possession stats
  • Teams with consistently elite goaltending (Boston, 2010-2011) can win the Stanley Cup with a fairly average Corsi

What’s the moral of the story then? Good puck possession and a solid goalie are keys to winning (duh). If a team is weak in one of the two areas, though, then they must counterbalance with strength in the other, especially if the weakness is possession.[1] A team with terrible possession stats might sneak into the playoffs, but don’t expect them to go anywhere without their goalie stealing the Cup.

[1] If your Fenwick % is floating around 0.45, you should probably work on that instead of hunting down Dominik Hasek for his blood.

Measuring the value of Crosby, Getzlaf, and Giroux to their teams

Last week, the Hart Memorial Trophy candidates were announced. According to the infallible internet, it’s likely that Crosby will win, but let’s figure out if that’s true. To get the obvious stats out of the way, Crosby (36G, 68A, 80GP), Getzlaf (31G, 56A, 77GP), and Giroux (28G, 58A, 82GP) finished first, second, and third in scoring this year, with points per game of 1.30, 1.13, and 1.05, respectively. Pittsburgh, Anaheim, and Philadelphia finished with 242, 263, and 233 goals, respectively. Since the Hart looks at a player’s value to his team, it makes sense to look at his contributions to the team’s overall scoring.

Pie charts are terrible.

Candidate’s points (in teal) as a proportion of a team’s overall goals.
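
The proportions in the chart follow directly from the numbers quoted above. A short sketch of the arithmetic (the dictionary layout and variable names are my own):

```python
# Goals, assists, games played, and team goals, as quoted in the text.
candidates = {
    "Crosby": {"goals": 36, "assists": 68, "games": 80, "team_goals": 242},
    "Getzlaf": {"goals": 31, "assists": 56, "games": 77, "team_goals": 263},
    "Giroux": {"goals": 28, "assists": 58, "games": 82, "team_goals": 233},
}

summary = {}
for name, s in candidates.items():
    points = s["goals"] + s["assists"]
    summary[name] = {
        "points": points,
        "points_per_game": points / s["games"],
        # Points as a proportion of the team's overall goals.
        "share_of_team_goals": points / s["team_goals"],
    }

for name, row in summary.items():
    print(f"{name}: {row['points']} pts, "
          f"{row['points_per_game']:.2f} per game, "
          f"{row['share_of_team_goals']:.1%} of team goals")
```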

Looking at points alone, Crosby has a pretty huge head start over the other two. Now, the Hart (allegedly) isn’t the Art Ross 2.0, so it makes sense for us to look at possession statistics and some frequencies from association rule mining.

Team            Strength    Candidate    Corsi %      Fenwick %
Pittsburgh      All         On           0.6036322    0.6095969
Pittsburgh      All         Off          0.4225558    0.4267751
Anaheim         All         On           0.5423348    0.5500000
Anaheim         All         Off          0.4812510    0.4914966
Philadelphia    All         On           0.5998837    0.5955325
Philadelphia    All         Off          0.4552777    0.4539683
Pittsburgh      Even        On           0.5308595    0.5372152
Pittsburgh      Even        Off          0.4638907    0.4685562
Anaheim         Even        On           0.5193694    0.5249392
Anaheim         Even        Off          0.4928212    0.4995057
Philadelphia    Even        On           0.5437117    0.5356383
Philadelphia    Even        Off          0.4811052    0.4763085

You may have noticed that Crosby and Giroux have a larger impact on the ice for their team than Getzlaf. These differences become much clearer when they’re visualized.
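
One way to put a number on that impact is the gap between the team’s Corsi % with the candidate on the ice and with him off it. The sketch below uses the even-strength Corsi values from the table; the "on minus off" gap is my own summary, not a statistic from the original analysis.

```python
# Even-strength Corsi % with the candidate (on, off), copied from the table.
corsi_even = {
    "Pittsburgh (Crosby)": (0.5308595, 0.4638907),
    "Anaheim (Getzlaf)": (0.5193694, 0.4928212),
    "Philadelphia (Giroux)": (0.5437117, 0.4811052),
}

# Gap between the team's Corsi % with the player on the ice and off it.
gaps = {team: on - off for team, (on, off) in corsi_even.items()}

for team, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{team}: {gap:+.4f}")
```

Sorted this way, Getzlaf’s gap comes out smallest, which matches the reading of the table.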

Fenwick strength status

At this point it becomes clear that if there’s any competition, it’s between Crosby and Giroux. While all of the players improve their teams’ performances, it’s obvious that Getzlaf’s relative contribution is not as strong as either of the other two.

Looking deeper, the Fenwick percentage at even-strength tilts the odds further towards Crosby and quite a bit farther away from Getzlaf. The next combination of graphs compares the game-by-game Fenwick.

Hart multiplot even

Blue and red lines represent season averages with and without the player on the ice, respectively. Black lines are the team average.

Two things to notice here: (1) Pittsburgh’s possession stats with Crosby are higher than Philadelphia’s with Giroux; and (2) Pittsburgh possession stats without Crosby are lower than Philadelphia’s without Giroux. This is especially evident when you look at the gaps between points — Giroux is very good, but Crosby absolutely lifts his team. This caught me a bit off guard, since until I wrote this post I hadn’t noticed that Pittsburgh finished the regular season with Corsi and Fenwick percentages below 0.500.

When we look at association rules, we see the same story being told, albeit in a different manner.

Pittsburgh:

Rank    Player            Event            Support    Confidence
1       Any               Shot for         0.155      0.155
2       Any               Shot against     0.149      0.149
3       Any               Hit for          0.137      0.137
4       Any               Hit against      0.133      0.133
5       Any               Block for        0.075      0.075
6       Sidney Crosby     Shot for         0.073      0.201
7       Any               Block against    0.069      0.069
8       Any               They miss        0.062      0.062
9       Chris Kunitz      Shot for         0.062      0.201
10      Matt Niskanen     Shot for         0.061      0.178

Philadelphia:

Rank    Player            Event            Support    Confidence
1       Any               Shot against     0.150      0.150
2       Any               Shot for         0.149      0.149
3       Any               Hit for          0.130      0.130
4       Any               Hit against      0.125      0.125
5       Any               Block against    0.077      0.077
6       Any               Block for        0.072      0.072
7       Claude Giroux     Shot for         0.064      0.188
8       Braydon Coburn    Shot against     0.061      0.175
9       Any               We miss          0.060      0.060
10      Jakub Voracek     Shot for         0.057      0.200

The main takeaway from Crosby’s table is how high up his generation of offense is: about 7.3% of all active events in the game are a Pittsburgh shot on goal while he’s on the ice, and when he’s on the ice, there’s a 20.1% chance that the active event will be a Pittsburgh shot hitting the net. In fact, Crosby was on the ice for a Pittsburgh shot on goal more often than any player was on the ice for an opponent having their shot blocked. Crosby’s linemate Kunitz is only on the ice for a Pittsburgh shot in 6.2% of events, so that suggests that Crosby is doing quite a bit on his own. (As a note, you’ll see similar stuff for players like Erik Karlsson, who tend to be head and shoulders above their teammates, even if their teammates are very skilled on their own.)

What’s the conclusion here? In terms of relative contributions to their teams, this is a race between Crosby and Giroux — one that Crosby will very probably win.

Mining and analyzing five seasons of NHL data

As part of a personal project, I started scraping regular season NHL play-by-play data from 2009/10 until 2013/14. I took pages like this one and made them readable to a computer, meaning I could work with the data at a really low level. It’s a detailed dataset to work with; any time an event (e.g. faceoff, hit, shot, etc.) happens, you get a list of the players from each team, their strength, and other game-related data. Mining the data was only a minor pain in the ass thanks to the Python library Beautiful Soup. After formatting the data and running it through R, I got some pretty graphs out of it. Oh, I also found some neat statistical relations.

First, I looked at the Corsi and Fenwick, since they’re new and exciting and discussed frequently. They’re the difference of the number of attempted shots between your team and the opposing team. The Corsi counts all attempts, and the Fenwick excludes blocked shots. Full details are available at Pension Plan Puppets (Corsi, Fenwick).

The results are pretty cool (see if you can spot the Edmonton Oilers!):

Percentage of possible points by Corsi percentage (2009/10-2013/14).

For a challenge, try to spot the Edmonton Oilers!

The line of best fit shows the linear relationship between the variables — generally, a higher Corsi/Fenwick means a team is more successful at earning points.

I did something similar for hits percentage. A team with a hit percentage of 0.500 hits exactly as often as its opponents, whereas a team below 0.500 gets out-hit. There’s a strong negative relationship between hitting and the possession stats. This makes sense intuitively, since if you’re hitting then you probably don’t have the puck.

Corsi percentage vs Hit percentage

The surprising part for me is that three of the last four Stanley Cup winners were out-hit in the regular season. Of course, this comes with all sorts of pitfalls – the definition of a hit is loose, and it’s possible that there’s a lot of bias working its way in. For all we know, crappy teams might have scorekeepers who count too many hits for the home team, and good teams might have the opposite situation. Regardless, it’s interesting.

Association rules

All of this is still game-level or season-level data though. The real fun of play-by-play data is that we can do stuff like association rule learning on it. One of its most well-known applications is market basket analysis. Let’s say you own a grocery store and want to know what people buy together. A basket would have something like milk, eggs, and bread in it. You can then create a rule: {milk,eggs}=>{bread}, meaning that the presence of milk and eggs is associated with the presence of bread. If you do this for every basket, you’ll see some rules come up more often than others. Rules like {milk,eggs}=>{bread} and {chips,cola}=>{salsa} would appear more often than, say, {sausage,bacon}=>{halal chicken}. You can use different measures to answer questions like: “Would higher sales of nachos increase sales of salsa?” or “Is someone more likely to buy Advil if they’re buying diapers?”. With play-by-play data we can do the same for players and events, like {Phil Kessel, James van Riemsdyk}=>{Shot taken}, or {Marc-Andre Fleury}=>{Comically reckless puck play}.

For an example, we’ve got the 2013-14 Toronto Maple Leafs. The support is how often a player-event combination occurs, out of all events in the dataset; the confidence is the chance that an event happens if a particular set of players is on the ice; and the lift compares how often the players and the event occur together with how often they would if they were independent, so a lift close to 1 suggests independence. For the Leafs, the most probable player-event is that Dion Phaneuf is on ice during an opponent’s shot. Next, it’s Phil Kessel being on the ice when the Leafs have a shot.

Rank    Player(s)                          Event           Support       Confidence    Lift
1       DION_PHANEUF                       SHOT_AGAINST    0.06218755    0.1680772     1.0388827
2       PHIL_KESSEL                        SHOT_FOR        0.05619953    0.1645753     1.3070524
3       JAMES_VAN_RIEMSDYK                 SHOT_FOR        0.05378234    0.1624896     1.2904881
4       CARL_GUNNARSSON                    SHOT_AGAINST    0.05361754    0.1782648     1.1018523
5       JAMES_VAN_RIEMSDYK                 SHOT_AGAINST    0.05356260    0.1618257     1.0002423
6       PHIL_KESSEL                        SHOT_AGAINST    0.05092567    0.1491313     0.9217781
7       CODY_FRANSON                       SHOT_AGAINST    0.05004670    0.1556467     0.9620497
8       DION_PHANEUF                       SHOT_FOR        0.04916772    0.1328879     1.0553920
9       JAKE_GARDINER                      SHOT_AGAINST    0.04905785    0.1444750     0.8929978
10      PHIL_KESSEL, JAMES_VAN_RIEMSDYK    SHOT_FOR        0.04823381    0.1760931     1.3985262

The table below ranks player-events by confidence. If Kessel, JVR, Phaneuf, and Franson are all on the ice, there’s a 25.53% chance that the event is the Leafs taking a shot. The lift is 2.26, which means that the combination of players and the event is probably not a coincidence. If you want a simpler example, look at Colton Orr: when he’s on the ice, there’s a 24.79% chance that an event will be a Leaf (probably him) making a hit. The lift is 1.74, so the fact that the Leafs are hitting is likely because he’s on the ice. Looking at the table above, Dion Phaneuf is often on the ice when there’s a shot against, but the lift is close to 1. My interpretation is that he’s on the ice so often that he’s bound to be around when they’re stuck in their own end.

Number Player(s) Event Support Confidence Lift
2 PHIL_KESSEL, DION_PHANEUF, CODY_FRANSON SHOT_FOR 0.01098720 0.2824859 2.243495
4 COLTON_ORR HIT_FOR 0.01587650 0.2478559 1.740633
5 DION_PHANEUF, CODY_FRANSON SHOT_FOR 0.01241554 0.2364017 1.877495
9 DION_PHANEUF, JAY_MCCLEMENT SHOT_AGAINST 0.02010658 0.2067797 1.278102
10 TYLER_BOZAK, PHIL_KESSEL, CODY_FRANSON SHOT_FOR 0.01395374 0.2055016 1.632088
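The Phaneuf interpretation can be checked with the numbers already in the first table. Since confidence is P(event | players) and support is P(players, event), dividing support by confidence recovers how often the players are on the ice, and dividing confidence by lift recovers how common the event is overall. The sketch below is only arithmetic on the published values; `implied_shares` is a hypothetical helper name.

```python
# Values copied from the first table in this post: (support, confidence, lift).
rules = {
    ("DION_PHANEUF", "SHOT_AGAINST"): (0.06218755, 0.1680772, 1.0388827),
    ("PHIL_KESSEL", "SHOT_FOR"): (0.05619953, 0.1645753, 1.3070524),
    ("PHIL_KESSEL", "SHOT_AGAINST"): (0.05092567, 0.1491313, 0.9217781),
}


def implied_shares(support, confidence, lift):
    """Back out P(players on ice) and P(event) from a rule's three measures."""
    on_ice = support / confidence  # P(players) = P(players, event) / P(event | players)
    event = confidence / lift      # P(event)   = P(event | players) / lift
    return on_ice, event


for (player, event_name), measures in rules.items():
    on_ice, event_share = implied_shares(*measures)
    print(f"{player:>13} {event_name:<12} on ice for {on_ice:.1%} of events; "
          f"{event_name} is {event_share:.1%} of events")
```

Phaneuf comes out at roughly 37% of all events considered and Kessel at roughly 34%, which fits the reading that Phaneuf's high support for shots against mostly reflects ice time rather than a strong association.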

Looking at the events alone can give a decent overview of a team’s overall strategy. For the Leafs, it was some combination of getting outshot and hoping your goalie keeps you in the game, while outhitting your opponents and hoping that generates offense somehow. One thing to note is just how badly the Leafs get outshot. Counting all shot attempts (shots on goal, missed shots, and blocked shots), about 30.25% of the events considered are the other team’s, versus 23.41% for the Leafs. This suggests that the Leafs spend a lot of time playing in their own defensive zone.

Rank Event Support
1 SHOT_AGAINST 0.16178652
2 HIT_FOR 0.14239411
3 SHOT_FOR 0.12591331
4 HIT_AGAINST 0.12036478
5 THEY_MISS 0.07059276
6 BLOCK_FOR 0.07020821
7 BLOCK_AGAINST 0.06086909
8 WE_GIVE 0.04900291
9 WE_MISS 0.04751964
10 THEY_GIVE 0.04482778
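The shot rows in this table can be rolled up into shot attempts for and against. The sketch below assumes that BLOCK_FOR means a Leaf blocked an opponent's attempt (so it counts against) and BLOCK_AGAINST means an opponent blocked a Leafs attempt; that reading of the labels is my assumption, not something the table states.

```python
# Event supports copied from the table above.
event_support = {
    "SHOT_AGAINST": 0.16178652,
    "HIT_FOR": 0.14239411,
    "SHOT_FOR": 0.12591331,
    "HIT_AGAINST": 0.12036478,
    "THEY_MISS": 0.07059276,
    "BLOCK_FOR": 0.07020821,
    "BLOCK_AGAINST": 0.06086909,
    "WE_GIVE": 0.04900291,
    "WE_MISS": 0.04751964,
    "THEY_GIVE": 0.04482778,
}

# Assumed mapping of labels to shot attempts (shots on goal + misses + blocks).
ATTEMPTS_AGAINST = ("SHOT_AGAINST", "THEY_MISS", "BLOCK_FOR")
ATTEMPTS_FOR = ("SHOT_FOR", "WE_MISS", "BLOCK_AGAINST")

attempts_against = sum(event_support[name] for name in ATTEMPTS_AGAINST)
attempts_for = sum(event_support[name] for name in ATTEMPTS_FOR)
attempt_share = attempts_for / (attempts_for + attempts_against)

print(f"attempts against: {attempts_against:.2%} of events")
print(f"attempts for:     {attempts_for:.2%} of events")
print(f"Leafs share of all attempts: {attempt_share:.1%}")
```

Under that assumption the totals land within a few hundredths of a point of the 30.25% and 23.41% quoted in the text, and the Leafs end up with a bit under 44% of all shot attempts in all situations.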

One last thing you can do is visualize these rules with the R package arulesViz. This organizes the rules we’ve created by lift and then groups them together. You can get a rough idea of what happens on the ice given that certain players are present. Darker circles mean it’s more likely that a player’s presence is causing an event, and larger circles mean that the event is more common. A large, dark circle (like SHOT_FOR under Phil Kessel) suggests that the player-event combination is frequent and likely caused by the player’s presence. A small, darker circle (like HIT_FOR under Colton Orr) suggests that a player has a very focused purpose — in this case, Colton Orr isn’t on the ice much, but when he is, he’s out to hit somebody.

Figure: Leafs association rules matrix

Of course, everything here needs to be taken in context. Team strategy shapes what individual players are asked to do, so a system can inflate or depress a player’s numbers in particular areas. Regardless, there’s a lot to look at here, and we’re just scratching the surface.

Technical notes

  • Some games may be incomplete because not all events past a certain point were recorded. There were around 5600 games in the last five seasons, so I didn’t have time to verify all of them. The aggregate statistics (e.g. shots over a season) seem to match up though, so this is accurate enough for our purposes.
  • Some games are missing from the NHL website, so they were scraped from the Internet Archive. Game 0836 from 2009-10 is missing entirely. If you have any conspiracy theories on why Bettman doesn’t want us to know what really happened in Buffalo that night, let me know.
  • The events available in the play-by-play are: hits, shots, blocks, missed shots, takeaways, giveaways, penalties (including fights), goals, faceoffs, play stoppages (e.g. goalies freezing the puck, offside, icing, TV timeouts, etc.), period start, period end, game end, extended intermission start, extended intermission end, and shootout.
  • The events I considered were hits, shots, blocked shots, missed shots, takeaways, giveaways, penalties, and goals. I consider these “active” events, since they happen during play. The others have their own analytical value and I plan to look at them another time.
  • Association rule learning can be written in terms of probabilities:
    • Support(X) = P(X)
    • Confidence(X => Y) = P(Y|X)
    • Lift(X => Y) = P(XY)/[P(X)P(Y)]
      • Two events X and Y are independent iff P(XY)=P(X)P(Y). A lift of 1 is evidence that two events are independent — a player-event combination with a lift of 1 probably has no significance. You can find more on lift here.
    • The plot from arulesViz defaults to k-means clustering on lift to group rules together. It’s a novel technique and is fun to play around with.
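
The three formulas above can be applied directly to play-by-play rows. What follows is a brute-force Python sketch, not the R workflow behind the tables in this post and not the Apriori algorithm: it simply tries every player combination up to a small size. The events and the `mine_rules` function are made up for illustration.

```python
from itertools import combinations

# Hypothetical events: (players on the ice, event label). Made-up data.
events = [
    ({"KESSEL", "JVR", "PHANEUF"}, "SHOT_FOR"),
    ({"KESSEL", "JVR"}, "SHOT_FOR"),
    ({"KESSEL", "PHANEUF"}, "SHOT_AGAINST"),
    ({"ORR", "PHANEUF"}, "HIT_FOR"),
    ({"ORR"}, "HIT_FOR"),
    ({"JVR", "PHANEUF"}, "SHOT_AGAINST"),
]


def mine_rules(events, max_players=2, min_support=0.1):
    """Return (players, event, support, confidence, lift), sorted by confidence."""
    n = len(events)
    event_count = {}
    for _, label in events:
        event_count[label] = event_count.get(label, 0) + 1
    players = sorted(set().union(*(on_ice for on_ice, _ in events)))

    rules = []
    for size in range(1, max_players + 1):
        for combo in combinations(players, size):
            group = set(combo)
            # Labels of every event where the whole group was on the ice.
            matching = [label for on_ice, label in events if group <= on_ice]
            for label in sorted(set(matching)):
                joint = matching.count(label) / n                   # P(XY)
                if joint < min_support:
                    continue
                confidence = matching.count(label) / len(matching)  # P(Y|X)
                lift = confidence / (event_count[label] / n)        # P(XY)/[P(X)P(Y)]
                rules.append((combo, label, joint, confidence, lift))
    return sorted(rules, key=lambda rule: rule[3], reverse=True)


for combo, label, joint, confidence, lift in mine_rules(events):
    print(f"{', '.join(combo):<16} => {label:<12} "
          f"support={joint:.3f} confidence={confidence:.3f} lift={lift:.3f}")
```

Brute force is fine for a toy like this, but the number of combinations grows quickly with roster size, which is why real implementations prune candidates by minimum support before extending them.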
