Alex Diaz


Mining and analyzing five seasons of NHL data

As part of a personal project, I scraped regular season NHL play-by-play data from 2009/10 through 2013/14. I took pages like this one and made them readable to a computer, meaning I could work with the data at a really low level. It’s a detailed dataset to work with; any time an event (faceoff, hit, shot, etc.) happens, you get a list of the players on the ice for each team, each team’s strength, and other game-related data. Mining the data was only a minor pain in the ass thanks to the Python library Beautiful Soup. After formatting the data and running it through R, I got some pretty graphs out of it. Oh, I also found some neat statistical relations.

First, I looked at Corsi and Fenwick, since they’re new and exciting and discussed frequently. Both are the difference between the number of shots your team attempts and the number the opposing team attempts. Corsi counts all attempts, and Fenwick excludes blocked shots. The percentage versions divide your team’s attempts by both teams’ attempts combined, so 0.500 means an even split. Full details are available at Pension Plan Puppets (Corsi, Fenwick).
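To make the definitions concrete, here’s a minimal Python sketch. The `AttemptCounts` container, the function names, and the example counts are all made up for illustration; none of it comes from the original scraper.

```python
from dataclasses import dataclass


@dataclass
class AttemptCounts:
    """Shot attempts by one side (hypothetical container for illustration)."""
    shots: int    # attempts that reached the net
    missed: int   # attempts that missed the net
    blocked: int  # attempts by this side that the other side blocked


def corsi(team: AttemptCounts, opp: AttemptCounts) -> int:
    """All attempts for, minus all attempts against."""
    attempts_for = team.shots + team.missed + team.blocked
    attempts_against = opp.shots + opp.missed + opp.blocked
    return attempts_for - attempts_against


def fenwick(team: AttemptCounts, opp: AttemptCounts) -> int:
    """Same as Corsi, but blocked attempts are left out on both sides."""
    return (team.shots + team.missed) - (opp.shots + opp.missed)


def corsi_pct(team: AttemptCounts, opp: AttemptCounts) -> float:
    """Share of all attempts that belong to `team` (0.5 means an even split)."""
    attempts_for = team.shots + team.missed + team.blocked
    attempts_against = opp.shots + opp.missed + opp.blocked
    return attempts_for / (attempts_for + attempts_against)


# Made-up counts for a single game.
us = AttemptCounts(shots=30, missed=12, blocked=10)
them = AttemptCounts(shots=25, missed=10, blocked=15)
print(corsi(us, them), fenwick(us, them), round(corsi_pct(us, them), 3))
```

A Fenwick percentage would follow the same pattern as `corsi_pct`, just without the blocked attempts.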

The results are pretty cool (see if you can spot the Edmonton Oilers!):

Percentage of possible points by Corsi percentage (2009/10-2013/14).

For a challenge, try to spot the Edmonton Oilers!

The line of best fit shows the linear relationship between the variables: generally, teams with a higher Corsi or Fenwick percentage earn a larger share of their possible points.
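For reference, a line of best fit like this is usually an ordinary least-squares fit. Here’s a minimal sketch assuming that’s what the plot uses; the function name and the numbers are invented toy values, not the real team data.

```python
def least_squares(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-style and variance-style sums (the 1/n factors cancel in the ratio).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy values only: Corsi percentage vs. share of possible points.
toy_corsi_pct = [0.45, 0.48, 0.50, 0.52, 0.55]
toy_points_pct = [0.40, 0.47, 0.50, 0.56, 0.60]
slope, intercept = least_squares(toy_corsi_pct, toy_points_pct)
print(f"points_pct ~ {slope:.2f} * corsi_pct + {intercept:.2f}")
```

A positive slope is what “higher Corsi goes with more points” looks like in numbers.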

I did something similar for hit percentage. A team with a hit percentage of 0.500 hits exactly as often as its opponents, whereas a team below 0.500 gets out-hit. There’s a strong negative relationship between hitting and the possession stats. This makes sense intuitively: if you’re hitting, you probably don’t have the puck.

Corsi percentage vs Hit percentage

The surprising part for me is that three of the last four Stanley Cup winners were out-hit in the regular season. Of course, this comes with all sorts of pitfalls – the definition of a hit is loose, and it’s possible that there’s a lot of bias working its way in. For all we know, crappy teams might have scorekeepers who count too many hits for the home team, and good teams might have the opposite situation. Regardless, it’s interesting.

Association rules

All of this is still game-level or season-level data though. The real fun of play-by-play data is that we can do stuff like association rule learning on it. One of its most well-known applications is market basket analysis. Let’s say you own a grocery store and want to know what people buy together. A basket would have something like milk, eggs, and bread in it. You can then create a rule: {milk,eggs}=>{bread}, meaning that the presence of milk and eggs is associated with the presence of bread. If you do this for every basket, you’ll see some rules come up more often than others. Rules like {milk,eggs}=>{bread} and {chips,cola}=>{salsa} would appear more often than, say, {sausage,bacon}=>{halal chicken}. You can use different measures to answer questions like: “Would higher sales of nachos increase sales of salsa?” or “Is someone more likely to buy Advil if they’re buying diapers?”. With play-by-play data we can do the same for players and events, like {Phil Kessel, James van Riemsdyk}=>{Shot taken}, or {Marc-Andre Fleury}=>{Comically reckless puck play}.
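To make the basket idea concrete for hockey, here’s a rough Python sketch where each play-by-play event is a “basket” holding the players on the ice plus the event itself. The rows are invented toy data and `rule_counts` is a hypothetical helper, not the code behind this post (the real analysis was run in R).

```python
from collections import Counter
from itertools import combinations

# Invented toy rows: (event label, players on the ice for the team of interest).
toy_events = [
    ("SHOT_FOR", ["PHIL_KESSEL", "JAMES_VAN_RIEMSDYK"]),
    ("SHOT_FOR", ["PHIL_KESSEL", "JAMES_VAN_RIEMSDYK"]),
    ("HIT_FOR", ["COLTON_ORR"]),
    ("SHOT_AGAINST", ["PHIL_KESSEL", "DION_PHANEUF"]),
]


def rule_counts(events, max_players=2):
    """Count how often each {players} => {event} pairing shows up.

    Every subset of the on-ice players, up to `max_players` in size, is
    treated as a candidate left-hand side for the event on that row.
    """
    counts = Counter()
    for label, players in events:
        roster = sorted(set(players))  # sorted so each subset has one canonical key
        for size in range(1, max_players + 1):
            for combo in combinations(roster, size):
                counts[(combo, label)] += 1
    return counts


for (players, label), n in rule_counts(toy_events).most_common(3):
    print(f"{{{', '.join(players)}}} => {{{label}}}: seen {n} time(s)")
```

Raw counts like these are only the starting point; the measures described next turn them into something comparable across players.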

For an example, we’ve got the 2013-14 Toronto Maple Leafs. The support is how often the players and the event show up together, out of all events in the dataset; the confidence is the chance that the event happens given that a particular set of players is on the ice; and the lift compares how often the players and the event actually occur together with how often they would if they were independent, so a lift close to 1 suggests no real association. For the Leafs, the most frequent player-event is Dion Phaneuf being on the ice during an opponent’s shot. Next is Phil Kessel being on the ice when the Leafs take a shot.

Rank Player(s) Event Support Confidence Lift
1 DION_PHANEUF SHOT_AGAINST 0.06218755 0.1680772 1.0388827
2 PHIL_KESSEL SHOT_FOR 0.05619953 0.1645753 1.3070524
3 JAMES_VAN_RIEMSDYK SHOT_FOR 0.05378234 0.1624896 1.2904881
4 CARL_GUNNARSSON SHOT_AGAINST 0.05361754 0.1782648 1.1018523
5 JAMES_VAN_RIEMSDYK SHOT_AGAINST 0.05356260 0.1618257 1.0002423
6 PHIL_KESSEL SHOT_AGAINST 0.05092567 0.1491313 0.9217781
7 CODY_FRANSON SHOT_AGAINST 0.05004670 0.1556467 0.9620497
8 DION_PHANEUF SHOT_FOR 0.04916772 0.1328879 1.0553920
9 JAKE_GARDINER SHOT_AGAINST 0.04905785 0.1444750 0.8929978
10 PHIL_KESSEL, JAMES_VAN_RIEMSDYK SHOT_FOR 0.04823381 0.1760931 1.3985262
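Here’s a minimal sketch of how the three columns could be computed from a list of events. It assumes each event is stored as a set of on-ice players plus an event label; `rule_metrics` and the toy data are hypothetical, and the numbers in the table came out of R, not this code.

```python
def rule_metrics(transactions, players, event):
    """Return (support, confidence, lift) for the rule {players} => {event}.

    `transactions` is a list of (set_of_players_on_ice, event_label) pairs.
    """
    players = set(players)
    n = len(transactions)
    # Events where every player in the rule is on the ice.
    n_players = sum(1 for on_ice, _ in transactions if players <= on_ice)
    # Events with the right label, regardless of who is on the ice.
    n_event = sum(1 for _, label in transactions if label == event)
    # Events where both conditions hold.
    n_both = sum(
        1 for on_ice, label in transactions if players <= on_ice and label == event
    )
    if n_players == 0 or n_event == 0:
        raise ValueError("the players or the event never appear in the data")
    support = n_both / n              # P(players and event)
    confidence = n_both / n_players   # P(event | players)
    lift = confidence / (n_event / n)  # P(event | players) / P(event)
    return support, confidence, lift


# Invented toy data with two placeholder players.
toy = [
    ({"A", "B"}, "SHOT_FOR"),
    ({"A"}, "SHOT_FOR"),
    ({"A", "B"}, "HIT_FOR"),
    ({"B"}, "SHOT_AGAINST"),
]
print(rule_metrics(toy, {"A"}, "SHOT_FOR"))
```

A real dataset has far too many player combinations to check one at a time like this, which is why dedicated association rule tools prune rare combinations first.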

The table below ranks player-events by confidence. If Kessel, JVR, Phaneuf, and Franson are all on the ice, there’s a 25.53% chance that the event is the Leafs taking a shot. The lift is 2.26, which means that the combination of those players and the event is probably not a coincidence. If you want a simpler example, look at Colton Orr: when he’s on the ice, there’s a 24.79% chance that an event will be a Leaf (probably him) making a hit. The lift is 1.74, so Leafs hits are strongly tied to him being on the ice. Looking at the table above, Dion Phaneuf is often on the ice when there’s a shot against, but the lift is close to 1. My interpretation is that he’s on the ice so often that he’s bound to be around when they’re stuck in their own end.

Number Player(s) Event Support Confidence Lift
2 PHIL_KESSEL, DION_PHANEUF, CODY_FRANSON SHOT_FOR 0.01098720 0.2824859 2.243495
4 COLTON_ORR HIT_FOR 0.01587650 0.2478559 1.740633
5 DION_PHANEUF, CODY_FRANSON SHOT_FOR 0.01241554 0.2364017 1.877495
9 DION_PHANEUF, JAY_MCCLEMENT SHOT_AGAINST 0.02010658 0.2067797 1.278102
10 TYLER_BOZAK, PHIL_KESSEL, CODY_FRANSON SHOT_FOR 0.01395374 0.2055016 1.632088

Looking at the events alone can give a decent overview of a team’s overall strategy. For the Leafs, it was some combination of getting outshot and hoping your goalie keeps you in the game, while outhitting your opponents and hoping that generates offense somehow. One thing to note is just how badly the Leafs get outshot. Out of all events considered, about 30.25% are the other team taking a shot, versus 23.41% for the Leafs taking a shot. This suggests that the Leafs spend a lot of time playing in their own defensive zone.

Rank Event Support
1 SHOT_AGAINST 0.16178652
2 HIT_FOR 0.14239411
3 SHOT_FOR 0.12591331
4 HIT_AGAINST 0.12036478
5 THEY_MISS 0.07059276
6 BLOCK_FOR 0.07020821
7 BLOCK_AGAINST 0.06086909
8 WE_GIVE 0.04900291
9 WE_MISS 0.04751964
10 THEY_GIVE 0.04482778

One last thing you can do is visualize these rules with the R package arulesViz. This organizes the rules we’ve created by lift and then groups them together. You can get a rough idea of what happens on the ice given that certain players are present. Darker circles mean a higher lift, so the event is more strongly associated with the player’s presence, and larger circles mean that the player-event combination is more common. A large, dark circle (like SHOT_FOR under Phil Kessel) suggests that the player-event combination is frequent and strongly tied to the player’s presence. A small, dark circle (like HIT_FOR under Colton Orr) suggests that a player has a very focused purpose: Colton Orr isn’t on the ice much, but when he is, he’s out to hit somebody.

Leafs association rules matrix

Of course, everything here needs to be taken in context. Team strategy impacts individual players, so in some cases it can increase or decrease a player’s performance in some areas. Regardless, there’s a lot to look at here — we’re just scratching the surface.

Technical notes

  • Some games may be incomplete because not all events past a certain point were recorded. There were around 5600 games in the last five seasons, so I didn’t have time to verify all of them. The aggregate statistics (e.g. shots over a season) seem to match up though, so this is accurate enough for our purposes.
  • Some games are missing from the NHL website, so they were scraped from the Internet Archive. Game 0836 from 2009-10 is missing entirely. If you have any conspiracy theories on why Bettman doesn’t want us to know what really happened in Buffalo that night, let me know.
  • The events available in the play-by-play are: hits, shots, blocks, missed shots, takeaways, giveaways, penalties (including fights), goals, faceoffs, play stoppages (e.g. goalies freezing the puck, offside, icing, TV timeouts, etc.), period start, period end, game end, extended intermission start, extended intermission end, and shootout.
  • The events I considered were hits, shots, blocked shots, missed shots, takeaways, giveaways, penalties, and goals. I consider these “active” events, since they happen during play. The others have their own analytical value and I plan to look at them another time.
  • Association rule learning can be written in terms of probabilities:
    • Support(X) = P(X)
    • Confidence(X => Y) = P(Y|X)
    • Lift(X => Y) = P(XY)/[P(X)P(Y)]
      • Two events X and Y are independent iff P(XY)=P(X)P(Y). A lift of 1 is evidence that two events are independent — a player-event combination with a lift of 1 probably has no significance. You can find more on lift here.
  • The plot from arulesViz defaults to k-means clustering on lift to group rules together. It’s a novel technique and is fun to play around with.
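The lift formula can be sanity-checked against the tables in this post. Since Confidence(X => Y) = P(XY)/P(X), the lift is just the confidence divided by the support of the event on its own. The short Python sketch below redoes that arithmetic with values copied from the tables; the function name is made up for illustration.

```python
import math

# Event supports copied from the events table.
EVENT_SUPPORT = {
    "SHOT_AGAINST": 0.16178652,
    "HIT_FOR": 0.14239411,
    "SHOT_FOR": 0.12591331,
}

# (players, event, confidence, reported lift) copied from the rule tables.
REPORTED_RULES = [
    ("DION_PHANEUF", "SHOT_AGAINST", 0.1680772, 1.0388827),
    ("PHIL_KESSEL", "SHOT_FOR", 0.1645753, 1.3070524),
    ("COLTON_ORR", "HIT_FOR", 0.2478559, 1.740633),
]


def lift_from_confidence(confidence, event_support):
    """Lift(X => Y) = P(Y|X) / P(Y), which equals P(XY) / [P(X)P(Y)]."""
    return confidence / event_support


for players, event, confidence, reported in REPORTED_RULES:
    lift = lift_from_confidence(confidence, EVENT_SUPPORT[event])
    matches = math.isclose(lift, reported, rel_tol=1e-4)
    print(f"{players} => {event}: lift {lift:.4f}, reported {reported:.4f}, match: {matches}")
```

All three recomputed lifts agree with the reported ones to within rounding.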

Category: Analytics, Hockey, Statistics

