Alex Diaz

Who pays the iron price? Goalpost and crossbar hits since 2009-2010

PING! “AWWWwwwwww…”, the crowd rumbles. Depending on which side took the shot, the arena is either flooded with relief or tearing their hair out at a missed opportunity.

Is your team cursed? Do they dance with Lady Luck? Are you suspicious that the nets are ever-so-slightly smaller in some arenas? I have no idea, but luckily the NHL tracks posts and crossbars as missed shots in its play-by-play pages. As usual, there is some grumbling about the accuracy of those numbers, but for our purposes we’ll take them at face value. The graphs below measure the ratio of goalposts and crossbars hit to shots on goal (that is, shots that hit the net). This is a rough measurement of which NHL teams hit the post or crossbar most often.
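To make the ratio concrete, here is a minimal sketch of the calculation. The function name and the season totals are made up for illustration; they are not real NHL figures.

```python
def iron_per_thousand(iron_hits: int, shots_on_goal: int) -> float:
    """Posts and crossbars hit per 1000 shots on goal."""
    if shots_on_goal <= 0:
        raise ValueError("shots_on_goal must be positive")
    return 1000.0 * iron_hits / shots_on_goal


# Hypothetical season totals, for illustration only.
example_rate = iron_per_thousand(iron_hits=45, shots_on_goal=2500)
print(f"{example_rate:.1f} posts/crossbars per 1000 shots on goal")
```

Note that posts and crossbars are recorded as missed shots, so they are not a subset of shots on goal; the number is a ratio of two separate counts rather than a true percentage.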

Offensive posts

IronFor (2009-10 to 2013-14)

Over the last five full seasons, the Sabres and the Avalanche were relatively lucky, respectively hitting iron just 14.7 and 14.8 times for every 1000 shots on goal. The Senators fared the worst in the league with 22.3 bouncing away from the net. The league ratio (highlighted in red) was 18.6 post/crossbar hits per thousand shots.

Below we have each season individually.

The highest ratios come from the lockout-shortened 2012-2013 season, which could be a result of the smaller sample. That season, there were 19.7 shots off the iron for every 1000 on goal; the Canucks managed to ring a whopping 31.1 per 1000. Conversely, the 2009-2010 Sabres escaped with just 10.2 for every 1000.
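A rough sketch of why a shorter season gives noisier ratios. It assumes iron hits behave like a Poisson count and that a team takes about 30 shots on goal per game; both are my assumptions, not something measured in this post. The 18.6 figure is the five-season league ratio quoted earlier.

```python
import math

LEAGUE_RATE_PER_1000 = 18.6   # five-season league ratio from this post
SHOTS_PER_GAME = 30           # assumption, not taken from the data


def approx_noise_per_thousand(rate_per_1000: float, shots_on_goal: int) -> float:
    """Approximate one-standard-deviation noise in the per-1000 rate,
    treating the number of iron hits as a Poisson count."""
    expected_hits = rate_per_1000 / 1000.0 * shots_on_goal
    return 1000.0 * math.sqrt(expected_hits) / shots_on_goal


noise_by_games = {}
for games in (82, 48):  # full season vs. the 48-game lockout season
    shots = games * SHOTS_PER_GAME
    noise_by_games[games] = approx_noise_per_thousand(LEAGUE_RATE_PER_1000, shots)
    print(f"{games} games: roughly +/- {noise_by_games[games]:.1f} per 1000")
```

Under these assumptions the 48-game season is about 30% noisier than a full one, so more extreme team values are not surprising.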

Defensive posts

IronAgainst (2009-10 to 2013-14)

As before, the league saw 18.6 posts or crossbars for every 1000 shots. The Bruins got the least help from their nets, which rang only 14.7 times per 1000 shots, while the Jets (with their somewhat truncated sample) heard 21.8 per 1000.

Again, each season presented individually:

The lockout season once more gives us the highest number: the 2012-2013 Winnipeg Jets saw 26.7 shots hit iron for every 1000 that hit the net. The 2012-2013 Penguins got no help from their nets, seeing a paltry 8.6 posts or crossbars for every 1000 shots.

Notes

It’s possible that measuring defensive goalposts has different interpretations from offensive goalposts. When we look at offensive statistics, players are facing different goaltenders and teams regularly, whereas defensive statistics are more or less the same goalies. It could be that certain goalies are more likely to catch pucks that would normally hit iron, or that they (or their defensemen) force players to miss the net entirely because of their positioning. Whether this has a measurable outcome on a game is anyone’s guess. If the numbers are accurate and hitting iron turns out to be mostly luck, then some teams face a swing of a dozen or more goals each year, balancing their fates on a half-inch bounce in the right direction. Hey, sometimes that’s all you need.

Correlations between hockey variables

Lots of people ask the basic question: are the Corsi and Fenwick statistics any good? The logic behind them (that teams that control the puck more are more successful) is sound, but often unexamined. One way to judge their usefulness is to check how they relate to other variables, like goals for/against, or how they correlate to winningness relative to another statistic like save percentage.

Pairwise correlations

Pairwise correlations compare two variables at a time. The diagonal (with the histograms) shows the distribution of a particular variable as well as its name. Each panel below the diagonal has a scatterplot of the two variables above it and to its right; the mirrored panel above the diagonal prints the correlation coefficient for the same pair, with the font size increasing for larger correlations. An obvious example is the comparison between the Fenwick% and Corsi%: the correlation coefficient is 0.97, and the scatterplot forms a nearly perfect line, telling us that the Corsi% and Fenwick% are very, very closely related.

Looking at the rightmost column tells you how each of the variables relates to the percentage of possible points a team could earn. The Corsi% has a coefficient of 0.51, and the Fenwick% 0.54. Save percentage is also at 0.51, meaning that a high save percentage correlates to winningness about as strongly as puck possession does. The top row relates to outhitting your opponent. I’ve discussed that a bit in my first post, so it suffices to say that if it’s a useful metric at all, it tends to be negatively related to winning.

The two remaining variables are Goals Against Per Game (GAPG) and Goals For Per Game (GFPG). The obvious conclusions hold: GAPG is very negatively correlated with earning more points (-0.74), whereas GFPG is very positively correlated (0.62). The less obvious conclusions are there too, telling us that puck possession is positively correlated with GFPG (0.32 and 0.31 for Fenwick% and Corsi%, respectively) and negatively correlated with GAPG (roughly -0.5 for both Fenwick% and Corsi%). All of this suggests that puck possession stats are reasonable predictors of success.
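For anyone who wants to reproduce this kind of table, here is a small sketch of a pairwise Pearson correlation matrix. The post's graphs were made in R; this is a plain-Python stand-in, and the four "teams" below are invented numbers, not the real dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (>= 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Correlation for every pair of named columns, keyed by (name_a, name_b)."""
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}


# Invented team-season values, for illustration only.
toy_columns = {
    "corsi_pct": [0.45, 0.48, 0.52, 0.55],
    "fenwick_pct": [0.46, 0.47, 0.53, 0.54],
    "points_pct": [0.40, 0.55, 0.50, 0.65],
}
toy_matrix = correlation_matrix(toy_columns)
for (a, b), r in toy_matrix.items():
    if a < b:  # print each pair once
        print(f"{a} vs {b}: r = {r:.2f}")
```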

Controlling for save percentage

In my first post I uploaded graphs that showed a strong link between puck possession and success in the regular season and playoffs. The issue with that, though, is that some teams had very good possession numbers, yet didn’t qualify for the playoffs, while others achieved the opposite. The 2010-2011 Bruins had an even-strength Corsi of 50.73% — fairly average — and yet they won the Stanley Cup. Last year’s Toronto Maple Leafs, however, managed to have possession numbers among the worst in the last five years, but still qualified for the playoffs. One explanation that gets thrown around is save percentage — and it turns out it’s a decent one.

Controlled Save Pct

Blue dots qualified for the playoffs, red dots are Stanley Cup champions.

This graph plots the percentage of possible points that teams earned in a regular season against the Corsi%. The graphs are split into six groups based on a team’s save percentage. The teams with the lowest save percentages (roughly below 0.910) are in the bottom left, and the teams with the highest (roughly above 0.928) are at the top right. The main things to notice here:

  • The teams with the lowest save percentages need a good Corsi to qualify for the playoffs. As the save percentages get lower, teams with lower Corsi percentages have a harder time making the playoffs; you can see this by the positions of the black dots (non-qualifying teams) in the bottom row.
  • Stanley Cup champions have had middling goaltending in the regular season (Chicago, 2009-2010), but they need damn good possession stats.
  • Teams with consistently elite goaltending (Boston, 2010-2011) can win the Stanley Cup with a fairly average Corsi.
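The grouping step behind a plot like this can be sketched as follows. I'm assuming equal-count, non-overlapping bins here; the post does not spell out exactly how the six groups were cut, and the six teams below are invented.

```python
def quantile_bins(values, n_bins):
    """Return a bin index per value (0 = lowest values), with near-equal
    counts in each bin."""
    if n_bins < 1 or not values:
        raise ValueError("need at least one value and one bin")
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


# Invented teams: (save percentage, Corsi%, made playoffs?)
toy_teams = [
    (0.905, 0.53, True),
    (0.930, 0.49, True),
    (0.915, 0.47, False),
    (0.920, 0.51, True),
    (0.910, 0.46, False),
    (0.925, 0.50, True),
]
toy_bins = quantile_bins([team[0] for team in toy_teams], n_bins=3)
for bin_index in sorted(set(toy_bins)):
    members = [team for team, b in zip(toy_teams, toy_bins) if b == bin_index]
    print(f"save% group {bin_index}: {members}")
```

Within each group you can then plot points percentage against Corsi% and see whether the relationship changes as goaltending improves.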

What’s the moral of the story then? Good puck possession and a solid goalie are keys to winning (duh). If a team is weak in one of the two areas, though, then they must counterbalance with strength in the other, especially if the weakness is possession.[1] A team with terrible possession stats might sneak into the playoffs, but don’t expect them to go anywhere without their goalie stealing the Cup.

[1] If your Fenwick % is floating around 0.45, you should probably work on that instead of hunting down Dominik Hasek for his blood.

Measuring the value of Crosby, Getzlaf, and Giroux to their teams

Last week, the Hart Memorial Trophy candidates were announced. According to the infallible internet, it’s likely that Crosby will win, but let’s figure out if that’s true. To get the obvious stats out of the way, Crosby (36G, 68A, 80GP), Getzlaf (31G, 56A, 77GP), and Giroux (28G, 58A, 82GP) finished first, second, and third in scoring this year, with points per game of 1.30, 1.13, and 1.05, respectively. Pittsburgh, Anaheim, and Philadelphia finished with 242, 263, and 233 goals, respectively. Since the Hart looks at a player’s value to his team, it makes sense to look at his contributions to the team’s overall scoring.

Pie charts are terrible.

Candidate’s points (in teal) as a proportion of a team’s overall goals.
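The arithmetic behind the pie charts, as a short sketch using only the goal, assist, games-played, and team-goal totals quoted above. The dictionary layout and function name are mine.

```python
# Season totals as quoted in this post.
CANDIDATES = {
    "Crosby": {"goals": 36, "assists": 68, "games": 80, "team_goals": 242},
    "Getzlaf": {"goals": 31, "assists": 56, "games": 77, "team_goals": 263},
    "Giroux": {"goals": 28, "assists": 58, "games": 82, "team_goals": 233},
}


def summarize(stats):
    """Return (points, points per game, points as a share of team goals)."""
    points = stats["goals"] + stats["assists"]
    return points, points / stats["games"], points / stats["team_goals"]


summary = {name: summarize(stats) for name, stats in CANDIDATES.items()}
for name, (points, per_game, share) in summary.items():
    print(f"{name}: {points} points, {per_game:.2f} per game, "
          f"in on {share:.1%} of team goals")
```

Since a player earns at most one point per goal, the last number is the share of his team's goals that he scored or assisted on.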

Looking at points alone, Crosby has a pretty huge head start over the other two. Now, the Hart (allegedly) isn’t the Art Ross 2.0, so it makes sense for us to look at possession statistics and some frequencies from association rule mining.

Team Strength Candidate Corsi % Fenwick %
Pittsburgh All On 0.6036322 0.6095969
Pittsburgh All Off 0.4225558 0.4267751
Anaheim All On 0.5423348 0.5500000
Anaheim All Off 0.4812510 0.4914966
Philadelphia All On 0.5998837 0.5955325
Philadelphia All Off 0.4552777 0.4539683
Pittsburgh Even On 0.5308595 0.5372152
Pittsburgh Even Off 0.4638907 0.4685562
Anaheim Even On 0.5193694 0.5249392
Anaheim Even Off 0.4928212 0.4995057
Philadelphia Even On 0.5437117 0.5356383
Philadelphia Even Off 0.4811052 0.4763085

You may have noticed that Crosby and Giroux have a larger impact on the ice for their team than Getzlaf. These differences become much clearer when they’re visualized.
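One simple way to put a number on "impact" is the gap between a team's Fenwick % with the candidate on the ice and with him off it. The sketch below uses the Fenwick values from the table above; the gap itself is my own summary measure, not something the table reports.

```python
# (on-ice, off-ice) Fenwick % from the table above.
FENWICK = {
    "All": {
        "Crosby (Pittsburgh)": (0.6095969, 0.4267751),
        "Getzlaf (Anaheim)": (0.5500000, 0.4914966),
        "Giroux (Philadelphia)": (0.5955325, 0.4539683),
    },
    "Even": {
        "Crosby (Pittsburgh)": (0.5372152, 0.4685562),
        "Getzlaf (Anaheim)": (0.5249392, 0.4995057),
        "Giroux (Philadelphia)": (0.5356383, 0.4763085),
    },
}

# On-ice minus off-ice Fenwick %, per strength situation and candidate.
fenwick_gaps = {
    strength: {name: on - off for name, (on, off) in rows.items()}
    for strength, rows in FENWICK.items()
}
for strength, gaps in fenwick_gaps.items():
    for name, gap in sorted(gaps.items(), key=lambda item: -item[1]):
        print(f"{strength:>4} strength, {name}: {gap:+.3f}")
```

At all strengths and at even strength, the ordering of the gaps is the same: Crosby, then Giroux, then Getzlaf.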

Fenwick strength status

At this point it becomes clear that if there’s any competition, it’s between Crosby and Giroux. While all of the players improve their teams’ performances, it’s obvious that Getzlaf’s relative contribution is not as strong as either of the other two.

Looking deeper, the Fenwick percentage at even-strength tilts the odds further towards Crosby and quite a bit farther away from Getzlaf. The next combination of graphs compares the game-by-game Fenwick.

Hart multiplot even

Blue and red lines represent season averages with and without the player on the ice, respectively. Black lines are the team average.

Two things to notice here: (1) Pittsburgh’s possession stats with Crosby are higher than Philadelphia’s with Giroux; and (2) Pittsburgh possession stats without Crosby are lower than Philadelphia’s without Giroux. This is especially evident when you look at the gaps between points — Giroux is very good, but Crosby absolutely lifts his team. This caught me a bit off guard, since until I wrote this post I hadn’t noticed that Pittsburgh finished the regular season with Corsi and Fenwick percentages below 0.500.

When we look at association rules, we see the same story being told, albeit in a different manner.

Rank Player Event Support Confidence
1 Any Shot for 0.155 0.155
2 Any Shot against 0.149 0.149
3 Any Hit for 0.137 0.137
4 Any Hit against 0.133 0.133
5 Any Block for 0.075 0.075
6 Sidney Crosby Shot for 0.073 0.201
7 Any Block against 0.069 0.069
8 Any They miss 0.062 0.062
9 Chris Kunitz Shot for 0.062 0.201
10 Matt Niskanen Shot for 0.061 0.178

Rank Player Event Support Confidence
1 Any Shot against 0.150 0.150
2 Any Shot for 0.149 0.149
3 Any Hit for 0.130 0.130
4 Any Hit against 0.125 0.125
5 Any Block against 0.077 0.077
6 Any Block for 0.072 0.072
7 Claude Giroux Shot for 0.064 0.188
8 Braydon Coburn Shot against 0.061 0.175
9 Any We miss 0.060 0.060
10 Jakub Voracek Shot for 0.057 0.200

The main takeaway from Crosby’s table is how high up his generation of offense is: about 7.3% of all active events in the game are a Pittsburgh shot on goal while he’s on the ice, and when he’s on the ice, there’s a 20.1% chance that the active event will be a Pittsburgh shot hitting the net. In fact, Crosby being on the ice for a Pittsburgh shot on goal ranks above “Block against” with anyone at all on the ice (6.9% of events). Crosby’s linemate Kunitz is on for a Pittsburgh shot in only 6.2% of events, so that suggests that Crosby is doing quite a bit on his own. (As a note, you’ll see similar stuff for players like Erik Karlsson, who tend to be head and shoulders above their teammates, even if their teammates are very skilled on their own.)

What’s the conclusion here? In terms of relative contributions to their teams, this is a race between Crosby and Giroux — one that Crosby will very probably win.

Mining and analyzing five seasons of NHL data

As part of a personal project, I started scraping regular season NHL play-by-play data from 2009/10 to 2013/14. I took pages like this one and made them readable to a computer, meaning I could work with the data at a really low level. It’s a detailed dataset to work with; any time an event (e.g. faceoff, hit, shot, etc.) happens, you get a list of the players from each team, their strength, and other game-related data. Mining the data was only a minor pain in the ass thanks to the Python library Beautiful Soup. After formatting the data and running it through R, I got some pretty graphs out of it. Oh, I also found some neat statistical relations.
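To give a flavour of the "make it readable to a computer" step, here is a minimal sketch that turns an HTML table into a list of event records. I used Beautiful Soup for the real thing; this sketch swaps in Python's built-in html.parser so it runs without extra installs, and the HTML snippet and column names are invented (the real NHL pages are a lot messier).

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row being read, if any
        self._cell = None   # text pieces of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None


# Invented snippet, NOT the real play-by-play markup.
SAMPLE_HTML = """
<table>
  <tr><td>1</td><td>0:00</td><td>FAC</td><td>Faceoff at centre ice</td></tr>
  <tr><td>1</td><td>0:31</td><td>SHOT</td><td>Wrist shot on goal</td></tr>
  <tr><td>1</td><td>0:58</td><td>HIT</td><td>Hit in the corner</td></tr>
</table>
"""

parser = TableRowParser()
parser.feed(SAMPLE_HTML)
COLUMNS = ("period", "time", "event", "description")  # hypothetical names
events = [dict(zip(COLUMNS, row)) for row in parser.rows]
for event in events:
    print(event)
```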

First, I looked at the Corsi and Fenwick, since they’re new and exciting and discussed frequently. They’re the difference in the number of attempted shots between your team and the opposing team; the percentage versions used in the graphs here are your team’s share of all attempts. The Corsi counts all attempts, and the Fenwick excludes blocked shots. Full details are available at Pension Plan Puppets (Corsi, Fenwick).
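A minimal sketch of both statistics, in differential and percentage form. The class, the function names, and the single-game counts are invented for illustration, and I'm assuming shots on goal include goals.

```python
from dataclasses import dataclass


@dataclass
class AttemptCounts:
    on_goal: int   # shots on goal (assumed to include goals)
    missed: int    # attempts that missed the net
    blocked: int   # this team's attempts that were blocked


def corsi(counts: AttemptCounts) -> int:
    """All shot attempts: on goal, missed, and blocked."""
    return counts.on_goal + counts.missed + counts.blocked


def fenwick(counts: AttemptCounts) -> int:
    """Unblocked shot attempts: on goal and missed."""
    return counts.on_goal + counts.missed


def share(attempts_for: int, attempts_against: int) -> float:
    """Percentage form: our attempts as a fraction of all attempts."""
    return attempts_for / (attempts_for + attempts_against)


# Invented single-game counts.
us = AttemptCounts(on_goal=30, missed=12, blocked=14)
them = AttemptCounts(on_goal=25, missed=10, blocked=9)

corsi_diff = corsi(us) - corsi(them)
fenwick_diff = fenwick(us) - fenwick(them)
corsi_pct = share(corsi(us), corsi(them))
fenwick_pct = share(fenwick(us), fenwick(them))
print(f"Corsi {corsi_diff:+d} ({corsi_pct:.3f}), "
      f"Fenwick {fenwick_diff:+d} ({fenwick_pct:.3f})")
```

A percentage of 0.500 means the two teams attempted the same number of shots.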

The results are pretty cool (see if you can spot the Edmonton Oilers!):

Percentage of possible points by Corsi percentage (2009/10-2013/14).

For a challenge, try to spot the Edmonton Oilers!

The line of best fit shows the linear relationship between the variables — generally, a higher Corsi/Fenwick means a team is more successful at earning points.
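For reference, a line of best fit like this one can be computed with ordinary least squares. The graphs here came out of R; the sketch below is a plain-Python equivalent, and the five data points are invented rather than taken from the dataset.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length (>= 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented team-seasons: Corsi% and percentage of possible points.
toy_corsi = [0.44, 0.47, 0.50, 0.53, 0.56]
toy_points = [0.42, 0.50, 0.55, 0.57, 0.66]
toy_slope, toy_intercept = fit_line(toy_corsi, toy_points)
print(f"points% = {toy_slope:.2f} * corsi% + {toy_intercept:.2f}")
```

A positive slope is what "higher Corsi means more points" looks like in equation form.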

I did something similar for hits percentage. A team with a hit percentage of 0.500 hits exactly as often as its opponents, whereas a team below 0.500 gets out-hit. There’s a strong negative relationship between hitting and the possession stats. This makes sense intuitively, since if you’re hitting then you probably don’t have the puck.

Corsi percentage vs Hit percentage

The surprising part for me is that three of the last four Stanley Cup winners were out-hit in the regular season. Of course, this comes with all sorts of pitfalls – the definition of a hit is loose, and it’s possible that there’s a lot of bias working its way in. For all we know, crappy teams might have scorekeepers who count too many hits for the home team, and good teams might have the opposite situation. Regardless, it’s interesting.

Association rules

All of this is still game-level or season-level data though. The real fun of play-by-play data is that we can do stuff like association rule learning on it. One of its most well-known applications is market basket analysis. Let’s say you own a grocery store and want to know what people buy together. A basket would have something like milk, eggs, and bread in it. You can then create a rule: {milk,eggs}=>{bread}, meaning that the presence of milk and eggs is associated with the presence of bread. If you do this for every basket, you’ll see some rules come up more often than others. Rules like {milk,eggs}=>{bread} and {chips,cola}=>{salsa} would appear more often than, say, {sausage,bacon}=>{halal chicken}. You can use different measures to answer questions like: “Would higher sales of nachos increase sales of salsa?” or “Is someone more likely to buy Advil if they’re buying diapers?”. With play-by-play data we can do the same for players and events, like {Phil Kessel, James van Riemsdyk}=>{Shot taken}, or {Marc-Andre Fleury}=>{Comically reckless puck play}.

As an example, we’ve got the 2013-14 Toronto Maple Leafs. The support is how often an event occurs, out of all events in the dataset; the confidence is the chance that an event happens if a particular set of players is on the ice; and the lift compares how often the players and the event show up together with how often they would if they were independent, so a lift close to 1 suggests there’s no real connection. For the Leafs, the most probable player-event is that Dion Phaneuf is on ice during an opponent’s shot. Next, it’s Phil Kessel being on the ice when the Leafs have a shot.

Rank Player(s) Event Support Confidence Lift
1 DION_PHANEUF SHOT_AGAINST 0.06218755 0.1680772 1.0388827
2 PHIL_KESSEL SHOT_FOR 0.05619953 0.1645753 1.3070524
3 JAMES_VAN_RIEMSDYK SHOT_FOR 0.05378234 0.1624896 1.2904881
4 CARL_GUNNARSSON SHOT_AGAINST 0.05361754 0.1782648 1.1018523
5 JAMES_VAN_RIEMSDYK SHOT_AGAINST 0.05356260 0.1618257 1.0002423
6 PHIL_KESSEL SHOT_AGAINST 0.05092567 0.1491313 0.9217781
7 CODY_FRANSON SHOT_AGAINST 0.05004670 0.1556467 0.9620497
8 DION_PHANEUF SHOT_FOR 0.04916772 0.1328879 1.0553920
9 JAKE_GARDINER SHOT_AGAINST 0.04905785 0.1444750 0.8929978
10 PHIL_KESSEL, JAMES_VAN_RIEMSDYK SHOT_FOR 0.04823381 0.1760931 1.3985262
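The three columns are tied together, and you can check that with a couple of rows. The sketch below takes the first two rules from the table above plus the overall SHOT_AGAINST and SHOT_FOR frequencies from the event table later in this post. Dividing confidence by the event's overall frequency should recover the lift, and dividing support by confidence gives the share of events the player was on the ice for (a number the table doesn't print).

```python
# Overall event frequencies (from the Leafs event table in this post).
EVENT_SUPPORT = {"SHOT_AGAINST": 0.16178652, "SHOT_FOR": 0.12591331}

# (player, event, support, confidence, lift) from the table above.
RULES = [
    ("DION_PHANEUF", "SHOT_AGAINST", 0.06218755, 0.1680772, 1.0388827),
    ("PHIL_KESSEL", "SHOT_FOR", 0.05619953, 0.1645753, 1.3070524),
]

checks = []
for player, event, support, confidence, lift in RULES:
    implied_lift = confidence / EVENT_SUPPORT[event]   # P(Y|X) / P(Y)
    on_ice_share = support / confidence                # P(XY) / P(Y|X) = P(X)
    checks.append((player, implied_lift, lift, on_ice_share))
    print(f"{player}: lift from table {lift:.4f}, recomputed {implied_lift:.4f}, "
          f"on ice for {on_ice_share:.1%} of events")
```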

The table below ranks player-events by confidence. If Kessel, JVR, Phaneuf, and Franson are all on the ice, there’s a 28.53% chance that the event is the Leafs taking a shot. The lift is 2.27, which means that the combination of players and the event is probably not a coincidence. If you want a simpler example, look at Colton Orr: when he’s on the ice, there’s a 24.79% chance that an event will be a Leaf (probably him) making a hit. The lift is 1.74, so the fact that the Leafs are hitting is likely because he’s on the ice. Looking at the table above, Dion Phaneuf is often on the ice when there’s a shot against, but the lift is close to 1; my interpretation is that he’s on the ice so often that he’s bound to be around when they’re stuck in their own end.

Number Player(s) Event Support Confidence Lift
1 PHIL_KESSEL, JAMES_VAN_RIEMSDYK, DION_PHANEUF, CODY_FRANSON SHOT_FOR 0.01032797 0.2852807 2.265692
2 PHIL_KESSEL, DION_PHANEUF, CODY_FRANSON SHOT_FOR 0.01098720 0.2824859 2.243495
3 JAMES_VAN_RIEMSDYK, DION_PHANEUF, CODY_FRANSON SHOT_FOR 0.01043784 0.2749638 2.183755
4 COLTON_ORR HIT_FOR 0.01587650 0.2478559 1.740633
5 DION_PHANEUF, CODY_FRANSON SHOT_FOR 0.01241554 0.2364017 1.877495
6 TYLER_BOZAK, PHIL_KESSEL, JAMES_VAN_RIEMSDYK, CODY_FRANSON SHOT_FOR 0.01285502 0.2186916 1.736842
7 TYLER_BOZAK, JAMES_VAN_RIEMSDYK, CODY_FRANSON SHOT_FOR 0.01345932 0.2156690 1.712837
8 PHIL_KESSEL, JAMES_VAN_RIEMSDYK, CODY_FRANSON SHOT_FOR 0.02126023 0.2086253 1.656897
9 DION_PHANEUF, JAY_MCCLEMENT SHOT_AGAINST 0.02010658 0.2067797 1.278102
10 TYLER_BOZAK, PHIL_KESSEL, CODY_FRANSON SHOT_FOR 0.01395374 0.2055016 1.632088

Looking at the events alone can give a decent overview of a team’s overall strategy. For the Leafs, it was some combination of getting outshot and hoping your goalie keeps you in the game, while outhitting your opponents and hoping that generates offense somehow. One thing to note is just how badly the Leafs get outshot. Out of all events considered, about 30.25% are the other team attempting a shot (on goal, missed, or blocked), versus 23.41% for the Leafs attempting one. This suggests that the Leafs spend a lot of time playing in their own defensive zone.

Rank Event Support
1 SHOT_AGAINST 0.16178652
2 HIT_FOR 0.14239411
3 SHOT_FOR 0.12591331
4 HIT_AGAINST 0.12036478
5 THEY_MISS 0.07059276
6 BLOCK_FOR 0.07020821
7 BLOCK_AGAINST 0.06086909
8 WE_GIVE 0.04900291
9 WE_MISS 0.04751964
10 THEY_GIVE 0.04482778
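The shot percentages quoted above can be reproduced (to within rounding) by adding up every kind of shot attempt in this table. The sketch assumes BLOCK_FOR means the Leafs blocking an opponent's attempt and BLOCK_AGAINST means a Leafs attempt being blocked; that reading is what makes the sums line up.

```python
# Event frequencies from the table above.
LEAFS_EVENT_SUPPORT = {
    "SHOT_AGAINST": 0.16178652,
    "HIT_FOR": 0.14239411,
    "SHOT_FOR": 0.12591331,
    "HIT_AGAINST": 0.12036478,
    "THEY_MISS": 0.07059276,
    "BLOCK_FOR": 0.07020821,      # assumed: Leafs block an opponent's attempt
    "BLOCK_AGAINST": 0.06086909,  # assumed: a Leafs attempt gets blocked
    "WE_MISS": 0.04751964,
}

s = LEAFS_EVENT_SUPPORT
attempts_against = s["SHOT_AGAINST"] + s["THEY_MISS"] + s["BLOCK_FOR"]
attempts_for = s["SHOT_FOR"] + s["WE_MISS"] + s["BLOCK_AGAINST"]
attempt_share = attempts_for / (attempts_for + attempts_against)
hit_share = s["HIT_FOR"] / (s["HIT_FOR"] + s["HIT_AGAINST"])

print(f"opponent attempts: {attempts_against:.2%} of events")
print(f"Leafs attempts:    {attempts_for:.2%} of events")
print(f"Leafs share of all attempts: {attempt_share:.3f}")
print(f"Leafs share of all hits:     {hit_share:.3f}")
```

The last two lines are the same idea as the Corsi and hit percentages from earlier: below 0.500 means getting out-attempted, above 0.500 means out-hitting the opponent.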

One last thing you can do is visualize these rules with the R package arulesViz. This organizes the rules we’ve created by lift and then groups them together. You can get a rough idea of what happens on the ice given that certain players are present. Darker circles mean it’s more likely that a player’s presence is causing an event, and larger circles mean that the event is more common. A large, dark circle (like SHOT_FOR under Phil Kessel) suggests that the player-event combination is frequent and likely caused by the player’s presence. A small, darker circle (like HIT_FOR under Colton Orr) suggests that a player has a very focused purpose — in this case, Colton Orr isn’t on the ice much, but when he is, he’s out to hit somebody.

Leafs association rules matrix

Of course, everything here needs to be taken in context. Team strategy impacts individual players, so in some cases it can increase or decrease a player’s performance in some areas. Regardless, there’s a lot to look at here — we’re just scratching the surface.

Technical notes

  • Some games may be incomplete because not all events past a certain point were recorded. There were around 5600 games in the last five seasons, so I didn’t have time to verify all of them. The aggregate statistics (e.g. shots over a season) seem to match up though, so this is accurate enough for our purposes.
  • Some games are missing from the NHL website, so they were scraped from the Internet Archive (archive.org). Game 0836 from 2009-10 is missing entirely. If you have any conspiracy theories on why Bettman doesn’t want us to know what really happened in Buffalo that night, let me know.
  • The events available in the play-by-play are: hits, shots, blocks, missed shots, takeaways, giveaways, penalties (including fights), goals, faceoffs, play stoppages (e.g. goalies freezing the puck, offside, icing, TV timeouts, etc.), period start, period end, game end, extended intermission start, extended intermission end, and shootout.
  • The events I considered were hits, shots, blocked shots, missed shots, takeaways, giveaways, penalties, and goals. I consider these “active” events, since they happen during play. The others have their own analytical value and I plan to look at them another time.
  • Association rule learning can be written in terms of probabilities:
    • Support(X) = P(X)
    • Confidence(X => Y) = P(Y|X)
    • Lift(X => Y) = P(XY)/[P(X)P(Y)]
      • Two events X and Y are independent iff P(XY)=P(X)P(Y). A lift of 1 is evidence that two events are independent — a player-event combination with a lift of 1 probably has no significance. You can find more on lift here.
    • The plot from arulesViz defaults to k-means clustering on lift to group rules together. It’s a novel technique and is fun to play around with.
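To tie the probability definitions above to actual counting, here is a small sketch that computes support, confidence, and lift for one rule directly from a list of events. The real analysis was run in R; the function, the single-letter players, and the ten events below are all invented for illustration.

```python
def rule_metrics(records, players, event):
    """Support, confidence, and lift for the rule {players} => {event}.

    records is a list of (set of players on the ice, event name) pairs.
    """
    n = len(records)
    with_players = sum(1 for on_ice, _ in records if players <= on_ice)
    with_event = sum(1 for _, name in records if name == event)
    with_both = sum(
        1 for on_ice, name in records if players <= on_ice and name == event
    )
    if n == 0 or with_players == 0 or with_event == 0:
        raise ValueError("players and event must each appear at least once")
    support = with_both / n                      # P(XY)
    confidence = with_both / with_players        # P(Y|X)
    lift = support / ((with_players / n) * (with_event / n))  # P(XY)/[P(X)P(Y)]
    return support, confidence, lift


# Ten invented events with made-up players A, B, C, D.
toy_records = [
    ({"A", "B"}, "SHOT_FOR"),
    ({"A", "B"}, "SHOT_FOR"),
    ({"A", "C"}, "SHOT_FOR"),
    ({"A", "C"}, "HIT_FOR"),
    ({"A", "B"}, "SHOT_AGAINST"),
    ({"B", "C"}, "SHOT_FOR"),
    ({"B", "C"}, "SHOT_AGAINST"),
    ({"B", "C"}, "HIT_FOR"),
    ({"C", "D"}, "SHOT_AGAINST"),
    ({"C", "D"}, "HIT_AGAINST"),
]

toy_support, toy_confidence, toy_lift = rule_metrics(toy_records, {"A"}, "SHOT_FOR")
print(f"support={toy_support:.2f} confidence={toy_confidence:.2f} lift={toy_lift:.2f}")
```

Player A is on the ice for half of the events and for three of the four shots for, so the lift comes out above 1: shots for happen more often with A on the ice than you'd expect if the two were unrelated.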