Alex Diaz


Measuring Corsi For and Against with Association Rules

I’ve previously discussed potential applications of association rule learning in the second half of a post from last year. The arulesViz package for R is broken for me right now, so I won’t be able to recreate the graph at the bottom of that post without considerable effort. For now, we’ll use graphics I’ve generated on my own. While I’ve created some interesting results, I would like to strongly emphasize that this is essentially a prototype, so going forward one must remember the assumptions being made and the contexts of the statistics being analyzed.

Last weekend I presented some of my findings at Ottawa Hockey Analytics, and what I’m writing about today is largely the same. My data comes entirely from the NHL’s Play By Play (PBP) and Player TOI tables. For this work it suffices to use PBP data on its own; I find that Corsi events reliably record the players present on the ice. However, I originally intended this dataset to have other purposes, so I stripped the PBP of its player information and added it back using the TOI tables. There may be some errors as a consequence (e.g., if shifts were not properly recorded). I organized the data into binary columns for every event and player. If a player is on the ice during an event, he is marked as a 1; those off the ice are marked as 0s. Events are recorded similarly. This is a sample of what I end up with:

Transaction dataset example (2013-14 LA)
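For readers who want to try this layout themselves, here is a minimal sketch of how such a binary table could be built. It assumes each Corsi event is already available as a record listing the players on the ice and the event type. The records are made up, and the record format and the helper name `to_transactions` are hypothetical; this is an illustration, not the code used for the analysis.

```python
# Made-up event records: who was on the ice, and what happened.
events = [
    {"players": ["ANZE_KOPITAR", "DWIGHT_KING"], "event": "CORSI_FOR"},
    {"players": ["ANZE_KOPITAR", "TYLER_TOFFOLI"], "event": "CORSI_AGAINST"},
    {"players": ["DWIGHT_KING", "TYLER_TOFFOLI"], "event": "CORSI_FOR"},
]


def to_transactions(events):
    """Return (columns, rows): one 0/1 row per event, one column per item.

    Items are both players and event types, so every row carries a 1 for
    each player on the ice and a 1 for the event that occurred.
    """
    players = {p for e in events for p in e["players"]}
    event_types = {e["event"] for e in events}
    columns = sorted(players | event_types)
    rows = [
        [1 if (c in e["players"] or c == e["event"]) else 0 for c in columns]
        for e in events
    ]
    return columns, rows


columns, rows = to_transactions(events)
print(columns)
for row in rows:
    print(row)
```

Each row is one itemset in the sense used below: the columns holding a 1 are the items that appeared together.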


In association rule learning, a binary dataset like this one can be thought of as a big list of itemsets. For our purposes, our items are players and events, and we’d like to measure how often they appear together. Our motivation is to highlight players’ presences during “good” or “bad” events; in this example, we’ll use Corsi For and Corsi Against, respectively. This technique is not limited to Corsi events — in fact, I’d like to expand it to many other events — but for now they’re the easiest to record and they have predictive value.

Before showing the results, I’ll provide a quick primer:

  • An itemset X is a collection of items. It’s basically a row in our dataset. In the screenshot, our first itemset is {ANZE_KOPITAR, JONATHAN_QUICK, DWIGHT_KING, CORSI_FOR}. Eventually I removed the goaltenders – I believe they have analytic value, but for now I’m keeping it simple. Well, simple-ish.
  • A rule, written X => Y, is an implication between itemsets X and Y.
    • Rules are split into the left-hand side (LHS, or “antecedent”) and right-hand side (RHS, or “consequent”)
    • Here, we only consider players on the LHS and events on the RHS
    • For example, our rules will appear as {ANZE_KOPITAR, DWIGHT_KING} => {CORSI_FOR}

The following metrics are interest measures. As the name implies, they are meant to highlight interesting relationships between variables.

  • The support of an itemset X is defined as the proportion of the database in which X appears
    • For us, a higher support means that the combination of players and events happens more often. Players with more ice time will necessarily have higher support simply because they’re on the ice when more things happen
  • The confidence of a rule X => Y is the ratio of the support for X and Y to the support of X alone.
    • CONF(X => Y) = SUPP(X, Y)/SUPP(X)
  • The lift of a rule X => Y is the ratio of the support for X and Y to the product of the supports of X and Y individually
    • LIFT(X => Y) = SUPP(X, Y)/[SUPP(X)*SUPP(Y)]
  • The difference of confidence of a rule X => Y is the difference of confidence between X => Y and ¬X => Y
    • DOC(X => Y) = CONF(X => Y) – CONF(¬X => Y)
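The four measures above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: each transaction is represented as a set of item names, the eight transactions are made up, and the function names are my own for this example rather than anything from arules or the analysis code.

```python
def support(transactions, items):
    """Proportion of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, lhs, rhs):
    """CONF(X => Y) = SUPP(X, Y) / SUPP(X). Assumes X appears at least once."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)


def lift(transactions, lhs, rhs):
    """LIFT(X => Y) = SUPP(X, Y) / [SUPP(X) * SUPP(Y)]."""
    return confidence(transactions, lhs, rhs) / support(transactions, rhs)


def difference_of_confidence(transactions, lhs, rhs):
    """DOC(X => Y) = CONF(X => Y) - CONF(not X => Y).

    Assumes X appears in some, but not all, transactions.
    """
    lhs, rhs = set(lhs), set(rhs)
    with_x = [t for t in transactions if lhs <= t]
    without_x = [t for t in transactions if not lhs <= t]
    conf_x = sum(rhs <= t for t in with_x) / len(with_x)
    conf_not_x = sum(rhs <= t for t in without_x) / len(without_x)
    return conf_x - conf_not_x


# Made-up transactions, one set per Corsi event.
transactions = [
    {"ANZE_KOPITAR", "DWIGHT_KING", "CORSI_FOR"},
    {"ANZE_KOPITAR", "DWIGHT_KING", "CORSI_FOR"},
    {"ANZE_KOPITAR", "DWIGHT_KING", "CORSI_AGAINST"},
    {"ANZE_KOPITAR", "TYLER_TOFFOLI", "CORSI_FOR"},
    {"DWIGHT_KING", "TYLER_TOFFOLI", "CORSI_AGAINST"},
    {"TYLER_TOFFOLI", "CORSI_AGAINST"},
    {"TYLER_TOFFOLI", "CORSI_FOR"},
    {"DWIGHT_KING", "CORSI_AGAINST"},
]

X = {"ANZE_KOPITAR", "DWIGHT_KING"}
Y = {"CORSI_FOR"}

print("support   :", support(transactions, X | Y))
print("confidence:", confidence(transactions, X, Y))
print("lift      :", lift(transactions, X, Y))
print("DOC       :", difference_of_confidence(transactions, X, Y))
```

In this toy data the pair appears in 3 of 8 events and 2 of those are Corsi For, so the confidence is 2/3, compared with 2/5 for the events where the pair is not on the ice together.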

Interest measures can be interpreted in a probability context. If we restrict our sample space to all Corsi events over an entire season, we are measuring the probability that Corsi events occur with respect to the players on the ice. Letting X = {Players on ice} and Y = {Event}, we get:

  • SUPP(X, Y) = P(X, Y) = Probability that a Corsi event and a particular combination of players occurs
  • CONF(X => Y) = P(X, Y)/P(X) = P(Y|X) = Probability that Y occurs, given that player combination X is on the ice
  • LIFT(X => Y) = P(X, Y)/[P(X)P(Y)]. Statistical independence between events X and Y is defined as P(X, Y) = P(X)P(Y), so a lift close to 1 suggests that X and Y are roughly independent.
  • DOC(X => Y) = P(Y|X) – P(Y|¬X) = Probability that an event occurs given player combination X is on the ice – Probability that an event occurs given player combination X is not on the ice
    • Note that X is the entire combination of players. If X = {KOPITAR, KING}, then ¬X = {all itemsets that do not contain both KOPITAR and KING}. Thus, ¬X includes all events where: (1) Kopitar is on the ice but King is not; (2) King is on the ice but Kopitar is not; and (3) neither Kopitar nor King is on the ice
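To make the role of ¬X concrete, here is a small worked example with made-up counts (the numbers and group labels are hypothetical, chosen only so the arithmetic is easy to follow). It pools the three "not both" cases into ¬X, and checks two identities that follow from the definitions above: P(Y) = P(Y|X)P(X) + P(Y|¬X)P(¬X), and DOC = [P(Y|X) – P(Y)]/P(¬X). The second one implies that the difference of confidence is positive exactly when the lift is above 1.

```python
# Made-up event counts for X = {KOPITAR, KING}, split by who is on the ice.
counts = {
    "both":         {"CORSI_FOR": 30, "CORSI_AGAINST": 20},
    "kopitar_only": {"CORSI_FOR": 25, "CORSI_AGAINST": 25},
    "king_only":    {"CORSI_FOR": 10, "CORSI_AGAINST": 15},
    "neither":      {"CORSI_FOR": 35, "CORSI_AGAINST": 40},
}
EVENT = "CORSI_FOR"  # the event Y we condition on


def total(groups):
    """Number of events of any type in the given groups."""
    return sum(sum(counts[g].values()) for g in groups)


def hits(groups):
    """Number of EVENT occurrences in the given groups."""
    return sum(counts[g][EVENT] for g in groups)


x_groups = ["both"]
# not-X pools every case where the full combination is not on the ice.
not_x_groups = ["kopitar_only", "king_only", "neither"]
all_groups = x_groups + not_x_groups

p_x = total(x_groups) / total(all_groups)              # P(X)
p_y = hits(all_groups) / total(all_groups)             # P(Y)
conf_x = hits(x_groups) / total(x_groups)              # P(Y|X)
conf_not_x = hits(not_x_groups) / total(not_x_groups)  # P(Y|not X)

doc = conf_x - conf_not_x
lift = conf_x / p_y

print("P(Y|X) =", conf_x, " P(Y|not X) =", conf_not_x)
print("DOC =", doc, " lift =", lift)

# Law of total probability, and DOC rewritten in terms of P(Y).
assert abs(p_y - (conf_x * p_x + conf_not_x * (1 - p_x))) < 1e-9
assert abs(doc - (conf_x - p_y) / (1 - p_x)) < 1e-9
```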

With all that in mind, here is a description of what I’m working with:

  • The dataset contains all even-strength 5v5 tied Corsi events from 2013-14
  • All Corsi events are treated equally. I haven’t made adjustments for quality of competition.
  • We are limited to how often player-event combinations occur within a team. There are two sides to this: the first is how often a situation exists; the second is how often a player is in the situation to begin with
  • All analysis was done within teams. Comparing between teams may not be useful.

I believe the best application of this analysis is to compare how players are faring in their current roles on their teams. We can highlight player chemistry, their performances relative to teammates in the same position, and (hopefully) hidden potential. Over time, we may also be able to use multiple seasons to see how a player has performed with different teammates and team strengths.

I’ve generated six graphs for each team:

  1. All individual performances
  2. Defensemen, as individuals
  3. Defensemen, as pairs
  4. Forwards, as individuals
  5. Forwards, as pairs
  6. Forwards, as trios

Support for Corsi Against is on the left; support for Corsi For is on the right. More green means that the Difference of Confidence is higher for that statistic, implying that the event is more likely given that the specific player combination is on the ice. More purple is the opposite (an event is relatively less likely given that player combination). Rules with fewer than 25 occurrences were dropped, so if a column is missing on one side it’s probably because it missed that threshold.
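A rough sketch of how rules for individuals, pairs and trios could be counted and then thresholded is below. It assumes transactions are sets of item names, uses made-up data, and does not separate forwards from defensemen; the names `count_rules` and `frequent_rules` are hypothetical and this is not the code behind the graphs. Only the cutoff of 25 occurrences comes from the description above.

```python
from collections import Counter
from itertools import combinations

MIN_COUNT = 25  # rules with fewer occurrences than this are dropped
EVENTS = {"CORSI_FOR", "CORSI_AGAINST"}


def count_rules(transactions, size):
    """Count how often each player combination of `size` occurs with each event.

    Keys are (players, event) pairs, i.e. players on the LHS, event on the RHS.
    """
    counts = Counter()
    for t in transactions:
        players = sorted(t - EVENTS)  # sorted so each combination has one key
        for event in t & EVENTS:
            for combo in combinations(players, size):
                counts[(combo, event)] += 1
    return counts


def frequent_rules(counts, min_count=MIN_COUNT):
    """Keep only the rules that occur at least `min_count` times."""
    return {rule: n for rule, n in counts.items() if n >= min_count}


# Made-up transactions: repeated rows stand in for repeated events.
transactions = (
    [{"ANZE_KOPITAR", "DWIGHT_KING", "CORSI_FOR"}] * 30
    + [{"ANZE_KOPITAR", "DWIGHT_KING", "CORSI_AGAINST"}] * 10
    + [{"ANZE_KOPITAR", "TYLER_TOFFOLI", "CORSI_FOR"}] * 26
)

pair_counts = count_rules(transactions, size=2)
kept_pairs = frequent_rules(pair_counts)
for (players, event), n in sorted(kept_pairs.items()):
    print(players, "=>", event, n)
```

In this toy data the Kopitar and King pair has 30 Corsi For events but only 10 Corsi Against, so its Corsi Against rule falls below the cutoff. That is the kind of case where a column would be missing on one side of a graph.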

Important note: The initial batch of images I uploaded on Feb 13th had some errors based on when players were on the ice. I have since fixed the error in my code but I won’t be immediately redoing the analysis for each team. Here is a direct link to the album.

Examples and discussion

Toronto Maple Leafs, 2013-14:

  • Possession numbers are terrible across the board
  • Phaneuf is on the ice for about a fifth of even-strength tied shot attempts against with a very strong difference of confidence against him
  • Jake Gardiner and Morgan Rielly have the strongest indication in favour of possession among defensive pairings
  • Forward possession is driven by Kessel and Van Riemsdyk. The forward pairs suggest that Kessel and JVR have stronger chemistry with each other than either has with Bozak
    • Kadri’s supports for Corsi For and Against are better when he is paired with either winger, with a stronger difference of confidence, suggesting he could be a better first-line centre
  • Lupul and McClement had awful years

Los Angeles Kings, 2013-14:

  • The top pairing of Doughty-Muzzin is strong. The difference of confidence at an individual level is much higher in Muzzin’s favour, though he and Doughty don’t seem to have shared defence partners at any point
  • Anze Kopitar drives play strongly
  • Tyler Toffoli had a strong year with a variety of linemates


