You are currently browsing the category archive for the ‘nhl’ category.

I was prepping my hockey data today to try out a few machine learning algorithms, and ran into a bit of a pickle.

One of the metrics I wanted to extract was the number of hits per team per game, so I tapped in this query:

(def team-hits "
prefix : <http://www.nhl.com/>
construct { _:x a :HitCount; :team ?team; :game ?g; :value ?value}
{ select ?g ?team (count(?e) as ?value)
  { ?g :team ?team .
     optional { ?g :play ?e  . ?e a :Hit;  :team ?team .}
  }
  group by ?g ?team }
")

The query took a little longer to run than expected, but I chalked that up to cosmic radiation. When I looked at my results, though, I was a bit shocked: over 300,000 answers! Sure, there are 1230 games in a season, and I was extracting 32 facts for each one, but that’s a whole order of magnitude lower than what I was seeing.

After a little while full of WTFs and glances at the SPARQL spec for aggregations over optional clauses, it hit me – I had overloaded the nhl:team property.  Both games and events had nhl:team facts (nhl:team is a super-property of both nhl:hometeam and nhl:awayteam):

nhl:game-1 nhl:hometeam nhl:team-10 .
nhl:game-1 nhl:awayteam nhl:team-8 .
nhl:eid-TOR51 nhl:team nhl:team-10 .
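The overload is easy to reproduce outside of Sparql. Here’s a toy Python sketch of the same join over a handful of triples like the ones above – not my actual pipeline, just an illustration of why the first query’s ?g pattern matched events too:

```python
# Toy triple store mirroring the data above: both the game and a hit event
# carry a team fact, so a bare "?g :team ?team" pattern matches both.
triples = [
    ("game-1", "type", "Game"),
    ("game-1", "team", "team-10"),
    ("game-1", "team", "team-8"),
    ("eid-TOR51", "team", "team-10"),  # an event, not a game
]

# First query's pattern: any subject with a team fact is treated as a game.
naive_games = {s for (s, p, o) in triples if p == "team"}

# Rewritten query's pattern: additionally require an explicit type fact.
typed = {s for (s, p, o) in triples if p == "type" and o == "Game"}
fixed_games = naive_games & typed
```

The naive set picks up the event id, and the count explodes accordingly once you group by it.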

I had a choice: either I could redo my JSON parse or my SPARQL query.  Both required about the same amount of effort, but a full parse of all the game files would take over 40 minutes.  Thus, I rewrote the query:

(def team-hits "
prefix : <http://www.nhl.com/>
construct { _:x a :HitCount; :team ?team; :game ?g; :value ?value}
{ select ?g ?team (count(?e) as ?value)
  { ?g a :Game; :team ?team .
     optional { ?g :play ?e  . ?e a :Hit;  :team ?team .}
  }
  group by ?g ?team }
")

To make this work, I had to add facts asserting that each game is indeed an instance of the nhl:Game class.  Fortunately, I had two properties whose domain was precisely hockey games:

nhl:hometeam
  rdfs:domain nhl:Game ;
  rdfs:range nhl:Team ;
  rdfs:subPropertyOf nhl:team .

nhl:awayteam
  rdfs:domain nhl:Game ;
  rdfs:range nhl:Team ;
  rdfs:subPropertyOf nhl:team .
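The rdfs:domain inference that produces those type facts is simple enough to sketch in a few lines of Python (a toy closure over the two declarations above, not a real RDFS reasoner):

```python
# rdfs:domain says: if a property has domain nhl:Game, any subject that
# uses it must be a Game. Derive those type facts from the declarations.
domains = {"hometeam": "Game", "awayteam": "Game"}

facts = [
    ("game-1", "hometeam", "team-10"),
    ("game-1", "awayteam", "team-8"),
]

# These derived triples are what make the "?g a :Game" pattern succeed.
inferred = {(s, "a", domains[p]) for (s, p, o) in facts if p in domains}
```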

As luck would have it, somebody at tonight’s SemWeb meetup group wondered aloud whether anybody found rdfs:domain and rdfs:range useful.  Neat-o.


And by best, I mean the most physical as measured by hits.  In a previous post, I thought it would be interesting to predict the most physical games in a season based on the previous actions of the players.  So a good first step is to measure each game by its physicality and save off the data.

Unfortunately, the raw data I’m working with doesn’t provide identifiers for the games themselves.  People, teams, and game events (like hits) get IDs, but the games end up with a blank node.  Kind of ruins the whole ‘list the games by physicality’ idea, so on to plan B: pairings.

Since teams get identifiers in the dataset, I could get the home and visiting teams for each game.  Most pairs of teams will play each other more than twice in a season, so a pairing for “New York Rangers at Buffalo Sabres” will include two games, as will “Buffalo Sabres at New York Rangers”.  Not all pairings will be symmetric, though – the Rangers hosted the LA Kings for the only match-up between the teams in 2010-11.
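Since ordered pairings like these are just home/away pairs, the tally is easy to picture in Python (made-up schedule slice, not real data):

```python
from collections import Counter

# Hypothetical slice of a schedule as (home, away) pairs. Ordered pairings
# like (Sabres, Rangers) and (Rangers, Sabres) count separately.
schedule = [
    ("Sabres", "Rangers"),
    ("Sabres", "Rangers"),
    ("Rangers", "Sabres"),
    ("Rangers", "Kings"),   # the lone Kings-at-Rangers match-up
]

pairings = Counter(schedule)
```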

The game data is broken up into 1230 n-triples files, each corresponding to a game.  I ended up naming them file-1.nt, file-2.nt, etc.  This is convenient for defining functions that build an RDF model from each game from N to M (e.g. file-23 to file-100):

(defn get-data [n m]
  (map #(str "data/file-" % ".nt") (range n (+ 1 m))))

(defn get-model [n m]
  (apply build (get-data n m)))

Since physicality is based on the hits recorded in each game, a good place to start is a construct query I called ‘game-hits’.

(def game-hits "
  prefix nhl: <http://www.nhl.com/>
  construct { _:z a nhl:GameHits . _:z nhl:hometeam ?x .
	      _:z nhl:awayteam ?y . _:z nhl:game ?g . _:z nhl:value ?v }
	{ select ?x ?y ?g (count(?h) as ?v)
	  {  ?g nhl:hometeam ?x . ?g nhl:awayteam ?y .
	     ?g nhl:play ?h . ?h a nhl:Hit . filter(?x != ?y)}
	  group by ?x ?y ?g }
")

This construct has to be pulled from each file of game data I have for the 2010-11 season, so I next write two short functions to run through each game file and execute the construct, iteratively building a model of just the facts I need.  This way, I keep the model size down.

(defn build-game-model [s]
  (let [n1 (get-model (first s) (last s))]
    (build (pull game-hits n1))))

(defn build-model [n m binsize]
  (reduce build
          (map build-game-model
               (partition-all binsize (range n (+ m 1))))))

Finally, I’m ready to save off my first batch of summary data: the number of hits for each home-away team pairing, and the number of games that constitute that pairing.

(def teams-hits "
prefix nhl: <http://www.nhl.com/>
construct { _:z a nhl:HitsSummary . _:z nhl:hometeam ?x .
	    _:z nhl:awayteam ?y . _:z nhl:games ?games . 
        _:z nhl:value ?value }
	{ select ?x ?y (count(?g) as ?games) (sum(?v) as ?value)
	  { ?z a nhl:GameHits . ?z nhl:hometeam ?x .
	    ?z nhl:awayteam ?y . ?z nhl:game ?g . ?z nhl:value ?v . 
        filter(?x != ?y)}
	group by ?x ?y	}
")

I’ve described how I pulled geo-coordinates for each team in a previous post, so I’ll skip the details and note the final construct query that joins the DBpedia data with my raw NHL data:

(def city-coords "
prefix nhl: <http://www.nhl.com/>
prefix dbo: <http://dbpedia.org/ontology/>
prefix dbp: <http://dbpedia.org/property/>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
construct { ?x nhl:latitude ?lat . ?x nhl:longitude ?lon . 
            ?x nhl:cityname ?cityname . ?x nhl:name ?a }
{ ?x a nhl:Team . ?y a dbo:HockeyTeam .
  ?x rdfs:label ?a . ?y rdfs:label ?b . filter(str(?a) = str(?b))
  ?y nhl:latitude ?lat . ?y nhl:longitude ?lon . ?y dbp:city ?cityname}
")

Finally, I have a select query that defines the information I want for my physicality scale (with geo-coordinates thrown in for the home team for kicks):

(def hits-summary-names "
prefix nhl: <http://www.nhl.com/>
select ?visitor ?home (?v / ?g as ?avg) ?lat ?lon
{ ?z a nhl:HitsSummary . ?z nhl:hometeam ?x . 
  ?x nhl:latitude ?lat . ?x nhl:longitude ?lon .
  ?z nhl:awayteam ?y . ?z nhl:games ?g . ?z nhl:value ?v . 
  filter(?x != ?y) .
  ?x nhl:name ?home . ?y nhl:name ?visitor }
order by desc(?avg)
")
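As a sanity check on that ?v / ?g average outside of Sparql, here’s how it shakes out over a couple of made-up summary rows (Python, hypothetical numbers):

```python
# Hypothetical HitsSummary rows: (home, away, games, total_hits),
# mirroring the select query's (?v / ?g as ?avg) computation.
rows = [
    ("Sabres", "Rangers", 2, 88),
    ("Rangers", "Kings", 1, 51),
]

avg_hits = {(home, away): v / g for (home, away, g, v) in rows}

# order by desc(?avg)
ranked = sorted(avg_hits.items(), key=lambda kv: kv[1], reverse=True)
```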

At this point I’m ready to run my program, which consists of just a few lines of code:

(stash (pull teams-hits (build-model 1 1230 5)) "files/teams-hits.nt")
(stash (pull team-city-constr dbp) "files/team-city.nt")
(make-geo-facts)
(stash (pull city-coords 
          (build "files/geo-facts.nt" "files/team-city.nt" 
                 "files/ontology.ttl" "files/nhl.rules" (get-model 1 35))) 
  "files/city-coords.nt")
(def m (build "files/city-coords.nt" "files/teams-hits.nt"))
(view (histogram :avg :data (bounce hits-summary-names m) :nbins 5))
(view (bounce hits-summary-names m))

The distribution of physicality per pairing:

The top pairings by physicality:

The top teams by physicality:

The full code can be found on GitHub.  The next step is to figure out what factors can be used to predict this physicality distribution.  Enforcement will be one of them, but it would help to have a slew of others to mix in.  At this point, I’m thinking a principal components analysis would be interesting to run.

I was all ready to grab some geo-data and see where I could find a good physical game of hockey.  I’d browsed dbpedia and saw how to connect NHL teams, the cities they play in, and the lat-long coordinates of each.  I was so optimistic.

Then it turns out that DBpedia wanted me to do a special dance.  I’ve been quite happy with the features of the upcoming Sparql 1.1 spec.  Since ARQ stays on top of the spec, I’ve managed to forget what Sparql 1.0 was missing.  Well, ‘if’ clauses for one, but I managed to design around that in my last post.  A real sticking point, though, was the inability to wrap a construct query around a select query, like so:

prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dbo: <http://dbpedia.org/ontology/>
prefix dbp: <http://dbpedia.org/property/>
prefix nhl: <http://www.nhl.com/>
construct { ?team nhl:latitude ?v . ?team nhl:name ?name }
{	select ?team ?name ( 1 * (?d + (((?m * 60) + ?s) / 3600.0)) as ?v)
	{ 	  ?team a dbo:HockeyTeam . ?team rdfs:label ?name .
		  ?team dbp:city ?cityname . ?city rdfs:label ?cityname .
		  ?city dbp:latd ?d; dbp:latm ?m; dbp:lats ?s .
		  filter ( lang(?name) = 'en') }}

The reason this is critical is that you can’t inject those arithmetic expressions into a construct clause.  And since I plan on working with the resulting data using Sparql, simply using select queries isn’t going to do it.

Thus, we need to break down the steps a bit more finely.  First, I’ll pull out the basic triples I intend to work with:

prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dbo: <http://dbpedia.org/ontology/>
prefix dbp: <http://dbpedia.org/property/>
prefix nhl: <http://www.nhl.com/>
construct { ?team a dbo:HockeyTeam . ?team rdfs:label ?name .
                ?team dbp:city ?cityname . ?city rdfs:label ?cityname .
                ?city dbp:latd ?d; dbp:latm ?m; dbp:lats ?s . }
   {   ?team a dbo:HockeyTeam . ?team rdfs:label ?name .
        ?team dbp:city ?cityname . ?city rdfs:label ?cityname .
        ?city dbp:latd ?d; dbp:latm ?m; dbp:lats ?s .
        filter ( lang(?name) = 'en') }

And crap – the data doesn’t fit very well.  Looks like the city names associated with hockey teams don’t cleanly match up to the cities we’re looking for in DBpedia.  Time for a second refactor…

After a few minutes of staring at the ceiling, I realized that I could use Google’s geocoding service to do my bidding.  Since their daily limit is 2500 requests, my measly 50ish cities would be well under.  So first, I grab just the info I need out of DBpedia – hockey teams and the cities they’re associated with:

 prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
 prefix dbo: <http://dbpedia.org/ontology/>
 prefix dbp: <http://dbpedia.org/property/>
 prefix nhl: <http://www.nhl.com/>
 construct { ?team a dbo:HockeyTeam . ?team rdfs:label ?name .
                 ?team dbp:city ?cityname . ?city rdfs:label ?cityname . }
      { ?team a dbo:HockeyTeam . ?team rdfs:label ?name .
         ?team dbp:city ?cityname . ?city rdfs:label ?cityname .
         filter ( lang(?name) = 'en') }

And use this query with a bit of Clojure to pull out my geocoding facts, saving them off as an n-triples file:

prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dbo: <http://dbpedia.org/ontology/>
prefix dbp: <http://dbpedia.org/property/>
prefix nhl: <http://www.nhl.com/>
select distinct ?team ?name ?city ?cityname
{ ?team a dbo:HockeyTeam . ?team rdfs:label ?name .
  ?team dbp:city ?cityname . ?city rdfs:label ?cityname .
  filter ( lang(?name) = 'en') }
(defn get-geo-fact [row]
  (let [n   (string/replace (:cityname row) " " "+")
        x   (json/read-json
              (slurp (str "http://maps.googleapis.com/maps/api/geocode/json?address="
                          n "&sensor=false")))
        g   (:location (:geometry (first (:results x))))
        lat (str "<" (:city row) ">"
                 " <http://www.nhl.com/latitude> "
                 (:lat g) " .")
        lon (str "<" (:city row) ">"
                 " <http://www.nhl.com/longitude> "
                 (:lng g) " .")]
    [lat lon]))

(defn make-geo-facts []
  (let [a (bounce team-city dbp)
        f "files/geo-facts.nt"]
    (spit f (string/join "\n" (flatten (map get-geo-fact (:rows a)))))))
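The n-triples formatting step is worth seeing in isolation. Here’s a Python sketch of the same idea, with a canned geocoder response (made-up coordinates and city URI) standing in for the live Google call:

```python
# Canned response in the shape of Google's geocoding JSON; the coordinates
# and the city URI below are made up for illustration.
canned = {"results": [{"geometry": {"location": {"lat": 42.88, "lng": -78.88}}}]}

def geo_facts(city_uri, response):
    """Format a city's lat/long as two n-triples lines."""
    loc = response["results"][0]["geometry"]["location"]
    lat = f"<{city_uri}> <http://www.nhl.com/latitude> {loc['lat']} ."
    lon = f"<{city_uri}> <http://www.nhl.com/longitude> {loc['lng']} ."
    return [lat, lon]

facts = geo_facts("http://example.org/Buffalo", canned)
```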

The results are created with the following two calls at the REPL:

(stash (pull team-city-constr dbp) "files/team-city.nt")
(make-geo-facts)

Now that I have geo-data, I can finish with hits-per-game as a rough cut at a physicality scale for games, and see where the action was this season.  I wonder if the Islanders and Rangers still go at it.

This post’ll be a short one tonight, since I just had to watch Adventureland (and thank god I did – that’s a great movie).

The analysis of enforcement doesn’t seem well-suited to a statistical approach, and I think it’s because there’s no pattern or distribution to be explained or predicted.  There aren’t any awards for enforcement, and top-ten lists of enforcers aren’t sufficiently … important? … to warrant prediction.

However, predicting which cities in the NHL will host the games with the most physical action is undeniably of public benefit.  As the League has cracked down on fighting and physical play over the years, it’s become increasingly important to find out where the good physical games will be played.

Although consideration for a scale of game physicality is appropriate, for now we’ll just tally the number of hits per game between each pairing of teams:

prefix nhl: <http://www.nhl.com/>
construct {
  _:z a nhl:HitsSummary . _:z nhl:participant ?x .
  _:z nhl:participant ?y . _:z nhl:value ?v }
  {  select ?x ?y (count(?h) as ?v)
     {  ?g nhl:hometeam ?x . ?g nhl:awayteam ?y .
        ?g nhl:play ?h . ?h a nhl:Hit . }
  group by ?x ?y	}

I’d like to take these results and tie them to the geo-coordinates of each team’s home city, so that I can generate a map depicting the relative physicality of each city that season.

To do this, I can use DBpedia’s endpoint to grab each team’s city’s lat/long.  On inspection, it seems that the DBpedia data provides the relevant geo-coordinates in decimal-minute-second form.  I’d much rather have it in decimal format (so that I can associate a single value for each of latitude and longitude to a team), so a little arithmetic is in order.  Fortunately, Sparql 1.1 allows arithmetic expressions in the select clauses.  Here’s the longitude query (with the latitude query being similar):

prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dbo: <http://dbpedia.org/ontology/>
prefix dbp: <http://dbpedia.org/property/>
prefix nhl: <http://www.nhl.com/>
construct { ?team nhl:longitude ?v . ?team nhl:name ?name }
  {  select ?team ?name
            ( -1 * (?d + (((?m * 60) + ?s) / 3600.0)) as ?v)
     {  ?team a dbo:HockeyTeam . ?team rdfs:label ?name .
        ?team dbp:city ?cityname . ?city rdfs:label ?cityname .
        ?city dbp:longd ?d; dbp:longm ?m; dbp:longs ?s .
        filter ( lang(?name) = 'en') }}

The DBpedia server is a little persnickety about memory, so it complains if I try to ask for lat and long in the same query.

You might be wondering about that -1 in the front of that arithmetic.  Longitude is positive or negative depending on whether it’s east or west of the Prime Meridian.  Since all the NHL hockey teams are west of the Meridian, the decimal longitude is negative.  If the DBpedia endpoint were compliant with the very latest Sparql 1.1 spec, I could have used an IF operator to interpret the East/West part of the coordinate.  However, it seems that feature isn’t implemented yet in DBpedia’s server, so this’ll have to do.
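The conversion itself is the same arithmetic as the query, with the sign handled by a flag instead of a hard-coded -1. A quick Python sketch (the function name is mine):

```python
def dms_to_decimal(d, m, s, west=False):
    """Degree-minute-second to decimal degrees, mirroring the query's
    ?d + (((?m * 60) + ?s) / 3600.0), negated for points west of the
    Prime Meridian."""
    value = d + ((m * 60) + s) / 3600.0
    return -value if west else value
```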

All that’s left is to query for each game, its home and away teams, and join those results with the relevant coordinates and hits-summary.  That query might take a few minutes for a whole season, but it’s a static result and thus safe to save off as an RDF file for quick retrieval later.

What does this have to do with enforcement?  Well, with next season’s schedule and the current rosters, we might be able to predict this physicality distribution based on the enforcement of the players on each team for each game.  Perhaps teams with higher enforcement correlate with more physical games.  Or, maybe a game tends to have higher physicality when one team has much more enforcement than the other.  And with historical data from past seasons, there should be some opportunity for verification.  And maybe even neat maps.

A few posts back, I wrote the following as a rule for detecting enforcement events:

An event is an enforcement if it is an Enforcer Action taking place within 2 minutes after a Violent Action against one of the teammates of the Enforcer.

Since then, a few parts of this rule have been bugging me:

  • what’s an enforcer action?
  • why two minutes?
  • what’s a violent action?

So I’m going to take a step back and look at this again.  I’m defining enforcement event to be a manifestation of an enfoo quantity (like gravitational attraction would be for a mass quantity).  If these manifestations are what makes somebody an enforcer, then it may be that a pattern amongst these events can be extracted that determines an ordering of players sufficiently close to the original crude orderings of enforcers. So,

  1. hockey players can be crudely ordered according to their capabilities as an enforcer.
  2. I dub this capability enfoo.
  3. I hypothesize that enfoo manifests in certain ways that can be analyzed to yield an ordering of players according to their degree of enfoo.
    • That is, I hypothesize that enfoo can be measured via enfoo manifestations, which I call enforcements
  4. A scale for measuring enfoo via enforcements is defined, and is called the barce scale.
  5. Barce measurements are made on NHL play-by-play data, and the resulting player orderings are compared to crude orderings.
    • If the barce orderings track the crude orderings closely, then the barce scale captures the intuitions of the creators of the crude orderings (i.e. experts).
    • If the barce and crude orderings are very different, then all is not lost.  If the semantics of the barce scale are interesting, then the scale provides a new perspective on the concept of enforcement.

This all tracks the 1968 work of Brian Ellis quite well (this would be a fundamental derived measurement in his theory).  However, not everything rests on solid ground here.  Ellis didn’t really discuss social quantities (like enfoo), instead concentrating on physical quantities like mass, pressure, and electrical resistance.  Thus, the idea of looking at events (manifestations of enfoo) brings with it some difficulties.

One difficulty is that the units of the barce scale are quite unlike grams and meters.  When defining the meter, experts talk about how far light travels in a vacuum in a very short period of time.  The kilogram is based on an actual object known as the IPK (International Prototype Kilogram).

The barce scale introduces two kinds of difficulties, which I alluded to at the top of this post: subjectivity and causal influence.

The events of a hockey game, such as penalties and goals, are only identified by certain officials on the ice and in the booth.  It doesn’t matter whether you or I think a hit was a cross-check, even if the officials for that game agree with us the next day – if an official doesn’t identify the hit as a cross-check (thus penalizing the hitter) during the game, then it’s not a cross-check.  The events of a hockey game, then, cannot be determined without knowing the officials’ in-game decisions.  Fortunately, these decisions are themselves objectively verifiable due to their documentation.

A more worrisome difficulty is the detection of causal influence.  Enforcements are defined as events that are in some sense caused by a provocation by a member of the other team.  For example, we might consider the following causal influences:

  1. An enforcement would not have occurred had it not been for the provocation.
  2. A provocation increased the probability of an enforcement occurring.
  3. A provocation inclined a player to perform an enforcement.

An added difficulty builds on the problem with subjectivity mentioned above: a provocation may not be noticed by an official, but nevertheless be noticed by a player who subsequently goes after the provoker.

I think one way to resolve this is to take a page out of moral theory.  Let’s say that an enforcer is a player who goes after players on the other team who do something provocative.  Let’s further say that there are good definitions for what it means to go after somebody and to be provocative in the context of a hockey game.  Then we expect that a good enforcer should go after opponents who are being provocative.  In other words, the more a player retaliates, the better he is at enforcing.

The form in use here looks like this:


"quantity Q manifests as event E"
(Q manifests-as E)

"things with Q should perform events of type E"
(x has Q) => should(x do E)

"all things being equal, x has more Q having performed e"
Q(x|e) > Q(x|~e) c.p.

It’s pseudo-logic, for sure, but hopefully the intent is somewhat clear.  Looking at that pseudo-logic, there’s nothing that prevents a quantity from manifesting in more than one way.  And that’s a good thing, because good enforcers do more than retaliate – they also deter provocations.  If a player has a reputation for retaliating harshly, then opponents may be less likely to provoke.  This result is even better for the enforcer’s team, since an enforcer’s retaliations sometimes land him in the penalty box.  Thus, we can say that a good enforcer plays in games where the opposing side doesn’t often provoke.  In other words, the less an opposing team provokes, the better an enforcer is doing.

I updated the analysis file with a new query to reflect this new bi-valued barce scale:

prefix nhl: <http://www.nhl.com/>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select ?name (sum(?enfActs) as ?enf)  (sum(?totChances) as ?opp)
where { ?x a nhl:Enforcement . ?x nhl:actor ?player .
  ?x nhl:value ?enfActs . ?x nhl:game ?game .
  ?y a nhl:ViolentPenaltyTally . ?y nhl:game ?game .
  ?y nhl:team ?otherTeam . ?y nhl:value ?totChances .
  ?z a nhl:GameRoster . ?z nhl:game ?game . ?z nhl:player ?player .
  ?z nhl:team ?team . ?player rdfs:label ?name .
  filter( ?team != ?otherTeam) }
group by ?name

The retaliation value is ?enf and the deterrent value is ?opp.  I haven’t gotten around to changing the variable names to something more suitable, so don’t read too much into them yet.

I also haven’t changed the definitions of nhl:Enforcement, nhl:ViolentPenalty, or nhl:EnforcerAction yet, so those are still on the docket for review.  I also opted to skip the Sparql Construct I used for the other two barce formulations and just spit the measurement out into a result set (which is interpreted as an Incanter dataset in seabass).  For purposes of ranking, the two values for this barce measurement can be multiplied (the inverse of the deterrent value has to be used, since enfoo is inversely proportional to the number of provocations).
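The multiplication I have in mind for ranking looks like this in Python (made-up player names and numbers):

```python
# Combine the two barce values as described: retaliation (?enf) times the
# inverse of the deterrent value (?opp), since enfoo is taken to be
# inversely proportional to the provocations faced. Numbers are made up.
players = {
    "Player A": {"enf": 6, "opp": 12},
    "Player B": {"enf": 4, "opp": 4},
}

scores = {name: v["enf"] * (1 / v["opp"]) for name, v in players.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Note how Player B ranks higher despite fewer retaliations, because his games saw far fewer provocations.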

Maybe I’ll get to some charts and analysis tomorrow, if the theory bug doesn’t bite me again.

My last post had some definitions to frame this analysis of enforcement (or, I suppose, the measurement of enfoo), so I may as well go ahead and formalize them.

Enforcement

prefix nhl: <http://www.nhl.com/>
construct { _:x a nhl:Enforcement . _:x nhl:game ?g .
            _:x nhl:actor ?p1 . _:x nhl:value ?value }
where {
  select ?g ?p1 (count(?e) as ?value)
  where {
    ?g nhl:play ?e . ?g nhl:play ?e2 .
    ?e nhl:agent1 ?p1 . ?e nhl:agent2 ?p2 .
    ?e2 nhl:agent1 ?p2 . ?e2 a nhl:ViolentPenalty .
    ?e a nhl:EnforcerAction . ?e2 nhl:influences ?e }
  group by ?g ?p1 }

The enforcement facts tally up the number of retaliations performed by a player during a game.  The definition of an enforcement fact on a game-by-game basis (as opposed to period-by-period or minute-by-minute) is somewhat arbitrary.  However, there’s a lot of convenience here, since it’s easy to argue that hockey is most naturally partitioned by games.  If another temporal basis made sense, here is where the first definition adjustment would be made.

I should note that there’s been a slight departure from the previous post’s definition of Enforcer Action.  Previously, I only included a few kinds of penalties to count as enforcer actions (fighting, cross-checking, roughing, and unsportsmanlike conduct).  I went ahead and included hits, since it made sense as I was typing out the Construct queries – after all, enforcers that get away with a good retaliatory hit are doing their job better, right?

influences

[rule1: (?e1 nhl:influences ?e2) <-
    (?e1 nhl:localtime ?t1), (?e2 nhl:localtime ?t2),
     le(?t1, ?t2), diff-minute(?t1, ?t2, ?diff),
     lessThan(?diff, 2)  ]

I defined the relation influences to tie events together when one takes place two minutes before another.  I couldn’t think of much else in the dataset to represent when one event is a significant cause for another, and a two minute time frame should limit the number of retaliations against a player.  Of course, it’s still an arbitrary time limit, and it’d be interesting to explore other ways of identifying causal influence.
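The rule’s logic is simple enough to restate in Python – assuming, as a simplification, that local times come in as game-clock “MM:SS” strings:

```python
def influences(t1, t2, window_minutes=2):
    """True when the event at t1 happens at or before t2 and within the
    window, echoing the le/diff-minute/lessThan chain in the Jena rule.
    The 'MM:SS' time format is a simplifying assumption."""
    def seconds(t):
        m, s = t.split(":")
        return int(m) * 60 + int(s)
    return (seconds(t1) <= seconds(t2)
            and seconds(t2) - seconds(t1) < window_minutes * 60)
```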

The rest of the details can be found here on GitHub.  As an aside, one thing I’ve learned is to keep repository names short and sweet; this repository is just one huge hyphenated mess.

If you take a gander at the analysis.clj file, you’ll see that I changed up the manner in which the models are built up, favoring a map-reduce strategy.  I played around with pmap to see if I could get a bump out of parallelization, but ran into threading problems with the reasoner.  I’m definitely not a parallelization kind of guy, so I stuck with what worked for me 🙂

There are two scales defined in the analysis file:

barce

prefix nhl: <http://www.nhl.com/>
construct { ?player nhl:barce ?value }
where {
  select ?player (sum(?enfActs) as ?value)
  where {
    ?x a nhl:Enforcement . ?x nhl:actor ?player .
    ?x nhl:value ?enfActs  }
  group by ?player }

barce-opp

prefix nhl: <http://www.nhl.com/>
construct { ?player nhl:barce-opp ?value }
where {
  select ?player (sum(?enfActs/ ?totChances) as ?value)
  where {
    ?x a nhl:Enforcement . ?x nhl:actor ?player .
    ?x nhl:value ?enfActs . ?x nhl:game ?game .
    ?y a nhl:ViolentPenaltyTally . ?y nhl:game ?game .
    ?y nhl:team ?otherTeam . ?y nhl:value ?totChances .
    ?z a nhl:GameRoster . ?z nhl:game ?game .
    ?z nhl:player ?player . ?z nhl:team ?team .
    filter( ?team != ?otherTeam) }
  group by ?player }

The barce scale is pretty straightforward – count up all the enforcement facts for a player over all the games in the model you’re analyzing.  The advantage is in its simplicity, both for query performance and for clarity – it’s fairly easy to explain the meaning of barce measurements to your friends.

The barce-opportunity scale is a bit more complicated.  In earlier posts, I referred to a normalization of the scale, and had this in mind.  Problem is, this isn’t normalization at all.  Rather, for any particular game and player, the barce-opportunity of that player in that game is the percentage of times that the other team did something dodgy and the enforcer retaliated.  So if the opposing team provoked the enforcer’s team four times, and the enforcer retaliated twice, his barce-opportunity for that game is one-half.

For an overall barce-opportunity measure across games, I just added them up.  The problem is that measurements on the scale can no longer be understood as percentages (since you can’t just add percentages willy-nilly).  I suppose I could average them to keep the values between zero and one, which would have the advantage of a clear interpretation for measurements across games:

prefix nhl: <http://www.nhl.com/>
construct { ?player nhl:barce-opp ?value }
where {
  select ?player (avg(?enfActs/ ?totChances) as ?value)
  where {
    ?x a nhl:Enforcement . ?x nhl:actor ?player .
    ?x nhl:value ?enfActs . ?x nhl:game ?game .
    ?y a nhl:ViolentPenaltyTally . ?y nhl:game ?game .
    ?y nhl:team ?otherTeam . ?y nhl:value ?totChances .
    ?z a nhl:GameRoster . ?z nhl:game ?game .
    ?z nhl:player ?player . ?z nhl:team ?team .
    filter( ?team != ?otherTeam) }
  group by ?player }
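To see the sum-versus-average trade-off in concrete terms, here’s a quick Python sketch with made-up per-game numbers:

```python
# Per-game (retaliations, provocations) for one player; numbers made up.
games = [(2, 4), (1, 2), (3, 3)]

ratios = [enf / chances for (enf, chances) in games]

summed = sum(ratios)                   # current scale: can exceed 1
averaged = sum(ratios) / len(ratios)   # stays in [0, 1], reads as a percentage
```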

I’ll have to think on that a bit more.  For now, I’m fairly content with the definitions, and will next see what kinds of charts and patterns fall out of all this.

Before returning to the analysis of the NHL play-by-play data I’ve been looking at, here are some definitions to frame the project:

Enfoo:  The disposition to respond to aggressive play through retaliatory checking and fighting.
Enforcer: A person considered to have a high degree of Enfoo.
Enforcement: The kind of event that is the realization of an Enfoo.

Because no good names came to mind for the (dispositional) quantity under consideration, I chose a silly one.  Despite this, I think there’s sufficient reason to think that a crude ordering could be created by a group of hockey fans.  For a physical comparison, these three definitions are similar to the following:

Weight: The gravitational force on a body.
Heavy Person: A person considered to weigh a lot.
Gravitational Attraction: The kind of event where two objects attract one another due to their mass and mutual distance.

Dispositions can be tricky things to identify, so we’ll assert the following detection rule:

Detection Rule: An event is an enforcement if it is an Enforcer Action taking place within 2 minutes after a Violent Action against one of the teammates of the Enforcer.

The following definitions round out the necessary machinery:

Enforcer Action: An action penalized as fighting, cross-checking, roughing, or unsportsmanlike conduct.
Violent Action: An action penalized as charging, high-sticking, roughing, slashing, or unsportsmanlike conduct.

The goal of this analysis is to come up with a formal definition of enfoo and an appropriate measurement scale that can aid future analyses.  It would be nice if the order generated by the detection rule above was similar to rankings of enforcers by hockey analysts, but this is not necessary.  If this were a material definition of enfoo, then the goal would be to uncover the intended semantics of a group of hockey analysts that rank enforcers.  Instead, this formal definition of the disposition only depends on the semantics of the terms in its definitions (like Violent Action, Player, and Minutes).

Finally, a scale for measuring enfoo needs to be defined:

Barce: A scale for measuring the amount of enfoo a player had during a period of time.  A player has one barce when he is the agent of an enforcement once during the set period of time.  A player has (x+1) barce when he is the agent of (x+1) events of enforcement during the set period of time.

The barce scale unsurprisingly jibes with the detection rule asserted above – to measure the enfoo of a player during a game on the barce scale, the enforcement events he is penalized for are tallied.  If a normalized barce scale is desired, then the tally for the game should be divided by the total possible number of enforcement events: the number of violent actions that the other team is penalized for.

I just added some built-ins for Jena rules in seabass that’ll let me figure out how many seconds/minutes/etc. apart two times/dates/datetimes are, which lets me write a new definition for Enforcement that tracks the Wikipedia definition a bit more closely:

[rule1: (?e1 nhl:influences ?e2) <-
        (?e1 nhl:localtime ?t1), (?e2 nhl:localtime ?t2),
        le(?t1, ?t2), diff-minute(?t1, ?t2, ?diff),
        lessThan(?diff, 5)  ]
nhl:Penalty	rdfs:subClassOf nhl:EnforcerAction .
nhl:Hit	rdfs:subClassOf nhl:EnforcerAction .
construct { _:x a nhl:Enforcement . _:x nhl:game ?g .
            _:x nhl:actor ?p1 . _:x nhl:value ?value }
where {
  select ?g ?p1 (count(?e) as ?value)
  where {
    ?g nhl:play ?e . ?g nhl:play ?e2 .
    ?e nhl:agent1 ?p1 . ?e nhl:agent2 ?p2 .
    ?e2 nhl:agent1 ?p2 . ?e2 a nhl:ViolentPenalty .
    ?e a nhl:EnforcerAction . ?e2 nhl:influences ?e }
  group by ?g ?p1 }

Essentially, I stipulated that an earlier event influences a later event if they take place within five minutes. Best rule ever? Nope. Good enough? Maybe! So we have incidents of enforcement in a game when a Violent Penalty by a player influences an Enforcer Action against that player. Let’s see how this compares to the previous two definitions of enforcement:

The normalization is all off, largely because it's getting late.  I should really normalize this score by the number of opportunities for enforcement – that is, all the Violent Penalties committed against the enforcer's team.
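A sketch of that denominator – counting, per game and team, the Violent Penalties whose victim is on that team – might look like this (the assumption that each player carries an nhl:team fact is mine; it isn't in the data yet):

```sparql
prefix nhl: <http://www.nhl.com/>
select ?g ?team (count(?e2) as ?opportunities)
where {
  ?g nhl:play ?e2 . ?e2 a nhl:ViolentPenalty .
  # ?victim is the player on the receiving end of the violent penalty
  ?e2 nhl:agent2 ?victim . ?victim nhl:team ?team . }
group by ?g ?team
```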

This dataset is drawn from the first 20 games of the 1230-game season, so this certainly isn’t a representative sample.  The bad news is that 20 games of data is just about as much as my laptop will crunch before I get impatient.  The good news is that I can re-jigger the analysis process to pull enforcement facts out of the season at 10-game increments…sort of like a map-reduce, I guess.  Should be fun!

In my last post I suggested a graph pattern, or detection rule, to measure the enforcement that a hockey player is bringing to his team.  I’m going to modify the rule slightly, to track the wikipedia definition more closely:

prefix nhl: <http://www.nhl.com/>
construct { _:x a nhl:Enforcement . _:x nhl:game ?g .
            _:x nhl:actor ?p1 . _:x nhl:value ?value }
where {
  select ?g ?p1 (sum(?minutes) as ?value)
  where {
    ?g nhl:play ?e . ?g nhl:play ?e2 .
    ?e a nhl:EnforcerPenalty . ?e nhl:agent1 ?p1 . ?e nhl:agent2 ?p2 .
    ?e a ?class . ?class nhl:penaltyMinutes ?minutes .
    ?e2 nhl:agent1 ?p2 . ?e2 a nhl:ViolentPenalty .  }
  group by ?g ?p1 }

While this captures the spirit of the wikipedia definition, it does depart from the apparent semantics by substituting ‘enforcer penalties’ for hits.  That is, a player is acting like an enforcer when he gets penalized for certain activities in retaliation for violent penalties against his team.  So an alternative detection rule might be a better fit:

prefix nhl: <http://www.nhl.com/>
construct { _:x a nhl:Enforcement . _:x nhl:game ?g .
            _:x nhl:actor ?p1 . _:x nhl:value ?value }
where {
  select ?g ?p1 (count(?e) as ?value)
  where {
    ?g nhl:play ?e . ?g nhl:play ?e2 .
    ?e a nhl:Hit . ?e nhl:agent1 ?p1 . ?e nhl:agent2 ?p2 .
    ?e2 nhl:agent1 ?p2 . ?e2 a nhl:ViolentPenalty .  }
  group by ?g ?p1 }

Neither rule reflects an arguably crucial facet of the definition, namely the temporal ordering of events: the enforcer has to retaliate after a violent penalty (and most likely, within a certain time interval).  I’m going to leave the temporal ordering aside for now, just to compare the results of the detection rules thus far.
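When the time comes, that constraint could be bolted onto either rule with a filter over event times, along these lines (assuming each play carries an nhl:localtime fact, as in the Jena rule elsewhere on this blog):

```sparql
?e  nhl:localtime ?t1 .
?e2 nhl:localtime ?t2 .
filter( ?t2 < ?t1 )   # the violent penalty ?e2 precedes the retaliation ?e
```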

First, some ontology needs to be written to define what is meant by an enforcer penalty and a violent penalty.  I could consult a subject matter expert (i.e. Wikipedia).  However, I’ll settle for an arbitrary definition of these classes, as I’m still rather early in the analysis and can return to this later.  Here’s the ontology so far (in Turtle syntax, omitting the prefix declarations):

nhl:EnforcerPenalty rdfs:subClassOf nhl:Penalty .
  nhl:CrossChecking    rdfs:subClassOf nhl:EnforcerPenalty .
  nhl:Fight            rdfs:subClassOf nhl:EnforcerPenalty .
  nhl:FightingMaj      rdfs:subClassOf nhl:EnforcerPenalty .
  nhl:Roughing         rdfs:subClassOf nhl:EnforcerPenalty .
  nhl:Unsportsmanlike  rdfs:subClassOf nhl:EnforcerPenalty .

nhl:ViolentPenalty rdfs:subClassOf nhl:Penalty .
  nhl:Charging         rdfs:subClassOf nhl:ViolentPenalty .
  nhl:HighSticking     rdfs:subClassOf nhl:ViolentPenalty .
  nhl:Roughing         rdfs:subClassOf nhl:ViolentPenalty .
  nhl:Slashing         rdfs:subClassOf nhl:ViolentPenalty .
  nhl:Unsportsmanlike  rdfs:subClassOf nhl:ViolentPenalty .

nhl:Charging                 nhl:penaltyMinutes	2 .
nhl:CrossChecking            nhl:penaltyMinutes	2 .
nhl:Fight                    nhl:penaltyMinutes 5 .
nhl:FightingMaj              nhl:penaltyMinutes	10 .
nhl:HighSticking             nhl:penaltyMinutes	2 .
nhl:Roughing                 nhl:penaltyMinutes	2 .
nhl:Slashing                 nhl:penaltyMinutes	2 .
nhl:Unsportsmanlike          nhl:penaltyMinutes	2 .

I’m leaving out a few penalties, and definitely want to re-consider the use of these definitions in the detection rules – there’s a bit too much arbitrariness to these definitions.  Nonetheless, let’s wrap these up and see what kind of results we get.  The following queries will be used to define the target dataset:

prefix nhl: <http://www.nhl.com/>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select ?name (?enfMinutes / ?totMinutes as ?value)
where { ?x a nhl:Enforcement . ?x nhl:actor ?player .
        ?x nhl:value ?enfMinutes . ?x nhl:game ?game . 
        ?game nhl:totalPenaltyMinutes ?totMinutes .
        ?player rdfs:label ?name }
order by desc(?value)

prefix nhl: <http://www.nhl.com/>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select ?name (?enfHits/ ?totHits as ?value)
where { ?x a nhl:Enforcement . ?x nhl:actor ?player . 
        ?x nhl:value ?enfHits . ?x nhl:game ?game . 
        ?game nhl:totalHits ?totHits . ?player rdfs:label ?name }
order by desc(?value)

These queries attempt to normalize the enforcement in each game by dividing the enforcement score (the sum of enforcer penalty minutes, or the number of enforcer hits) by a game-wide total (total penalty minutes in the first query, total hits in the second).  In the interest of time, I’m only running this over the first 10 games of the season.  The REPL activity using seabass and this code is:

(view (bounce enforcement1 m1))
(view (bounce enforcement2 m2))

The RDF models are m1 and m2, corresponding to the two different definitions of enforcement.  Each enforcement query is a Sparql Select statement ‘bounced’ against its model using the seabass bounce function (with Incanter’s view function popping up some fancy Swing windows).  I’ll hold off on an analysis until I refine those detection rules, but putting the results side-by-side shows some overlap – though the top five results on the two lists are disjoint.  So it looks like there’s a significant semantic divergence between the two detection rules.  I wonder whether adding the temporal aspect to those rules would line them up.

I like hockey, but haven’t followed it in years; it seems that the Sabres are doing pretty well, and the Rangers stink. Frick – looks like the Flyers beat Buffalo out of the first round. Ugh.

Point being, I’m certainly not a subject matter expert in hockey. Fortunately, the internet is pretty smart about such things, so I can ask it for a definition of Enforcer:

An enforcer’s job is to deter and respond to dirty or violent play by the opposition. When such play occurs, the enforcer is expected to respond aggressively, by fighting or checking the offender. Enforcers are expected to react particularly harshly to violence against star players or goalies. (wikipedia)

So what I’d like to do is figure out a good way to measure how much enforcement a player is bringing to his team at any point in the season.  Apparently, I’m looking for people who pick fights with or hit people who do violent things to their teammates.

Although I could take this as grounds for a definition of the class of enforcers in an ontology, I’m not sure a class would do me much good.  There’s nothing in the expert definition that looks like either a necessary or sufficient condition.  What we have instead is a behavior to help us look for enforcers – a detection rule.

Parsing the NHL dataset from the last post and running it through a script to generate RDF yields a data model whose skeleton looks like this:

  • game
    • play
      • hit
      • shot
      • penalty
      • goal
      • fight

As a first stab at a detection rule, then, this Sparql query might do the trick:

prefix nhl: <http://www.nhl.com/>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select ?p1 ?name (sum(?minutes) as ?score)
where { ?e a nhl:EnforcerPenalty . ?e nhl:agent1 ?p1 .
        ?e nhl:agent2 ?p2 . ?e2 nhl:agent1 ?p2 . 
        ?e2 a nhl:ViolentPenalty . ?p1 rdfs:label ?name .
        ?e a ?class . ?class nhl:penaltyMinutes ?minutes .}
group by ?p1 ?name

Essentially, somebody is acting like an enforcer when they commit an enforcer penalty against a player who committed a violent penalty against one of the enforcer’s teammates.  The quantitative aspect of the measure is conveyed by the minutes assessed by the enforcer’s penalty, so a guy who gets into a fight to stand up for a teammate is more of an enforcer than a guy who just gets called for roughing.

Yeah, it could use some work, but it’d be interesting to see what gets pulled out of the model by this query.  A bit of ontology will also be needed to define what enforcer and violent penalties are, as well as assign penalty minutes to each penalty class (since they don’t appear in the data set).
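For instance, something along these lines might do as a first cut – the class memberships and minute values here are placeholders, not considered judgments:

```turtle
nhl:EnforcerPenalty  rdfs:subClassOf  nhl:Penalty .
nhl:Fight            rdfs:subClassOf  nhl:EnforcerPenalty .
nhl:Roughing         rdfs:subClassOf  nhl:EnforcerPenalty .

nhl:ViolentPenalty   rdfs:subClassOf  nhl:Penalty .
nhl:Slashing         rdfs:subClassOf  nhl:ViolentPenalty .

nhl:Fight            nhl:penaltyMinutes  5 .
nhl:Roughing         nhl:penaltyMinutes  2 .
nhl:Slashing         nhl:penaltyMinutes  2 .
```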

I’ll throw up the NHL project on Github tomorrow if you want to follow along, and run through some of the preliminary results from this definition of enforcement.  After that, I want to take a good look at the basis for calling this a measure at all.