
And by best, I mean the most physical as measured by hits.  In a previous post, I thought it would be interesting to predict the most physical games in a season based on the previous actions of the players.  So a good first step is to measure each game by its physicality and save off the data.

Unfortunately, the raw data I’m working with doesn’t provide identifiers for the games themselves.  People, teams, and game events (like hits) get IDs, but the games end up with a blank node.  That kind of ruins the whole ‘list the games by physicality’ idea, so on to plan B: pairings.

Since teams get identifiers in the dataset, I could get the home and visiting teams for each game.  Most pairs of teams meet more than twice in a season, so the pairing “New York Rangers at Buffalo Sabres” will include two games, as will “Buffalo Sabres at New York Rangers”.  Not all pairings are symmetric, though – the Rangers hosted the LA Kings for the only match-up between those teams in 2010-11.
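The pairing idea amounts to counting ordered (visitor, host) pairs, which a quick REPL sketch makes concrete (team abbreviations and counts here are made up for illustration):

```clojure
;; A pairing is an ordered [visitor host] pair, so "Rangers at Sabres"
;; and "Sabres at Rangers" are counted separately.
(frequencies [["NYR" "BUF"] ["NYR" "BUF"] ["BUF" "NYR"] ["LAK" "NYR"]])
;; => {["NYR" "BUF"] 2, ["BUF" "NYR"] 1, ["LAK" "NYR"] 1}
```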

The game data is broken up into 1230 n-triples files, each corresponding to a game.  I ended up naming them file-1.nt, file-2.nt, etc.  This is convenient for defining functions that build an RDF model from the games numbered N through M (e.g. file-23 through file-100):

(defn get-data [n m] (map #(str "data/file-" % ".nt") (range n (+ 1 m))))
(defn get-model [n m] (apply build (get-data n m)))
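Since get-data is a pure function, it’s easy to check at the REPL what it generates (re-defined here so the snippet stands alone):

```clojure
(defn get-data [n m]
  (map #(str "data/file-" % ".nt") (range n (+ 1 m))))

(get-data 23 25)
;; => ("data/file-23.nt" "data/file-24.nt" "data/file-25.nt")
```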

Since physicality is based on the hits recorded in each game, a good place to start is a construct query I called ‘game-hits’.

(def game-hits "
  prefix nhl: <>
  construct { _:z a nhl:GameHits . _:z nhl:hometeam ?x .
	      _:z nhl:awayteam ?y . _:z nhl:game ?g . _:z nhl:value ?v }
	{ select ?x ?y ?g (count(?h) as ?v)
	  {  ?g nhl:hometeam ?x . ?g nhl:awayteam ?y .
	     ?g nhl:play ?h . ?h a nhl:Hit . filter(?x != ?y)}
	  group by ?x ?y ?g }")

This construct has to be pulled from each file of game data I have for the 2010-11 season, so I next write two short functions to run through each game file and execute the construct, iteratively building a model of just the facts I need.  This way, I keep the model size down.

(defn build-game-model [s]
  (let [n1 (get-model (first s) (last s))]
    (build (pull game-hits n1))))

(defn build-model [n m binsize]
  (reduce build (map build-game-model (partition-all binsize (range n (+ m 1))))))
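The binsize argument controls the batching; partition-all (unlike partition) keeps the final short bin, so no games get dropped.  For example, splitting twelve game numbers into bins of five:

```clojure
(partition-all 5 (range 1 13))
;; => ((1 2 3 4 5) (6 7 8 9 10) (11 12))
```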

Finally, I’m ready to save off my first batch of summary data: the number of hits for each home-away team pairing, and the number of games that constitute that pairing.

(def teams-hits "
prefix nhl: <>
construct { _:z a nhl:HitsSummary . _:z nhl:hometeam ?x .
	    _:z nhl:awayteam ?y . _:z nhl:games ?games . 
        _:z nhl:value ?value }
	{ select ?x ?y (count(?g) as ?games) (sum(?v) as ?value)
	  { ?z a nhl:GameHits . ?z nhl:hometeam ?x .
	    ?z nhl:awayteam ?y . ?z nhl:game ?g . ?z nhl:value ?v . 
        filter(?x != ?y)}
	group by ?x ?y }")

I’ve described how I pulled geo-coordinates for each team in a previous post, so I’ll skip the details and note the final construct query that joins the DBpedia data with my raw NHL data:

(def city-coords "
prefix nhl: <>
prefix dbo: <>
prefix dbp: <>
prefix rdfs: <>
construct { ?x nhl:latitude ?lat . ?x nhl:longitude ?lon . 
            ?x nhl:cityname ?cityname . ?x nhl:name ?a }
{ ?x a nhl:Team . ?y a dbo:HockeyTeam .
  ?x rdfs:label ?a . ?y rdfs:label ?b . filter(str(?a) = str(?b))
  ?y nhl:latitude ?lat . ?y nhl:longitude ?lon . ?y dbp:city ?cityname }")

Finally, I have a select query that defines the information I want for my physicality scale (with geo-coordinates thrown in for the home team for kicks):

(def hits-summary-names "
prefix nhl: <>
select ?visitor ?home (?v / ?g as ?avg) ?lat ?lon
{ ?z a nhl:HitsSummary . ?z nhl:hometeam ?x . 
  ?x nhl:latitude ?lat . ?x nhl:longitude ?lon .
  ?z nhl:awayteam ?y . ?z nhl:games ?g . ?z nhl:value ?v . 
  filter(?x != ?y) .
  ?x nhl:name ?home . ?y nhl:name ?visitor }
order by desc(?avg)")
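The ?avg column is just total hits divided by games played for the pairing; a check on hypothetical summary rows (abbreviations and counts made up):

```clojure
(def rows [{:home "BUF" :visitor "NYR" :games 2 :value 71}
           {:home "NYR" :visitor "LAK" :games 1 :value 30}])

;; average hits per game for each pairing
(map #(double (/ (:value %) (:games %))) rows)
;; => (35.5 30.0)
```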

At this point I’m ready to run my program, which consists of just a few lines of code:

(stash (pull teams-hits (build-model 1 1230 5)) "files/teams-hits.nt")
(stash (pull team-city-constr dbp) "files/team-city.nt")
(stash (pull city-coords 
          (build "files/geo-facts.nt" "files/team-city.nt" 
                 "files/ontology.ttl" "files/nhl.rules" (get-model 1 35)))
       "files/city-coords.nt")
(def m (build "files/city-coords.nt" "files/teams-hits.nt"))
(view (histogram :avg :data (bounce hits-summary-names m) :nbins 5))
(view (bounce hits-summary-names m))

The distribution of physicality per pairing:

The top pairings by physicality:

The top teams by physicality:

The full code can be found on GitHub.  The next step is to figure out what factors can be used to predict this physicality distribution.  Enforcement will be one of them, but it would help to have a slew of others to mix in.  At this point, I’m thinking a principal components analysis would be interesting to run.


I was all ready to grab some geo-data and see where I could find a good physical game of hockey.  I’d browsed dbpedia and saw how to connect NHL teams, the cities they play in, and the lat-long coordinates of each.  I was so optimistic.

Then it turns out that DBpedia wanted me to do a special dance.  I’ve been quite happy with the features of the upcoming Sparql 1.1 spec.  Since ARQ stays on top of the spec, I’ve managed to forget what Sparql 1.0 was missing.  Well, ‘if’ clauses for one, but I managed to design around that in my last post.  A real sticking point, though, was the inability to wrap a construct query around a select query, like so:

prefix rdfs: <>
prefix dbo: <>
prefix dbp: <>
prefix nhl: <>
construct { ?team nhl:latitude ?v . ?team nhl:name ?name }
{	select ?team ?name ( 1 * (?d + (((?m * 60) + ?s) / 3600.0)) as ?v)
	{ 	  ?team a dbo:HockeyTeam . ?team rdfs:label ?name .
		  ?team dbp:city ?cityname . ?city rdfs:label ?cityname .
		  ?city dbp:latd ?d; dbp:latm ?m; dbp:lats ?s .
		  filter ( lang(?name) = 'en') }}

The reason this is critical is that you can’t inject those arithmetic expressions into a construct clause.  And since I plan on working with the resulting data using Sparql, simply using select queries isn’t going to do it.
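That arithmetic is just a degrees-minutes-seconds conversion.  As a plain Clojure sketch (not part of the pipeline, just the same formula as the select clause above):

```clojure
;; decimal degrees = d + ((m * 60 + s) / 3600.0)
(defn dms->decimal [d m s]
  (+ d (/ (+ (* m 60) s) 3600.0)))

(dms->decimal 40 46 40)   ;; roughly 40.7778
```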

Thus, we need to break down the steps a bit more finely.  First, I’ll pull out the basic triples I intend to work with:

prefix rdfs: <>
prefix dbo: <>
prefix dbp: <>
prefix nhl: <>
construct { ?team a dbo:HockeyTeam . ?team rdfs:label ?name .
                ?team dbp:city ?cityname . ?city rdfs:label ?cityname .
                ?city dbp:latd ?d; dbp:latm ?m; dbp:lats ?s . }
   {   ?team a dbo:HockeyTeam . ?team rdfs:label ?name .
        ?team dbp:city ?cityname . ?city rdfs:label ?cityname .
        ?city dbp:latd ?d; dbp:latm ?m; dbp:lats ?s .
        filter ( lang(?name) = 'en') }

And crap – the data doesn’t fit very well.  Looks like the city names associated with hockey teams don’t cleanly match up to the cities we’re looking for in DBpedia.  Time for a second refactor…

After a few minutes of staring at the ceiling, I realized that I could use Google’s geocoding service to do my bidding.  Since their daily limit is 2500 requests, my measly 50ish cities would be well under.  So first, I grab just the info I need out of DBpedia – hockey teams and the cities they’re associated with:

 prefix rdfs: <>
 prefix dbo: <>
 prefix dbp: <>
 prefix nhl: <>
 construct { ?team a dbo:HockeyTeam . ?team rdfs:label ?name .
                 ?team dbp:city ?cityname . ?city rdfs:label ?cityname . }
      { ?team a dbo:HockeyTeam . ?team rdfs:label ?name .
         ?team dbp:city ?cityname . ?city rdfs:label ?cityname .
         filter ( lang(?name) = 'en') }

And use this query with a bit of Clojure to pull out my geocoding facts, saving them off as an n-triples file:

prefix rdfs: <>
prefix dbo: <>
prefix dbp: <>
prefix nhl: <>
select distinct ?team ?name ?city ?cityname
{ ?team a dbo:HockeyTeam . ?team rdfs:label ?name .
  ?team dbp:city ?cityname . ?city rdfs:label ?cityname .
  filter ( lang(?name) = 'en') }
(defn get-geo-fact [row]
  (let [n   (string/replace (:cityname row) " " "+")
        ;; Google's geocoding endpoint of the time (sensor param was required)
        x   (json/read-json (slurp (str "http://maps.googleapis.com/maps/api/geocode/json?sensor=false&address=" n)))
        g   (:location (:geometry (first (:results x))))
        lat (str "<" (:city row) ">" 
                 " <> " 
                 (:lat g) " ." )
        lon (str "<" (:city row) ">" 
                 " <> " 
                 (:lng g) " ." )]
    [lat lon]))

(defn make-geo-facts []
  (let [a (bounce team-city dbp)
        f "files/geo-facts.nt"]
    (spit f (string/join "\n" (flatten (map get-geo-fact (:rows a)))))))
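The lines get-geo-fact emits are plain n-triples strings; this sketch shows their shape (the example.org URIs are placeholders, not the real nhl: namespace):

```clojure
;; build one n-triples line: <subject> <predicate> value .
(defn geo-triple [subject-uri predicate-uri value]
  (str "<" subject-uri "> <" predicate-uri "> " value " ."))

(geo-triple "http://example.org/city/Buffalo"
            "http://example.org/nhl/latitude"
            42.8864)
;; => "<http://example.org/city/Buffalo> <http://example.org/nhl/latitude> 42.8864 ."
```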

The results are created with the following two calls at the REPL:

(stash (pull team-city-constr dbp) "files/team-city.nt")
(make-geo-facts)

Now that I have geo-data, I can finish with hits-per-game as a rough cut at a physicality scale for games, and see where the action was this season.  I wonder if the Islanders and Rangers still go at it.

A few posts back, I wrote the following as a rule for detecting enforcement events:

An event is an enforcement if it is an Enforcer Action taking place within 2 minutes after a Violent Action against one of the teammates of the Enforcer.

Since then, a few parts of this rule have been bugging me:

  • what’s an enforcer action?
  • why two minutes?
  • what’s a violent action?

So I’m going to take a step back and look at this again.  I’m defining enforcement event to be a manifestation of an enfoo quantity (like gravitational attraction would be for a mass quantity).  If these manifestations are what makes somebody an enforcer, then it may be that a pattern amongst these events can be extracted that determines an ordering of players sufficiently close to the original crude orderings of enforcers. So,

  1. hockey players can be crudely ordered according to their capabilities as an enforcer.
  2. I dub this capability enfoo.
  3. I hypothesize that enfoo manifests in certain ways that can be analyzed to yield an ordering of players according to their degree of enfoo.
    • That is, I hypothesize that enfoo can be measured via enfoo manifestations, which I call enforcements
  4. A scale for measuring enfoo via enforcements is defined, and is called the barce scale.
  5. Barce measurements are made on NHL play-by-play data, and the resulting player orderings are compared to crude orderings.
    • If the barce orderings track the crude orderings closely, then the barce scale captures the intuitions of the creators of the crude orderings (i.e. experts).
    • If the barce and crude orderings are very different, then all is not lost.  If the semantics of the barce scale are interesting, then the scale provides a new perspective on the concept of enforcement.

This all tracks the 1968 work of Brian Ellis quite well (this would be a fundamental derived measurement in his theory).  However, not everything rests on solid ground here.  Ellis didn’t really discuss social quantities (like enfoo), instead concentrating on physical quantities like mass, pressure, and electrical resistance.  Thus, the idea of looking at events (manifestations of enfoo) brings with it some difficulties.

One difficulty is that the units of the barce scale are quite unlike grams and meters.  When defining the meter, experts talk about how far light travels in a vacuum in a very short period of time.  The kilogram is based on an actual object known as the IPK (International Prototype Kilogram).

The barce scale introduces two kinds of difficulties, which I alluded to at the top of this post: subjectivity and causal influence.

The events of a hockey game, such as penalties and goals, are only identified by certain officials on the ice and in the booth.  It doesn’t matter whether you or I think a hit was a cross-check, even if the officials for that game agree with us the next day – if an official doesn’t identify the hit as a cross-check (thus penalizing the hitter) during the game, then it’s not a cross-check.  The events of a hockey game, then, cannot be determined without knowing the officials’ in-game decisions.  Fortunately, these decisions are themselves objectively verifiable due to their documentation.

A more worrisome difficulty is the detection of causal influence.  Enforcements are defined as events that are in some sense caused by a provocation by a member of the other team.  For example, we might consider the following causal influences:

  1. An enforcement would not have occurred had it not been for the provocation.
  2. A provocation increased the probability of an enforcement occurring.
  3. A provocation inclined a player to perform an enforcement.

An added difficulty builds on the problem with subjectivity mentioned above: a provocation may not be noticed by an official, but nevertheless be noticed by a player who subsequently goes after the provoker.

I think one way to resolve this is to take a page out of moral theory.  Let’s say that an enforcer is a player who goes after players on the other team who do something provocative.  Let’s further say that there are good definitions for what it means to go after somebody and to be provocative in the context of a hockey game.  Then we expect that a good enforcer should go after opponents who are being provocative.  In other words, the more a player retaliates, the better he is at enforcing.

The form in use here looks like this:

"quantity Q manifest as event E"
(Q manifests-as E)

"things with Q should perform events of type E"
(x has Q) => should(x do E)

"all things being equal, x has more Q having performed e"
Q(x|e) > Q(x|~e) c.p.

It’s pseudo-logic, for sure, but hopefully the intent is somewhat clear.  Looking at that pseudo-logic, there’s nothing that prevents a quantity from manifesting in more than one way.  And that’s a good thing, because good enforcers do more than retaliate – they also deter provocations.  If a player has a reputation for retaliating harshly, then opponents may be less likely to provoke.  This result is even better for the enforcer’s team, since an enforcer’s retaliations sometimes land him in the penalty box.  Thus, we can say that a good enforcer plays in games where the opposing side doesn’t often provoke.  In other words, the less an opposing team provokes, the better an enforcer is doing.

I updated the analysis file with a new query to reflect this new bi-valued barce scale:

prefix nhl: <>
prefix rdfs: <>
select ?name (sum(?enfActs) as ?enf)  (sum(?totChances) as ?opp)
where { ?x a nhl:Enforcement . ?x nhl:actor ?player .
  ?x nhl:value ?enfActs . ?x nhl:game ?game .
  ?y a nhl:ViolentPenaltyTally . ?y nhl:game ?game .
  ?y nhl:team ?otherTeam . ?y nhl:value ?totChances .
  ?z a nhl:GameRoster . ?z nhl:game ?game . ?z nhl:player ?player .
  ?z nhl:team ?team . ?player rdfs:label ?name .
  filter( ?team != ?otherTeam) }
group by ?name

The retaliation value is ?enf and the deterrent value is ?opp.  I haven’t gotten around to changing the variable names to something more suitable, so don’t read too much into them yet.

I also haven’t changed the definitions of nhl:Enforcement, nhl:ViolentPenalty, or nhl:EnforcerAction yet, so those are still on the docket for review.  I also opted to skip the Sparql Construct I used for the other two barce formulations and just spit the measurement out into a result set (which is interpreted as an Incanter dataset in seabass).  For purposes of ranking, the two values for this barce measurement can be multiplied (the inverse of the deterrent value has to be used, since enfoo is inversely proportional to the number of provocations).
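The ranking rule described above – retaliation times the inverse of the deterrence opportunities – can be sketched with hypothetical tallies (names and numbers made up):

```clojure
;; Score = ?enf / ?opp: more retaliation is better, fewer opposing
;; provocations (relative to chances) is better.
(def players [{:name "A" :enf 6 :opp 12}
              {:name "B" :enf 4 :opp 5}])

;; sort descending by enf/opp
(sort-by #(- (/ (:enf %) (double (:opp %)))) players)
;; B (4/5 = 0.8) ranks ahead of A (6/12 = 0.5)
```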

Maybe I’ll get to some charts and analysis tomorrow, if the theory bug doesn’t bite me again.

I just added some built-ins for Jena rules in seabass that’ll let me figure out how many seconds/minutes/etc apart two times/dates/datetimes are, which lets me write a new definition for Enforcement that tracks that wikipedia definition a bit more closely:

[rule1: (?e1 nhl:influences ?e2) <-
        (?e1 nhl:localtime ?t1), (?e2 nhl:localtime ?t2),
        le(?t1, ?t2), diff-minute(?t1, ?t2, ?diff),
        lessThan(?diff, 5)  ]
nhl:Penalty	rdfs:subClassOf nhl:EnforcerAction .
nhl:Hit	rdfs:subClassOf nhl:EnforcerAction .
 construct { _:x a nhl:Enforcement . _:x nhl:game ?g .
                 _:x nhl:actor ?p1 . _:x nhl:value ?value }
where {
  select ?g ?p1 (count(?e) as ?value)
  where {
    ?g nhl:play ?e . ?g nhl:play ?e2 .
	?e nhl:agent1 ?p1 . ?e nhl:agent2 ?p2 .
	?e2 nhl:agent1 ?p2 . ?e2 a nhl:ViolentPenalty .
	?e a nhl:EnforcerAction . ?e2 nhl:influences ?e }
  group by ?g ?p1 }

Essentially, I stipulated that an earlier event influences a later event if they take place within five minutes. Best rule ever? Nope. Good enough? Maybe! So we have incidents of enforcement in a game when a Violent Penalty by a player influences an Enforcer Action against that player. Let’s see how this compares to the previous two definitions of enforcement:
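That five-minute window can be sketched in plain Clojure with java.time (assuming, as the rule does, that the minute difference is truncated the way diff-minute computes it inside the Jena engine):

```clojure
(import 'java.time.LocalTime 'java.time.Duration)

;; e1 influences e2 when e1 is no later than e2 and the gap is under
;; five minutes -- the same test rule1 performs with le and diff-minute.
(defn influences? [t1 t2]
  (and (not (.isAfter t1 t2))
       (< (.toMinutes (Duration/between t1 t2)) 5)))

(influences? (LocalTime/of 0 12 30) (LocalTime/of 0 15 0))   ;; => true
(influences? (LocalTime/of 0 12 30) (LocalTime/of 0 19 0))   ;; => false
```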

The normalization is all off, largely because it’s getting late.  I should in fact normalize this score on all the opportunities for enforcement, which is really all the Violent Penalties against the enforcer’s team.

This dataset is drawn from the first 20 games of the 1230-game season, so this certainly isn’t a representative sample.  The bad news is that 20 games of data is just about as much as my laptop will crunch before I get impatient.  The good news is that I can re-jigger the analysis process to pull enforcement facts out of the season at 10-game increments…sort of like a map-reduce, I guess.  Should be fun!

In my last post I suggested a graph pattern, or detection rule, to measure the enforcement that a hockey player is bringing to his team.  I’m going to modify the rule slightly, to track the wikipedia definition more closely:

prefix nhl: <>
construct { _:x a nhl:Enforcement . _:x nhl:game ?g .
            _:x nhl:actor ?p1 . _:x nhl:value ?value }
where {
  select ?g ?p1 (sum(?minutes) as ?value)
  where {
    ?g nhl:play ?e . ?g nhl:play ?e2 .
    ?e a nhl:EnforcerPenalty . ?e nhl:agent1 ?p1 . ?e nhl:agent2 ?p2 .
    ?e a ?class . ?class nhl:penaltyMinutes ?minutes .
    ?e2 nhl:agent1 ?p2 . ?e2 a nhl:ViolentPenalty .  }
  group by ?g ?p1}

While this captures the spirit of the wikipedia definition, it does depart from the apparent semantics by substituting ‘enforcer penalties’ for hits.  That is, a player is acting like an enforcer when he gets penalized for certain activities in retaliation for violent penalties against his team.  So, an alternative detection rule might be a better fit:

prefix nhl: <>
construct { _:x a nhl:Enforcement . _:x nhl:game ?g .
            _:x nhl:actor ?p1 . _:x nhl:value ?value }
where {
  select ?g ?p1 (count(?e) as ?value)
  where {
    ?g nhl:play ?e . ?g nhl:play ?e2 .
    ?e a nhl:Hit . ?e nhl:agent1 ?p1 . ?e nhl:agent2 ?p2 .
    ?e2 nhl:agent1 ?p2 . ?e2 a nhl:ViolentPenalty .  }
  group by ?g ?p1 }

Neither rule reflects an arguably crucial facet of the definition, namely the temporal ordering of events: the enforcer has to retaliate after a violent penalty (and most likely, within a certain time interval).  I’m going to leave the temporal ordering aside for now, just to compare the results of the detection rules thus far.

First, some ontology needs to be written to define what is meant by an enforcer penalty and a violent penalty.  I could consult a subject matter expert (i.e. Wikipedia).  However, I’ll settle for an arbitrary definition of these classes, as I’m still rather early in the analysis and can return to this later.  Here’s the ontology so far (in Turtle syntax, and omitting the prefix declarations):

nhl:EnforcerPenalty rdfs:subClassOf nhl:Penalty .
  nhl:CrossChecking    rdfs:subClassOf nhl:EnforcerPenalty .
  nhl:Fight            rdfs:subClassOf nhl:EnforcerPenalty .
  nhl:FightingMaj      rdfs:subClassOf nhl:EnforcerPenalty .
  nhl:Roughing         rdfs:subClassOf nhl:EnforcerPenalty .
  nhl:Unsportsmanlike  rdfs:subClassOf nhl:EnforcerPenalty .

nhl:ViolentPenalty rdfs:subClassOf nhl:Penalty .
  nhl:Charging         rdfs:subClassOf nhl:ViolentPenalty .
  nhl:HighSticking     rdfs:subClassOf nhl:ViolentPenalty .
  nhl:Roughing         rdfs:subClassOf nhl:ViolentPenalty .
  nhl:Slashing         rdfs:subClassOf nhl:ViolentPenalty .
  nhl:Unsportsmanlike  rdfs:subClassOf nhl:ViolentPenalty .

nhl:Charging                 nhl:penaltyMinutes	2 .
nhl:CrossChecking            nhl:penaltyMinutes	2 .
nhl:Fight                    nhl:penaltyMinutes 5 .
nhl:FightingMaj              nhl:penaltyMinutes	10 .
nhl:HighSticking             nhl:penaltyMinutes	2 .
nhl:Roughing                 nhl:penaltyMinutes	2 .
nhl:Slashing                 nhl:penaltyMinutes	2 .
nhl:Unsportsmanlike          nhl:penaltyMinutes	2 .
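For quick sanity checks at the REPL, the same penaltyMinutes facts can be held in a Clojure map (identical numbers to the Turtle above):

```clojure
(def penalty-minutes
  {:Charging 2 :CrossChecking 2 :Fight 5 :FightingMaj 10
   :HighSticking 2 :Roughing 2 :Slashing 2 :Unsportsmanlike 2})

;; a fighting major plus a roughing minor:
(reduce + (map penalty-minutes [:FightingMaj :Roughing]))
;; => 12
```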

I’m leaving out a few penalties, and definitely want to re-consider the use of these definitions in the detection rules – there’s a bit too much arbitrariness to these definitions.  Nonetheless, let’s wrap these up and see what kind of results we get.  The following queries will be used to define the target dataset:

prefix nhl: <>
prefix rdfs: <>
select ?name (?enfMinutes / ?totMinutes as ?value)
where { ?x a nhl:Enforcement . ?x nhl:actor ?player .
        ?x nhl:value ?enfMinutes . ?x nhl:game ?game . 
        ?game nhl:totalPenaltyMinutes ?totMinutes .
        ?player rdfs:label ?name }
order by desc(?value)
prefix nhl: <>
prefix rdfs: <>
select ?name (?enfHits/ ?totHits as ?value)
where { ?x a nhl:Enforcement . ?x nhl:actor ?player . 
        ?x nhl:value ?enfHits . ?x nhl:game ?game . 
        ?game nhl:totalHits ?totHits . ?player rdfs:label ?name }
order by desc(?value)

These queries attempt to normalize the enforcement in each game by dividing the enforcement score (the sum of enforcer penalty minutes, or the number of enforcement hits) by the corresponding game total (total penalty minutes, or total hits).  In the interest of time, I’m only running this over the first 10 games of the season.  The REPL activity using seabass and this code is:

(view (bounce enforcement1 m1))
(view (bounce enforcement2 m2))

The RDF models are m1 and m2, corresponding to the two different definitions of enforcement.  The enforcement query is a Sparql Select statement ‘bounced’ against each model using the seabass bounce function (and using Incanter’s view function to pop up some fancy Swing windows).  I’ll hold off on an analysis until I refine those detection rules, but putting the results side-by-side shows some overlap, though the top five results on the two lists are disjoint.  So it looks like there’s a significant semantic divergence between the two detection rules.  I wonder whether adding the temporal aspect to those rules would line them up much.

Last week I committed my first project to GitHub: seabass.  I wanted to work with RDF in a statistical computing environment, and was starting to learn R.  By luck, I happened on the Incanter library and ended up really digging Clojure.  It certainly wasn’t rocket science to write seabass, especially since I could slide Jena in underneath.

So why write something like this?  Well, working through an RDF dataset with a REPL is pretty sweet.  Jena gives me their general-purpose rule engine to write axioms (particularly the kind that OWL makes cumbersome), and Incanter gives me all kinds of stats stuff (as well as some charting).  And hell, there’s always Rincanter in case I really want some R functions.

Bah, enough tag-dropping… this should do for a Hello World post 🙂