You are currently browsing the category archive for the ‘RDF/SPARQL’ category.

Looks like Egon updated his rrdf library (an RDF library for R) to 1.2, which now includes remote sparql queries and constructs. Pretty neat!  I’ve been meaning to get into R, and keep getting pulled away by other shiny things.  For example, my employer got me into a Cloudera class this week to get all MapReducey.  Lots of fun, but I really have to put my head down and learn that damned language, even with its lack of parentheses.

Speaking of MapReduce, I wonder if there’s a good SPARQL/RDF library out there to support MapReduce jobs.  I’d think you’d want code that was optimized for queries against graphs that take up either 64MB or 128MB (the two most common block sizes).  I’m guessing that a fact in an N-TRIPLES file takes 1/10KB, so the graphs would have either 640,000 or a bit over 1.2 million facts in them… that’s a pretty healthy volume of information, but we’re not talking DBpedia-sized here.
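
Just to make that back-of-the-envelope arithmetic explicit, here's the calculation in Clojure (the ~100 bytes per fact is my guess above, not a measured figure, and I'm using decimal megabytes to keep the numbers round):

(defn facts-per-block [block-mb]
  ;; assumed ~100 bytes (1/10 KB) per N-TRIPLES fact
  (/ (* block-mb 1000 1000) 100))

(facts-per-block 64)
>> 640000
(facts-per-block 128)
>> 1280000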

Finally, my search for mathematical podcasts is bearing fruit – looks like Sam Hansen at ACME Science has a slew of podcasts right up my alley.  Hell, his most recent podcast is an interview with John Cook (of The Endeavor fame).  In fact, he’s got a new one in the works that looks to be focused on stories in the world of mathematicians – if you’re curious, check out http://bit.ly/relprime.

More than finally, this is my first post on a linux box.  I took the plunge – or at least dipped my toes in the Linux waters – by installing an Ubuntu partition on my laptop with Wubi.  There have been a few hiccups so far while learning about file permissions and tar files, but I managed to run the seabass tests successfully, so I’m in a good place.

I was prepping my hockey data today to try out a few machine learning algorithms, and ran into a bit of a pickle.

One of the metrics I wanted to extract was the number of hits per team per game, so I tapped in this query:

(def team-hits "
prefix : <http://www.nhl.com/>
construct { _:x a :HitCount; :team ?team; :game ?g; :value ?value}
{ select ?g ?team (count(?e) as ?value)
  { ?g :team ?team .
     optional { ?g :play ?e  . ?e a :Hit;  :team ?team .}
  }
  group by ?g ?team }
")

The query took a little longer to run than expected, but I chalked that up to cosmic radiation. When I looked at my results, though, I was a bit shocked: over 300,000 answers! Sure, there are 1230 games in a season, and I was extracting 32 facts for each one, but that’s a whole order of magnitude lower than what I was seeing.

After a little while full of WTFs and glances at the SPARQL spec for aggregations over optional clauses, it hit me – I had overloaded the nhl:team property.  Both games and events had nhl:team facts (nhl:team is a super-property of both nhl:hometeam and nhl:awayteam), so the pattern ?g :team ?team was happily binding ?g to events as well as games:

nhl:game-1 nhl:hometeam nhl:team-10 .
nhl:game-1 nhl:awayteam nhl:team-8 .
nhl:eid-TOR51 nhl:team nhl:team-10 .

I had a choice: either I could redo my JSON parse or my SPARQL query.  Both required about the same amount of effort, but a full parse of all the game files would take over 40 minutes.  Thus, I rewrote the query:

(def team-hits "
prefix : <http://www.nhl.com/>
construct { _:x a :HitCount; :team ?team; :game ?g; :value ?value}
{ select ?g ?team (count(?e) as ?value)
  { ?g a :Game; :team ?team .
     optional { ?g :play ?e  . ?e a :Hit;  :team ?team .}
  }
  group by ?g ?team }
")

To make this work, I had to add facts asserting that each game is an instance of the nhl:Game class.  Fortunately, I had two properties whose domain was precisely hockey games:

nhl:hometeam
  rdfs:domain nhl:Game ;
  rdfs:range nhl:Team ;
  rdfs:subPropertyOf nhl:team .

nhl:awayteam
  rdfs:domain nhl:Game ;
  rdfs:range nhl:Team ;
  rdfs:subPropertyOf nhl:team .
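
For the record, the inference doing the work here is the standard RDFS domain entailment rule (usually called rdfs2), which in Jena rule syntax would look something like this:

[rdfs2: (?p rdfs:domain ?c), (?x ?p ?y) -> (?x rdf:type ?c)]

So from nhl:hometeam rdfs:domain nhl:Game and nhl:game-1 nhl:hometeam nhl:team-10, a reasoner adds nhl:game-1 a nhl:Game – exactly the triple the rewritten query’s ?g a :Game pattern needs.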

As luck would have it, somebody at tonight’s SemWeb meetup group wondered aloud whether anybody found rdfs:domain and rdfs:range useful.  Neat-o.

A bit of discussion on URIs in the T-Box has been popping up lately on the W3C’s Semantic Web mailing list.  At issue in a thread on “Best Practice for Renaming OWL Vocabulary Elements” is whether T-Box URIs should be somewhat meaningful names, like foo:fatherOf, or meaningless names, like foo:R9814.

There are pros and cons for both sides.  Meaningful T-Box URIs are easier to use when making ontologies and writing queries.  However, it’s very difficult to change a URI after you’ve created it (you may later become dissatisfied with an unforeseen connotation of your URI).  Moreover, you have to accept that the language you’re using (like English or Tagalog) may not be understood by a user.

Meaningless T-Box URIs have the pros and cons reversed – harder for creating ontologies and writing queries, easier for lifecycle management and (in theory) buy-in from non-native speakers.  To sweeten the deal for these meaningless URIs, advocates point out that tools can be written to correct for the difficulties in ontology and query authoring.

This all brings to mind a division of labor in the ontology/semantic web community, which you might call A-ontology and T-ontology (tracking the distinction between the A-Box and T-Box in description logics).

A-ontology is focused on analyzing data, leveraging the implicit semantics found in datasets.  Ontologies are a tool for declaring semantics so that

  • It’s clear what assumptions and interpretations have been made in the analysis
  • A reasoner can read and process the additional semantics

There’s no community effort to re-use the ontology going on, so these ontologies are narrowly purpose-driven.  That’s not to say the semantics come cheap – a data analysis is ill-served by a rush to charts and graphs.

T-ontology is a bit different, primarily focused on sharing re-usable terminology.  The ontology is not a tool, but rather the product of T-ontology work.  Communities of interest and stakeholders factor into the development of the ontology, since they and the people they represent are going to be using the fruits of this labor.

These two kinds of ontology work intersect in practice.  A-ontology will often use data that’s been published with semantics governed by the product of T-ontology.  If significant results are found when performing A-ontology work, the analysts may consider publication of their results, meaning a re-consideration of terminology and alignment with community standards.

A realization of the Linked Data dream of a globally-distributed data space is an undoubtedly good thing.  If meaningless T-Box URIs help this dream along, then we just need to be sure we’re not crimping the style of A-ontology.  If tools have to be written, then they need to fit the workflow of A-ontology before changes to RDF or SPARQL are made (and most modern data analysis takes place at the command line with scripts on the side – GUIs and faceted browsers won’t find a large audience here).

As things currently stand (with RDF 2004 and Sparql 1.1), meaningless URIs would overburden A-ontology efforts.  It’s hard to imagine how I’d productively use the rrdf library (or rdflib+scipy or seabass+incanter or any ontology and data analysis library combination) if I had to deal with meaningless URIs in the T-Box.

When I first started playing around with Clojure, I was thinking a bit about orderings and how (according to Ellis’ theory of measurement) they’re fundamental to the entire notion of quantity.  Essentially, you can only have a quantity if you can say that something is bigger than something else in some way.

In RDF we represent orderings (like everything) with triples.  Since I only cared about a single ordering relation, it was safe to use ordered pairs to do this.  So, the first thing I wrote was a little function to produce a map of ordered pairs.

(defn bar
  "build a successor map over the keywords :1 through :(num+1),
  where each key's value is the next element in the ordering"
  [num]
  (let [ a (map #(keyword (str %)) (range 1 (+ num 1)))
         b (map #(keyword (str %)) (range 2 (+ num 2))) ]
    (zipmap a b) ))

And a function to produce a vector of the elements of those ordered pairs in the proper order:

;; difference (and union, used below) come from clojure.set
(use '[clojure.set :only [difference union]])

(defn infima [m] (first (difference (set (keys m)) (set (vals m)))))
(defn reduction [m]
  (let [ ans [(infima m)] ]
    (loop [ L ans ]
      (let [ N (m (peek L)) ]
        (if (nil? N)
          (vec L)
          (recur (conj L N)) )))))

For example, I could generate the ordered pairs for the numbers one through five, and then order them up.

(bar 4)
>> {:4 :5, :3 :4, :2 :3, :1 :2}
(reduction (bar 4))
>> [:1 :2 :3 :4 :5]

The performance was really good – I could recover an ordering out of 1,200,000 ordered pairs in 22 seconds.  Woo, right?

Well, since I was doing so well, I thought about the case where I ran into a triple store that declared the ordering relation I cared about to be transitive.  For example, I might want to recover the maternal ancestry line of Amanda Seyfried.  Let’s say I found a triple store with the relevant information, with the relevant property named hasMaternalAncestor.

If no reasoner is working behind the scenes, I could use my reduction function above.  However, I’ll need something else if hasMaternalAncestor is declared to be transitive.  And since linear orderings are transitive by definition, it’s not beyond belief to run into this situation.  So I wrote the following functions to do my dirty work:

(defn agg
  "given a collection of ordered pairs and an element e,
  get a collection of the elements z s/t [e z] is in the collection"
  [element coll]
  (map #(get % 1) (filter #(= (get % 0) element) coll)))
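
For instance:

(agg :1 [[:1 :2] [:1 :3] [:2 :3]])
>> (:2 :3)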

(defn transitive-reduction
  "rank each element by the number of successors it has in the
  ordered-pairs (more successors means earlier in the ordering),
  then pull the elements out of that ranking and return them as
  an ordered vector"
  [ordered-pairs]
  (let [ elements (union
                    (set (map #(get % 0) ordered-pairs))
                    (set (map #(get % 1) ordered-pairs)) )
         ranked   (rseq (vec (into (sorted-map)
                        (zipmap (map #(count (agg % ordered-pairs))
                                  elements) elements))) )
         ordered  (map #(get % 1) ranked) ]
    (vec ordered) ))

And a new function to generate the appropriate pairs of “integers”:

(defn foo
  "generate every ordered pair [:i :j] with i < j over 1..num --
  that's num * (num - 1) / 2 pairs in all"
  [num]
  (let [ v (vec (range 1 (+ 1 num))) ]
    (loop [ L #{}, V v ]
      (if (empty? V)
        (vec L)
        (recur (union L (set (map #(vector (keyword (str %))
                  (keyword (str (peek V)))) (pop V)))) (pop V)) ))))
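
A quick check that foo behaves (the order of the pairs may vary, since they accumulate in a set):

(foo 3)
>> [[:2 :3] [:1 :3] [:1 :2]]
(transitive-reduction (foo 3))
>> [:1 :2 :3]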

So I tried out a transitive reduction of the pairs for the integers from 1 to 1200.  Heck, if I can reduce over a million in 22 seconds, 1200 should be nothing, right?

(println (time (take 10 (transitive-reduction (foo 1200)))))

Three minutes later, I get my truncated answer.  I looked at my code over and over, trying to figure out what I did wrong.  I walked away, took a nap, and came back.  Still three minutes.  Oi!

Finally, I decided to figure out just how hard my algorithm was working.  How many pairs were involved in a transitive reduction of (foo 1200)?  Since (foo 1200) generates every pair [i j] with i < j, that’s 1200 x 1199 / 2 = 719,400 pairs – and agg scans every one of them once for each of the 1,200 elements, which works out to roughly 863 million comparisons.

Dang, that’s a lot.  No wonder it took three minutes!  So yeah, counting is important.
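
The REPL makes this kind of sanity check cheap – counting first would have saved me a nap:

(count (foo 1200))
>> 719400
(* 1200 (count (foo 1200)))
>> 863280000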

I ran into this last night when I was looking at my definition of enforcement.  Initially, the plan was to look at a game and measure the enforcement for each player active in the game, and later aggregate these for each game for each team.

It sounded reasonable at first blush, until my queries and model building started taking way too long.  After too much time spent staring, I finally decided to count the triples:


~30 players x 2 teams x 1230 games x 4 facts per player = 295,200 facts

Yep, way too much.  My whole dataset is 50Mb on disk, and I was producing a 30Mb N-triples file to work with.  Oi.

Having realized the error of my ways, I kept the same enforcement scale but only kept track of the players that actually took part in an enforcement.  So instead of 30ish players per team per game, I only tracked 3 or 4.  On top of that, I immediately extracted the team’s enforcement value for that game, meaning I only had


4 players x 2 teams x 1230 games x 4 facts per player = 39,360 facts

Much better.  This may be one of the first lessons I teach my son when he’s old enough to start counting.

And by best, I mean the most physical as measured by hits.  In a previous post, I thought it would be interesting to predict the most physical games in a season based on the previous actions of the players.  So a good first step is to measure each game by its physicality and save off the data.

Unfortunately, the raw data I’m working with doesn’t provide identifiers for the games themselves.  People, teams, and game events (like hits) get id’s, but the games end up with a blank node.  Kind of ruins the whole ‘list the games by physicality’ idea, so on to plan B: pairings.

Since teams get identifiers in the dataset, I could get the home and visiting teams for each game.  Most pairs of teams meet more than twice in a season, so a pairing for “New York Rangers at Buffalo Sabres” will include two games, as will “Buffalo Sabres at New York Rangers”.  Not all pairings will be symmetric, though – the Rangers hosted the LA Kings for the only match-up between the two teams in 2010-11.

The game data is broken up into 1230 n-triples files, each corresponding to a game.  I ended up naming them file-1.nt, file-2.nt, etc.  This is convenient for defining functions that build an RDF model from games N through M (e.g. file-23 to file-100):

(defn get-data [n m] (map #(str "data/file-" % ".nt") (range n (+ 1 m))))
(defn get-model [n m] (apply build (get-data n m)))
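
For example:

(get-data 23 25)
>> ("data/file-23.nt" "data/file-24.nt" "data/file-25.nt")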

Since physicality is based on the hits recorded in each game, a good place to start is a construct query I called ‘game-hits’.

(def game-hits "
  prefix nhl: <http://www.nhl.com/>
  construct { _:z a nhl:GameHits . _:z nhl:hometeam ?x .
	      _:z nhl:awayteam ?y . _:z nhl:game ?g . _:z nhl:value ?v }
	{ select ?x ?y ?g (count(?h) as ?v)
	  {  ?g nhl:hometeam ?x . ?g nhl:awayteam ?y .
	     ?g nhl:play ?h . ?h a nhl:Hit . filter(?x != ?y)}
	  group by ?x ?y ?g }
")

This construct has to be pulled from each file of game data I have for the 2010-11 season, so I next write two short functions to run through each game file and execute the construct, iteratively building a model of just the facts I need.  This way, I keep the model size down.

(defn build-game-model [s]
  (let [ n1 (get-model (first s) (last s)) ]
    (build (pull game-hits n1)) ))

(defn build-model [n m binsize]
  (reduce build (map build-game-model (partition-all binsize (range n (+ m 1))))) )
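
To make the binning concrete, here’s what partition-all does to the game numbers – (build-model 1 12 5) would build three small models and merge them:

(partition-all 5 (range 1 13))
>> ((1 2 3 4 5) (6 7 8 9 10) (11 12))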

Finally, I’m ready to save off my first batch of summary data: the number of hits for each home-away team pairing, and the number of games that constitute that pairing.

(def teams-hits "
prefix nhl: <http://www.nhl.com/>
construct { _:z a nhl:HitsSummary . _:z nhl:hometeam ?x .
	    _:z nhl:awayteam ?y . _:z nhl:games ?games . 
        _:z nhl:value ?value }
	{ select ?x ?y (count(?g) as ?games) (sum(?v) as ?value)
	  { ?z a nhl:GameHits . ?z nhl:hometeam ?x .
	    ?z nhl:awayteam ?y . ?z nhl:game ?g . ?z nhl:value ?v . 
        filter(?x != ?y)}
	group by ?x ?y	}
")

I’ve described how I pulled geo-coordinates for each team in a previous post, so I’ll skip the details and note the final construct query that joins the DBpedia data with my raw NHL data:

(def city-coords "
prefix nhl: <http://www.nhl.com/>
prefix dbo: <http://dbpedia.org/ontology/>
prefix dbp: <http://dbpedia.org/property/>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
construct { ?x nhl:latitude ?lat . ?x nhl:longitude ?lon . 
            ?x nhl:cityname ?cityname . ?x nhl:name ?a }
{ ?x a nhl:Team . ?y a dbo:HockeyTeam .
  ?x rdfs:label ?a . ?y rdfs:label ?b . filter(str(?a) = str(?b))
  ?y nhl:latitude ?lat . ?y nhl:longitude ?lon . ?y dbp:city ?cityname}
")

Finally, I have a select query that defines the information I want for my physicality scale (with geo-coordinates thrown in for the home team for kicks):

(def hits-summary-names "
prefix nhl: <http://www.nhl.com/>
select ?visitor ?home (?v / ?g as ?avg) ?lat ?lon
{ ?z a nhl:HitsSummary . ?z nhl:hometeam ?x . 
  ?x nhl:latitude ?lat . ?x nhl:longitude ?lon .
  ?z nhl:awayteam ?y . ?z nhl:games ?g . ?z nhl:value ?v . 
  filter(?x != ?y) .
  ?x nhl:name ?home . ?y nhl:name ?visitor }
order by desc(?avg)
")

At this point I’m ready to run my program, which consists of just a few lines of code (team-city-constr, dbp, and make-geo-facts come from the geocoding post below; the full definitions are on GitHub):

(stash (pull teams-hits (build-model 1 1230 5)) "files/teams-hits.nt")
(stash (pull team-city-constr dbp) "files/team-city.nt")
(make-geo-facts)
(stash (pull city-coords 
          (build "files/geo-facts.nt" "files/team-city.nt" 
                 "files/ontology.ttl" "files/nhl.rules" (get-model 1 35))) 
  "files/city-coords.nt")
(def m (build "files/city-coords.nt" "files/teams-hits.nt"))
(view (histogram :avg :data (bounce hits-summary-names m) :nbins 5))
(view (bounce hits-summary-names m))

The distribution of physicality per pairing:

The top pairings by physicality:

The top teams by physicality:

The full code can be found on GitHub.  The next step is to figure out what factors can be used to predict this physicality distribution.  Enforcement will be one of them, but it would help to have a slew of others to mix in.  At this point, I’m thinking a principal components analysis would be interesting to run.

Reading through the internet’s definition of bigotry, I find myself nodding along as I make a comparison to my views on the Web Ontology Language.  Unfortunately, the definition of bigotry I found makes no mention of weakly-grounded intolerance, which is how I’d describe my own feelings.

There’s clearly nothing wrong with OWL (or OWL-2) as a logical system.  I’ve nothing against disjoint unions, datatype properties, or n-ary datatypes.  There are plenty of reasoners that cater to OWL semantics, and many IDEs (e.g. Protege and TopBraid Composer) weave OWL reasoning right into the developer’s workflow. It’s even defined in the RDF framework, so that using OWL is only a namespace declaration away.

My problem is that I never use OWL.  When working with data, I’ll often write an ontology to hold the semantics I’m adding to a model.  However, that ontology will virtually always be based on RDF/S – classes, domains, ranges, labels, and sequences constitute a pretty powerful palette.  When I want something over and above this, what I usually need is a rule system like Jena rules, SPIN, Common Logic, or RIF.

Bigotry may not be the word I’m looking for – the wiki page in fact makes an interesting suggestion that I may be suffering from RDF Purism.  Indeed, I do feel inexplicably better if I manage to express my ontology using only RDF/S.  Perhaps this is the same kind of irrational tendency that drives some people to prefer hammers to nail guns, others to write Lisp instead of Java, and a few to favor kites over model rockets – a tendency towards simplicity.

I’ve no idea how I’d write a knock-down argument for simplicity in semantic tooling, and it probably comes down to personal preference.  Lots of people like RDF/XML, TopBraid Composer, and Java, and I’d never suggest anything to be wrong with this tool kit.  Me, I like Turtle, Notepad++, Clojure, and a REPL.

I was all ready to grab some geo-data and see where I could find a good physical game of hockey.  I’d browsed DBpedia and saw how to connect NHL teams, the cities they play in, and the lat-long coordinates of each.  I was so optimistic.

Then it turns out that DBpedia wanted me to do a special dance.  I’ve been quite happy with the features of the upcoming Sparql 1.1 spec.  Since ARQ stays on top of the spec, I’ve managed to forget what Sparql 1.0 was missing.  Well, ‘if’ clauses for one, but I managed to design around that in my last post.  A real sticking point, though, was the inability to wrap a construct query around a select query, like so:

prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dbo: <http://dbpedia.org/ontology/>
prefix dbp: <http://dbpedia.org/property/>
prefix nhl: <http://www.nhl.com/>
construct { ?team nhl:latitude ?v . ?team nhl:name ?name }
{	select ?team ?name ( 1 * (?d + (((?m * 60) + ?s) / 3600.0)) as ?v)
	{ 	  ?team a dbo:HockeyTeam . ?team rdfs:label ?name .
		  ?team dbp:city ?cityname . ?city rdfs:label ?cityname .
		  ?city dbp:latd ?d; dbp:latm ?m; dbp:lats ?s .
		  filter ( lang(?name) = 'en') }}

The reason this is critical is that you can’t inject those arithmetic expressions into a construct clause.  And since I plan on working with the resulting data using Sparql, simply using select queries isn’t going to do it.
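
The arithmetic itself is just the usual degrees-minutes-seconds conversion, easy enough to sanity-check at the REPL (the helper name is mine):

(defn dms->decimal [d m s]
  ;; decimal degrees = degrees + (minutes * 60 + seconds) / 3600
  (+ d (/ (+ (* m 60) s) 3600.0)))

(dms->decimal 40 46 48)
>> 40.78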

Thus, we need to break down the steps a bit more finely.  First, I’ll pull out the basic triples I intend to work with:

prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dbo: <http://dbpedia.org/ontology/>
prefix dbp: <http://dbpedia.org/property/>
prefix nhl: <http://www.nhl.com/>
construct { ?team a dbo:HockeyTeam . ?team rdfs:label ?name .
                ?team dbp:city ?cityname . ?city rdfs:label ?cityname .
                ?city dbp:latd ?d; dbp:latm ?m; dbp:lats ?s . }
   {   ?team a dbo:HockeyTeam . ?team rdfs:label ?name .
        ?team dbp:city ?cityname . ?city rdfs:label ?cityname .
        ?city dbp:latd ?d; dbp:latm ?m; dbp:lats ?s .
        filter ( lang(?name) = 'en') }

And crap – the data doesn’t fit very well.  Looks like the city names associated with hockey teams don’t cleanly match up to the cities we’re looking for in DBpedia.  Time for a second refactor…

After a few minutes of staring at the ceiling, I realized that I could use Google’s geocoding service to do my bidding.  Since their daily limit is 2500 requests, my measly 50ish cities would be well under.  So first, I grab just the info I need out of DBpedia – hockey teams and the cities they’re associated with:

 prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
 prefix dbo: <http://dbpedia.org/ontology/>
 prefix dbp: <http://dbpedia.org/property/>
 prefix nhl: <http://www.nhl.com/>
 construct { ?team a dbo:HockeyTeam . ?team rdfs:label ?name .
                 ?team dbp:city ?cityname . ?city rdfs:label ?cityname . }
      { ?team a dbo:HockeyTeam . ?team rdfs:label ?name .
         ?team dbp:city ?cityname . ?city rdfs:label ?cityname .
         filter ( lang(?name) = 'en') }

And use this query (bound to a var named team-city, in a def like the ones above) with a bit of Clojure to pull out my geocoding facts, saving them off as an n-triples file:

prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dbo: <http://dbpedia.org/ontology/>
prefix dbp: <http://dbpedia.org/property/>
prefix nhl: <http://www.nhl.com/>
select distinct ?team ?name ?city ?cityname
{ ?team a dbo:HockeyTeam . ?team rdfs:label ?name .
  ?team dbp:city ?cityname . ?city rdfs:label ?cityname .
  filter ( lang(?name) = 'en') }
;; assumes clojure.string :as string and clojure.data.json :as json
;; are required in the namespace
(defn get-geo-fact [row]
  (let [ n   (string/replace (:cityname row) " " "+")
         x   (json/read-json (slurp (str
                "http://maps.googleapis.com/maps/api/geocode/json?address="
                n
                "&sensor=false")))
         g   (:location (:geometry (first (:results x))))
         lat (str "<" (:city row) ">"
                  " <http://www.nhl.com/latitude> "
                  (:lat g) " ." )
         lon (str "<" (:city row) ">"
                  " <http://www.nhl.com/longitude> "
                  (:lng g) " ." ) ]
    [lat lon] ))

(defn make-geo-facts []
  (let [ a (bounce team-city dbp)
         f "files/geo-facts.nt" ]
    (spit f (string/join "\n" (flatten (map get-geo-fact (:rows a))))) ))

The results are created with the following two calls at the REPL:

(stash (pull team-city-constr dbp) "files/team-city.nt")
(make-geo-facts)

Now that I have geo-data, I can finish with hits-per-game as a rough cut at a physicality scale for games, and see where the action was this season.  I wonder if the Islanders and Rangers still go at it.

A few posts back, I wrote the following as a rule for detecting enforcement events:

An event is an enforcement if it is an Enforcer Action taking place within 2 minutes after a Violent Action against one of the teammates of the Enforcer.

Since then, a few parts of this rule have been bugging me:

  • what’s an enforcer action?
  • why two minutes?
  • what’s a violent action?

So I’m going to take a step back and look at this again.  I’m defining enforcement event to be a manifestation of an enfoo quantity (like gravitational attraction would be for a mass quantity).  If these manifestations are what makes somebody an enforcer, then it may be that a pattern amongst these events can be extracted that determines an ordering of players sufficiently close to the original crude orderings of enforcers. So,

  1. hockey players can be crudely ordered according to their capabilities as an enforcer.
  2. I dub this capability enfoo.
  3. I hypothesize that enfoo manifests in certain ways that can be analyzed to yield an ordering of players according to their degree of enfoo.
    • That is, I hypothesize that enfoo can be measured via enfoo manifestations, which I call enforcements.
  4. A scale for measuring enfoo via enforcements is defined, and is called the barce scale.
  5. Barce measurements are made on NHL play-by-play data, and the resulting player orderings are compared to crude orderings.
    • If the barce orderings track the crude orderings closely, then the barce scale captures the intuitions of the creators of the crude orderings (i.e. experts).
    • If the barce and crude orderings are very different, then all is not lost.  If the semantics of the barce scale are interesting, then the scale provides a new perspective on the concept of enforcement.

This all tracks the 1968 work of Brian Ellis quite well (this would be a fundamental derived measurement in his theory).  However, not everything rests on solid ground here.  Ellis didn’t really discuss social quantities (like enfoo), instead concentrating on physical quantities like mass, pressure, and electrical resistance.  Thus, the idea of looking at events (manifestations of enfoo) brings with it some difficulties.

One difficulty is that the units of the barce scale are quite unlike grams and meters.  When defining the meter, experts talk about how far light travels in a vacuum in a very short period of time.  The kilogram is based on an actual object known as the IPK (International Prototype Kilogram).

The barce scale introduces two kinds of difficulties, which I alluded to at the top of this post: subjectivity and causal influence.

The events of a hockey game, such as penalties and goals, are only identified by certain officials on the ice and in the booth.  It doesn’t matter whether you or I think a hit was a cross-check, even if the officials for that game agree with us the next day – if an official doesn’t identify the hit as a cross-check (thus penalizing the hitter) during the game, then it’s not a cross-check.  The events of a hockey game, then, cannot be determined without knowing the officials’ in-game decisions.  Fortunately, these decisions are themselves objectively verifiable due to their documentation.

A more worrisome difficulty is the detection of causal influence.  Enforcements are defined as events that are in some sense caused by a provocation by a member of the other team.  For example, we might consider the following causal influences:

  1. An enforcement would not have occurred had it not been for the provocation.
  2. A provocation increased the probability of an enforcement occurring.
  3. A provocation inclined a player to perform an enforcement.

An added difficulty builds on the problem with subjectivity mentioned above: a provocation may not be noticed by an official, but nevertheless be noticed by a player who subsequently goes after the provoker.

I think one way to resolve this is to take a page out of moral theory.  Let’s say that an enforcer is a player who goes after players on the other team who do something provocative.  Let’s further say that there are good definitions for what it means to go after somebody and to be provocative in the context of a hockey game.  Then we expect that a good enforcer should go after opponents who are being provocative.  In other words, the more a player retaliates, the better he is at enforcing.

The form in use here looks like this:


"quantity Q manifest as event E"
(Q manifests-as E)

"things with Q should perform events of type E"
(x has Q) => should(x do E)

"all things being equal, x has more Q having performed e"
Q(x|e) > Q(x|~e) c.p.

It’s pseudo-logic, for sure, but hopefully the intent is somewhat clear.  Looking at that pseudo-logic, there’s nothing that prevents a quantity from manifesting in more than one way.  And that’s a good thing, because good enforcers do more than retaliate – they also deter provocations.  If a player has a reputation for retaliating harshly, then opponents may be less likely to provoke.  This result is even better for the enforcer’s team, since an enforcer’s retaliations sometimes land him in the penalty box.  Thus, we can say that a good enforcer plays in games where the opposing side doesn’t often provoke.  In other words, the less an opposing team provokes, the better an enforcer is doing.

I updated the analysis file with a new query to reflect this new bi-valued barce scale:

prefix nhl: <http://www.nhl.com/>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select ?name (sum(?enfActs) as ?enf)  (sum(?totChances) as ?opp)
where { ?x a nhl:Enforcement . ?x nhl:actor ?player .
  ?x nhl:value ?enfActs . ?x nhl:game ?game .
  ?y a nhl:ViolentPenaltyTally . ?y nhl:game ?game .
  ?y nhl:team ?otherTeam . ?y nhl:value ?totChances .
  ?z a nhl:GameRoster . ?z nhl:game ?game . ?z nhl:player ?player .
  ?z nhl:team ?team . ?player rdfs:label ?name .
  filter( ?team != ?otherTeam) }
group by ?name

The retaliation value is ?enf and the deterrent value is ?opp.  I haven’t gotten around to changing the variable names to something more suitable, so don’t read too much into them yet.

I also haven’t changed the definitions of nhl:Enforcement, nhl:ViolentPenalty, or nhl:EnforcerAction yet, so those are still on the docket for review.  I also opted to skip the Sparql Construct I used for the other two barce formulations and just spit the measurement out into a result set (which is interpreted as an Incanter dataset in seabass).  For purposes of ranking, the two values for this barce measurement can be multiplied (the inverse of the deterrent value has to be used, since enfoo is inversely proportional to the number of provocations).
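
Here’s a sketch of that ranking arithmetic (the function name is mine, and a season with zero opposing provocations is left undefined, since the inverse blows up):

(defn combined-barce [enf opp]
  ;; retaliation value times the inverse of the deterrent value
  (when (pos? opp) (/ enf opp)))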

Maybe I’ll get to some charts and analysis tomorrow, if the theory bug doesn’t bite me again.

My last post had some definitions to frame this analysis of enforcement (or, I suppose, the measurement of enfoo), so I may as well go ahead and formalize them.

Enforcement

 prefix nhl: <http://www.nhl.com/>
 construct { _:x a nhl:Enforcement . _:x nhl:game ?g .
 _:x nhl:actor ?p1 . _:x nhl:value ?value }
 where {
 select ?g ?p1 (count(?e) as ?value)
 where {
 ?g nhl:play ?e . ?g nhl:play ?e2 .
 ?e nhl:agent1 ?p1 . ?e nhl:agent2 ?p2 .
 ?e2 nhl:agent1 ?p2 . ?e2 a nhl:ViolentPenalty .
 ?e a nhl:EnforcerAction . ?e2 nhl:influences ?e }
 group by ?g ?p1 }

The enforcement facts tally up the number of retaliations performed by a player during a game.  The definition of an enforcement fact on a game-by-game basis (as opposed to period-by-period or minute-by-minute) is somewhat arbitrary.  However, there’s a lot of convenience here, since it’s easy to argue that hockey is most naturally partitioned by games.  If another temporal basis made sense, here is where the first definition adjustment would be made.

I should note that there’s been a slight departure from the previous post’s definition of Enforcer Action.  Previously, I only included a few kinds of penalties to count as enforcer actions (fighting, cross-checking, roughing, and unsportsmanlike conduct).  I went ahead and included hits, since it made sense as I was typing out the Construct queries – after all, enforcers that get away with a good retaliatory hit are doing their job better, right?

influences

[rule1: (?e1 nhl:influences ?e2) <-
    (?e1 nhl:localtime ?t1), (?e2 nhl:localtime ?t2),
     le(?t1, ?t2), diff-minute(?t1, ?t2, ?diff),
     lessThan(?diff, 2)  ]

I defined the relation influences to tie events together when one takes place within two minutes before another.  I couldn’t think of much else in the dataset to represent when one event is a significant cause for another, and a two minute time frame should limit the number of retaliations against a player.  Of course, it’s still an arbitrary time limit, and it’d be interesting to explore other ways of identifying causal influence.

The rest of the details can be found here on github.  As an aside, one thing I’ve learned is to keep the repository names short and sweet; this repository is just one huge hyphenated mess.

If you take a gander at the analysis.clj file, you’ll see that I changed up the manner in which the models are built up, favoring a map-reduce strategy.  I played around with pmap to see if I could get a bump out of parallelization, but ran into threading problems with the reasoner.  I’m definitely not a parallelization kind of guy, so I stuck with what worked for me 🙂

There are two scales defined in the analysis file:

barce

prefix nhl: <http://www.nhl.com/>
construct { ?player nhl:barce ?value }
where {
  select ?player (sum(?enfActs) as ?value)
  where {
    ?x a nhl:Enforcement . ?x nhl:actor ?player .
    ?x nhl:value ?enfActs  }
  group by ?player }

barce-opp

prefix nhl: <http://www.nhl.com/>
construct { ?player nhl:barce-opp ?value }
where {
  select ?player (sum(?enfActs / ?totChances) as ?value)
  where {
    ?x a nhl:Enforcement . ?x nhl:actor ?player .
    ?x nhl:value ?enfActs . ?x nhl:game ?game .
    ?y a nhl:ViolentPenaltyTally . ?y nhl:game ?game .
    ?y nhl:team ?otherTeam . ?y nhl:value ?totChances .
    ?z a nhl:GameRoster . ?z nhl:game ?game .
    ?z nhl:player ?player . ?z nhl:team ?team .
    filter( ?team != ?otherTeam) }
  group by ?player }

The barce scale is pretty straightforward – count up all the enforcement facts for a player over all the games in the model you’re analyzing.  The advantage is in its simplicity, both for query performance and for clarity – it’s fairly easy to explain the meaning of barce measurements to your friends.

The barce-opportunity scale is a bit more complicated.  In earlier posts, I referred to a normalization of the scale, and had this in mind.  Problem is, this isn’t normalization at all.  Rather, for any particular game and player, the barce-opportunity of that player in that game is the percentage of times that the other team did something dodgy and the enforcer retaliated.  So if the opposing team provoked the enforcer’s team four times, and the enforcer retaliated twice, his barce-opportunity for that game is one-half.

For an overall barce-opportunity measure across games, I just added them up.  The problem is that measurements on the scale can no longer be understood as percentages (since you can’t just add percentages willy-nilly).  I suppose I could average them to keep the values between zero and one, which would have the advantage of a clear interpretation for measurements across games:

prefix nhl: <http://www.nhl.com/>
construct { ?player nhl:barce-opp ?value }
where {
  select ?player (avg(?enfActs / ?totChances) as ?value)
  where {
    ?x a nhl:Enforcement . ?x nhl:actor ?player .
    ?x nhl:value ?enfActs . ?x nhl:game ?game .
    ?y a nhl:ViolentPenaltyTally . ?y nhl:game ?game .
    ?y nhl:team ?otherTeam . ?y nhl:value ?totChances .
    ?z a nhl:GameRoster . ?z nhl:game ?game .
    ?z nhl:player ?player . ?z nhl:team ?team .
    filter( ?team != ?otherTeam) }
  group by ?player }

I’ll have to think on that a bit more.  For now, I’m fairly content with the definitions, and will next see what kinds of charts and patterns fall out of all this.

I just added some built-ins for Jena rules in seabass that’ll let me figure out how many seconds/minutes/etc apart two times/dates/datetimes are, which lets me write a new definition for Enforcement that tracks that wikipedia definition a bit more closely:

[rule1: (?e1 nhl:influences ?e2) <-
        (?e1 nhl:localtime ?t1), (?e2 nhl:localtime ?t2),
        le(?t1, ?t2), diff-minute(?t1, ?t2, ?diff),
        lessThan(?diff, 5)  ]

nhl:Penalty  rdfs:subClassOf nhl:EnforcerAction .
nhl:Hit      rdfs:subClassOf nhl:EnforcerAction .

prefix nhl: <http://www.nhl.com/>
construct { _:x a nhl:Enforcement . _:x nhl:game ?g .
            _:x nhl:actor ?p1 . _:x nhl:value ?value }
where {
  select ?g ?p1 (count(?e) as ?value)
  where {
    ?g nhl:play ?e . ?g nhl:play ?e2 .
    ?e nhl:agent1 ?p1 . ?e nhl:agent2 ?p2 .
    ?e2 nhl:agent1 ?p2 . ?e2 a nhl:ViolentPenalty .
    ?e a nhl:EnforcerAction . ?e2 nhl:influences ?e }
  group by ?g ?p1 }

Essentially, I stipulated that an earlier event influences a later event if they take place within five minutes. Best rule ever? Nope. Good enough? Maybe! So we have incidents of enforcement in a game when a Violent Penalty by a player influences an Enforcer Action against that player. Let’s see how this compares to the previous two definitions of enforcement:

The normalization is all off, largely because it’s getting late.  I should in fact normalize this score by all the opportunities for enforcement, which is really all the Violent Penalties against the enforcer’s team.

This dataset is drawn from the first 20 games of the 1230-game season, so this certainly isn’t a representative sample.  The bad news is that 20 games of data is just about as much as my laptop will crunch before I get impatient.  The good news is that I can re-jigger the analysis process to pull enforcement facts out of the season at 10-game increments…sort of like a map-reduce, I guess.  Should be fun!
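
A sketch of what that incremental strategy could look like, assuming the same build/pull helpers used elsewhere on this blog, plus a get-model function for loading games N through M and an enforcement var holding the Enforcement construct query (both hypothetical here):

(defn enforcement-increments [n m]
  ;; pull enforcement facts out of the season in 10-game bins,
  ;; then merge the small result models into one
  (reduce build
          (map (fn [bin]
                 (build (pull enforcement (get-model (first bin) (last bin)))))
               (partition-all 10 (range n (+ m 1))))))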