I use a Win7 machine at home, and do my Clojure work with Notepad++ and an open command prompt.  Most of the time I’m in the REPL, and I get to this via leiningen.  It’s not the most elegant way to do things (I believe that would involve Emacs), but it ain’t bad.

I was just now tapping at the REPL and wanted to go back a few characters while writing out a function call.  Absentmindedly, I used the Emacs keystroke CTRL-b, and bloody hell it worked.  So did CTRL-f and CTRL-e.

Wow.  Considering that I’m sitting on my couch with my laptop where it belongs (on top of my lap), I’m thrilled that I don’t have to get all uncomfortable finding the arrow keys when editing in-line.  Woo!

I was prepping my hockey data today to try out a few machine learning algorithms, and ran into a bit of a pickle.

One of the metrics I wanted to extract was the number of hits per team per game, so I tapped in this query:

(def team-hits "
prefix : <http://www.nhl.com/>
construct { _:x a :HitCount; :team ?team; :game ?g; :value ?value}
{ select ?g ?team (count(?e) as ?value)
  { ?g :team ?team .
     optional { ?g :play ?e  . ?e a :Hit;  :team ?team .}
  }
  group by ?g ?team }
")

The query took a little longer to run than expected, but I chalked that up to cosmic radiation. When I looked at my results, though, I was a bit shocked: over 300,000 answers! Sure, there are 1230 games in a season, and I was extracting 32 facts for each one, but that expected total is a whole order of magnitude lower than what I was seeing.

After a little while full of WTFs and glances at the SPARQL spec for aggregations over optional clauses, it hit me – I had overloaded the nhl:team property.  Both games and events had nhl:team facts (nhl:team is a super-property of both nhl:hometeam and nhl:awayteam):

nhl:game-1 nhl:hometeam nhl:team-10 .
nhl:game-1 nhl:awayteam nhl:team-8 .
nhl:eid-TOR51 nhl:team nhl:team-10 .

I had a choice: either I could redo my JSON parse or my SPARQL query.  Both required about the same amount of effort, but a full parse of all the game files would take over 40 minutes.  Thus, I rewrote the query:

(def team-hits "
prefix : <http://www.nhl.com/>
construct { _:x a :HitCount; :team ?team; :game ?g; :value ?value}
{ select ?g ?team (count(?e) as ?value)
  { ?g a :Game; :team ?team .
     optional { ?g :play ?e  . ?e a :Hit;  :team ?team .}
  }
  group by ?g ?team }
")

To make this work, I had to add facts asserting that each game is an instance of the nhl:Game class.  Fortunately, I had two properties whose domain was precisely hockey games:

nhl:hometeam
  rdfs:domain nhl:Game ;
  rdfs:range nhl:Team ;
  rdfs:subPropertyOf nhl:team .

nhl:awayteam
  rdfs:domain nhl:Game ;
  rdfs:range nhl:Team ;
  rdfs:subPropertyOf nhl:team .
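That is, the domain declarations do the typing for me – an RDFS reasoner takes an existing home-team fact plus the domain axiom and infers the class membership:

nhl:game-1 nhl:hometeam nhl:team-10 .
nhl:hometeam rdfs:domain nhl:Game .
# an RDFS reasoner adds:
nhl:game-1 a nhl:Game .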

As luck would have it, somebody at tonight’s SemWeb meetup group wondered aloud whether anybody found rdfs:domain and rdfs:range useful.  Neat-o.

I was fortunate enough to run across David Manley’s homepage today, which includes a list of very interesting papers.  One paper in particular, A gradable approach to dispositions (with Ryan Wasserman), looks at explaining what it means for one glass to be more fragile than another.

The approach taken involves recognizing two distinct parts of a disposition – its triggers and its manifestations.  Let’s say you enumerate all the ways a disposition can be triggered, and call that the trigger-space for the disposition (my words, not theirs).  You can then base a scale for measuring the magnitude of something’s disposition on the size of the portion of the trigger-space that leads to the thing manifesting the disposition.

For example, let’s take a glass and assess its fragility.  We might define its trigger space based on a range of heights from which we can drop it (e.g. 1 foot, 2 feet, 3 feet) and a set of surfaces (e.g. my pillow, my kitchen table, and the carpet in my living room).  So we have a two-dimensional trigger space with nine possible combinations.  If the glass would shatter from every height on my kitchen table, from 3 feet on my carpet, and never on my pillow, then the measurement would be based on the ratio 4/9 (depending on the choice of units).
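To make the bookkeeping concrete, here’s a toy Clojure sketch of that calculation (the trigger space and the shatters? predicate are invented for the example):

;; a toy sketch: fragility as the fraction of the trigger space
;; in which the glass would shatter
(def heights  [1 2 3])                        ; drop heights, in feet
(def surfaces [:pillow :table :carpet])

(defn shatters? [[h s]]
  (case s
    :table  true          ; breaks from any height on the kitchen table
    :carpet (= h 3)       ; only from 3 feet on the carpet
    :pillow false))       ; never on the pillow

(let [trigger-space (for [h heights, s surfaces] [h s])]
  (/ (count (filter shatters? trigger-space))
     (count trigger-space)))
;=> 4/9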

This is a pretty cool idea that can be immediately applied to an analysis of retaliation.  I enumerate the various ways that a retaliation can be triggered/provoked, and identify what the manifestation can look like (there’s a separate issue whether a disposition can manifest in different ways – I lean towards yes, but I’ll leave the discussion for another time).

Empirical testing for agentive dispositions like retaliation (i.e. dispositions whose manifestations depend on some conscious decision) is a little trickier than for physical dispositions like fragility: you can’t depend on the same triggers leading to the same results every time.  However, the core idea remains the same – by tallying the times that a trigger is followed by a manifestation, a measure of that tendency can be established.
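In code the tally is about as simple as it sounds – a rough sketch over an invented sequence of incidents, each flagged with whether a trigger occurred and whether a retaliation followed:

;; rough sketch: tendency = manifestations / triggers, over made-up incident maps
(defn tendency [incidents]
  (let [triggered  (filter :triggered? incidents)
        manifested (filter :manifested? triggered)]
    (if (empty? triggered)
      0
      (/ (count manifested) (count triggered)))))

(tendency [{:triggered? true  :manifested? true}
           {:triggered? true  :manifested? false}
           {:triggered? false :manifested? false}])
;=> 1/2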

I’ve only read through the paper once, so I could very well be missing some key points.  And I wonder about the ability to define dispositions that don’t have obvious triggers (e.g. the disposition to yawn, which at first glance seems to be a spontaneous event that happens more or less frequently).  Heck, it seems that E.J. Lowe has reason to think that dispositions don’t even have triggers.  Nevertheless, I think Manley and Wasserman’s paper is pretty damned interesting.

A bit of discussion on URIs in the T-Box has been popping up lately on the W3C’s Semantic Web mailing list.  At issue in a thread on “Best Practice for Renaming OWL Vocabulary Elements” is whether T-Box URIs should be somewhat meaningful names, like foo:fatherOf, or meaningless names, like foo:R9814.

There are pros and cons on both sides.  Meaningful T-Box URIs are easier to use when making ontologies and writing queries.  However, it’s very difficult to change a URI after you’ve created it (you may later become dissatisfied with an unforeseen connotation of your URI).  Moreover, you have to accept that the language you’re using (like English or Tagalog) may not be understood by a user.

Meaningless T-Box URIs have the pros and cons reversed – harder for creating ontologies and writing queries, easier for lifecycle management and (in theory) buy-in from non-native speakers.  To sweeten the deal for these meaningless URIs, advocates point out that tools can be written to correct for the difficulties in ontology and query authoring.

This all brings to mind a division of labor in the ontology/semantic web community, which you might call A-ontology and T-ontology (tracking the distinction between the A-Box and T-Box in description logics).

A-ontology is focused on analyzing data, leveraging the implicit semantics found in datasets.  Ontologies are a tool for declaring semantics so that

  • It’s clear what assumptions and interpretations have been made in the analysis
  • A reasoner can read and process the additional semantics

There’s no community effort going on to re-use the ontology, so these ontologies are narrowly purpose-driven.  That’s not to say the semantics come cheap – a data analysis is ill-served by a rush to charts and graphs.

T-ontology is a bit different, primarily focused on sharing re-usable terminology.  The ontology is not a tool, but rather the product of T-ontology work.  Communities of interest and stakeholders factor into the development of the ontology, since they and the people they represent are going to be using the fruits of this labor.

These two kinds of ontology work intersect in practice.  A-ontology will often use data that’s been published with semantics governed by the product of T-ontology.  If significant results are found when performing A-ontology work, the analysts may consider publication of their results, meaning a re-consideration of terminology and alignment with community standards.

A realization of the Linked Data dream of a globally-distributed data space is an undoubtedly good thing.  If meaningless T-Box URIs help this dream along, then we just need to be sure we’re not crimping the style of A-ontology.  If tools have to be written, then they need to fit the workflow of A-ontology before changes to RDF or SPARQL are made (and most modern data analysis takes place at the command line with scripts on the side – GUIs and faceted browsers won’t find a large audience here).

As things currently stand (with RDF 2004 and Sparql 1.1), meaningless URIs would overburden A-ontology efforts.  It’s hard to imagine how I’d productively use the rrdf library (or rdflib+scipy or seabass+incanter or any ontology and data analysis library combination) if I had to deal with meaningless URIs in the T-Box.

I’ve been reading a few articles by Nancy Cartwright (not of Simpsons fame)  on measurement lately.  Her ideas differ from the ones I’ve been using recently by Brian Ellis, who stresses a focus on orderings when devising a measurement system.  Cartwright instead focuses on the quantities driving those orderings, stressing three requirements for measurement: characterization of the quantity, appropriate metrical system, and rules for applying the system in practice.

First, we consider just what this quantity is that we’re talking about – what kinds of things can have the quantity, what values it can have, and how it relates to or is influenced by other qualities and quantities.  For example, the degree to which a player is acting like an enforcer may have multiple dimensions, each of which may lead to an indicator to look for.

Next, a numerical system for conveying the quantity is devised.  This system’s mathematical properties should reflect the quantity’s own properties – if an enforcement measure of 10 is not twice as great as a measure of 5, then common multiplication shouldn’t be present in the numerical system.  To put it another way, 10am is not twice as great as 5am – it’s only 5 hours greater, so addition can be defined in a numerical system for clock time, but not multiplication.
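A toy sketch of that clock-time point:

;; toy sketch: clock times as minutes past midnight – an interval scale
(defn clock->minutes [hour] (* 60 hour))

(- (clock->minutes 10) (clock->minutes 5))
;=> 300, i.e. 5 hours later – a meaningful difference
;; (/ (clock->minutes 10) (clock->minutes 5)) would return 2, but "10am is
;; twice as late as 5am" isn't a claim the quantity supports, so multiplication
;; stays out of the numerical system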

Finally, there needs to be a procedure for figuring out how to actually measure things to get these numbers, such as using a thermometer for temperature or a scale for weight.

An analysis following Cartwright’s requirements would subsequently lead to an interpretation of enforcement in a formal framework that is (more or less) popularly accepted in the scientific community.  So we might first talk about enforcement in terms of retaliations.  Next we search for a framework where something like retaliation has been studied – perhaps a particular game theoretic model for the tragedy of the commons.  Then we interpret our theory of retaliation in terms of this game theoretic model.  If we can derive the characteristics of retaliation inside the model, then we’ve both demonstrated the fit of the model and tied our theory of retaliation to a larger body of scientific work.

It’s an interesting idea, because a deeper understanding of retaliation may help  find better instances of enforcement by refining the original definition.  There’s no guarantee of a pay-off, of course, and your background view of the ontological status of quantities and measurement will color the expectation of profit.  But I have time, so it probably pays off to try both ideas and see if the results differ.  I’m guessing that the next step down Cartwright’s route will be towards her work in causation.

When I first started playing around with Clojure, I was thinking a bit about orderings and how (according to Ellis’ theory of measurement) they’re fundamental to the entire notion of quantity.  Essentially, you can only have a quantity if you can say that something is bigger than something else in some way.

In RDF we represent orderings (like everything) with triples.  Since I only cared about a single ordering relation, it was safe to use ordered pairs to do this.  So, first thing I wrote was a little function to produce a map of ordered pairs.

(defn bar [num]
  ;; map each "integer" keyword to its successor, e.g. {:1 :2, :2 :3, ...}
  (let [a (map #(keyword (str %)) (range 1 (+ num 1)))
        b (map #(keyword (str %)) (range 2 (+ num 2)))]
    (zipmap a b)))

And a function to produce a vector of the elements of those ordered pairs in the proper order:

(use '[clojure.set :only (difference union)])   ; difference here, union further down

(defn infima [m]
  ;; the element that appears as a key but never as a value, i.e. the least element
  (first (difference (set (keys m)) (set (vals m)))))

(defn reduction [m]
  ;; start from the least element and follow the successor map until it runs out
  (loop [L [(infima m)]]
    (let [N (m (peek L))]
      (if (nil? N)
        (vec L)
        (recur (conj L N))))))

For example, I could generate the ordered pairs for the numbers one through five, and then order them up.

(bar 4)
>> {:4 :5, :3 :4, :2 :3, :1 :2}
(reduction (bar 4))
>> [:1 :2 :3 :4 :5]

The performance was really good – I could recover an ordering out of 1,200,000 ordered pairs in 22 seconds.  Woo, right?
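That timing comes from a call along these lines (the exact number will vary by machine):

(time (count (reduction (bar 1200000))))
;; "Elapsed time: ..." – about 22 seconds on my laptop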

Well, since I was doing so well, I thought about the case where I ran into a triple store that declared the ordering relation I cared about to be transitive.  For example, I might want to recover the maternal ancestry line of Amanda Seyfried.  Let’s say I found a triple store with the relevant information, with the relevant property named hasMaternalAncestor.

If no reasoner is working behind the scenes, I could use my reduction function above.  However, I’ll need something else if hasMaternalAncestor is declared to be transitive.  And since linear orderings are transitive by definition, it’s not beyond belief to run into this situation.  So I wrote the following functions to do my dirty work:

(defn agg
  "given a collection of ordered pairs and an element e,
  get a collection of the elements z s/t [e z] is in the collection"
  [element coll]
  (map #(get % 1) (filter #(= (get % 0) element) coll)))

(defn transitive-reduction
  "collect every element mentioned in the ordered pairs, rank each element
  by how many successors it has (most successors first), and return the
  elements in that order as a vector"
  [ordered-pairs]
  (let [elements (union
                   (set (map #(get % 0) ordered-pairs))
                   (set (map #(get % 1) ordered-pairs)))
        ranked   (rseq (vec (into (sorted-map)
                              (zipmap (map #(count (agg % ordered-pairs)) elements)
                                      elements))))
        ordered  (map #(get % 1) ranked)]
    (vec ordered)))

And a new function to generate the appropriate pairs of “integers”:

(defn foo [num]
  ;; every pair [x y] of "integer" keywords with x < y, i.e. the transitive
  ;; closure of the successor ordering on 1..num
  (let [v (vec (range 1 (+ 1 num)))]
    (loop [L #{}, V v]
      (if (empty? V)
        (vec L)
        (recur (union L (set (map #(vector (keyword (str %))
                                           (keyword (str (peek V))))
                                  (pop V))))
               (pop V))))))

So I tried out a transitive reduction of the pairs for the integers from 1 to 1200.  Heck, if I can reduce over a million in 22 seconds, 1200 should be nothing, right?

(println (time (take 10 (transitive-reduction (foo 1200)))))

Three minutes later, I get my truncated answer.  I looked at my code over and over, trying to figure out what I did wrong.  I walked away, took a nap, and came back.  Still three minutes.  Oi!

Finally, I decided to figure out just how hard my algorithm was working.  How many pairs does (foo 1200) produce?  Every pair [x y] with x < y – that’s 1200 x 1199 / 2 = 719,400 pairs.  And transitive-reduction calls agg once per element, scanning the whole collection each time, which works out to roughly 1200 x 719,400, or about 863 million comparisons.

Dang, that’s a lot.  No wonder the poor laptop needed three minutes.  So yeah, counting is important.
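A quick sanity check at the REPL (using the foo from above) bears out the pair count:

(count (foo 1200))
;=> 719400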

I ran into this last night when I was looking at my definition of enforcement.  Initially, the plan was to look at a game and measure the enforcement for each player active in the game, and later aggregate these for each game for each team.

It sounded reasonable at first blush, until my queries and model building were taking way too long.  After too much time spent staring, I finally decided to count the triples:


~30 players x 2 teams x 1230 games x 4 facts per player = 295,200 facts

Yep, way too much.  My whole dataset is 50Mb on disk, and I was producing a 30Mb N-triples file to work with.  Oi.

Having realized the error of my ways, I kept the same enforcement scale but only tracked the players that actually took part in an enforcement.  So instead of 30ish players per team per game, I only tracked 3 or 4.  On top of that, I immediately extracted the team’s enforcement value for that game, meaning I only had


4 players x 2 teams x 1230 games x 4 facts per player = 39,360

Much better.  This may be one of the first lessons I teach my son when he’s old enough to start counting.

And by best, I mean the most physical as measured by hits.  In a previous post, I thought it would be interesting to predict the most physical games in a season based on the previous actions of the players.  So a good first step is to measure each game by its physicality and save off the data.

Unfortunately, the raw data I’m working with doesn’t provide identifiers for the games themselves.  People, teams, and game events (like hits) get id’s, but the games end up with a blank node.  Kind of ruins the whole ‘list the games by physicality’ idea, so on to plan B: pairings.

Since teams get identifiers in the dataset, I could get the home and visiting teams for each game.  Most pairs of teams meet more than twice in a season, so the pairing “New York Rangers at Buffalo Sabres” will include two games, as will “Buffalo Sabres at New York Rangers”.  Not all pairings will be symmetric, though – the Rangers hosted the LA Kings for the only match-up between the two teams in 2010-11.

The game data is broken up into 1230 n-triples files, each corresponding to a game.  I ended up naming them file-1.nt, file-2.nt, etc.  This is convenient for defining functions that build an RDF model from each game from N to M (e.g. file-23 to file-100):

(defn get-data [n m] (map #(str "data/file-" % ".nt") (range n (+ 1 m))))
(defn get-model [n m] (apply build (get-data n m)))
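For example, the file names for the first three games:

(get-data 1 3)
;=> ("data/file-1.nt" "data/file-2.nt" "data/file-3.nt")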

Since physicality is based on the hits recorded in each game, a good place to start is a construct query I called ‘game-hits’.

(def game-hits "
  prefix nhl: <http://www.nhl.com/>
  construct { _:z a nhl:GameHits . _:z nhl:hometeam ?x .
              _:z nhl:awayteam ?y . _:z nhl:game ?g . _:z nhl:value ?v }
  { select ?x ?y ?g (count(?h) as ?v)
    { ?g nhl:hometeam ?x . ?g nhl:awayteam ?y .
      ?g nhl:play ?h . ?h a nhl:Hit . filter(?x != ?y)}
    group by ?x ?y ?g }
")

This construct has to be pulled from each file of game data I have for the 2010-11 season, so I next write two short functions to run through each game file and execute the construct, iteratively building a model of just the facts I need.  This way, I keep the model size down.

(defn build-game-model [s]
  (let [n1 (get-model (first s) (last s))]
    (build (pull game-hits n1))))

(defn build-model [n m binsize]
  (reduce build (map #(build-game-model %)
                     (partition-all binsize (range n (+ m 1))))))

Finally, I’m ready to save off my first batch of summary data: the number of hits for each home-away team pairing, and the number of games that constitute that pairing.

(def teams-hits "
  prefix nhl: <http://www.nhl.com/>
  construct { _:z a nhl:HitsSummary . _:z nhl:hometeam ?x .
              _:z nhl:awayteam ?y . _:z nhl:games ?games .
              _:z nhl:value ?value }
  { select ?x ?y (count(?g) as ?games) (sum(?v) as ?value)
    { ?z a nhl:GameHits . ?z nhl:hometeam ?x .
      ?z nhl:awayteam ?y . ?z nhl:game ?g . ?z nhl:value ?v .
      filter(?x != ?y)}
    group by ?x ?y }
")

I’ve described how I pulled geo-coordinates for each team in a previous post, so I’ll skip the details and note the final construct query that joins the DBpedia data with my raw NHL data:

(def city-coords "
prefix nhl: <http://www.nhl.com/>
prefix dbo: <http://dbpedia.org/ontology/>
prefix dbp: <http://dbpedia.org/property/>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
construct { ?x nhl:latitude ?lat . ?x nhl:longitude ?lon . 
            ?x nhl:cityname ?cityname . ?x nhl:name ?a }
{ ?x a nhl:Team . ?y a dbo:HockeyTeam .
  ?x rdfs:label ?a . ?y rdfs:label ?b . filter(str(?a) = str(?b))
  ?y nhl:latitude ?lat . ?y nhl:longitude ?lon . ?y dbp:city ?cityname}
")

Finally, I have a select query that defines the information I want for my physicality scale (with geo-coordinates thrown in for the home team for kicks):

(def hits-summary-names "
prefix nhl: <http://www.nhl.com/>
select ?visitor ?home (?v / ?g as ?avg) ?lat ?lon
{ ?z a nhl:HitsSummary . ?z nhl:hometeam ?x . 
  ?x nhl:latitude ?lat . ?x nhl:longitude ?lon .
  ?z nhl:awayteam ?y . ?z nhl:games ?g . ?z nhl:value ?v . 
  filter(?x != ?y) .
  ?x nhl:name ?home . ?y nhl:name ?visitor }
order by desc(?avg)
")

At this point I’m ready to run my program, which consists of just a few lines of code:

(stash (pull teams-hits (build-model 1 1230 5)) "files/teams-hits.nt")
(stash (pull team-city-constr dbp) "files/team-city.nt")
(make-geo-facts)
(stash (pull city-coords 
          (build "files/geo-facts.nt" "files/team-city.nt" 
                 "files/ontology.ttl" "files/nhl.rules" (get-model 1 35))) 
  "files/city-coords.nt")
(def m (build "files/city-coords.nt" "files/teams-hits.nt"))
(view (histogram :avg :data (bounce hits-summary-names m) :nbins 5))
(view (bounce hits-summary-names m))

The distribution of physicality per pairing:

The top pairings by physicality:

The top teams by physicality:

The full code can be found on GitHub.  The next step is to figure out what factors can be used to predict this physicality distribution.  Enforcement will be one of them, but it would help to have a slew of others to mix in.  At this point, I’m thinking a principal components analysis would be interesting to run.
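If I go down that road, Incanter’s principal-components function should handle the mechanics.  A rough sketch, assuming a hypothetical Incanter dataset game-features whose columns are per-game numeric predictors (e.g. :enforcement and :hits):

(use '(incanter core stats))

;; game-features is a hypothetical dataset of per-game numeric predictors
(let [m   (to-matrix game-features)
      pca (principal-components m)]
  ;; the standard deviation of each component – a first look at how much
  ;; variance the leading components capture
  (:std-dev pca))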

Reading through the internet’s definition of bigotry, I find myself nodding along as I make a comparison to my views on the Web Ontology Language.  Unfortunately, the definition of bigotry I found makes no mention of weakly-grounded intolerance, which is how I’d describe my own feelings.

There’s clearly nothing wrong with OWL (or OWL-2) as a logical system.  I’ve nothing against disjoint unions, datatype properties, or n-ary datatypes.  There are plenty of reasoners that cater to OWL semantics, and many IDEs (e.g. Protege and TopBraid Composer) weave OWL reasoning right into the developer’s workflow. It’s even defined in the RDF framework, so that using OWL is only a namespace declaration away.

My problem is that I never use OWL.  When working with data, I’ll often write an ontology to hold the semantics I’m adding to a model.  However, that ontology will virtually always be based on RDF/S – classes, domains, ranges, labels, and sequences constitute a pretty powerful palette.  When I want something over and above this, what I usually need is a rule system like Jena rules, SPIN, Common Logic, or RIF.
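For a flavor of what I mean, here’s the sort of Jena rule I end up writing (nhl:hitBy is a made-up property for illustration – the point is just that the rule head combines several body patterns in a way RDF/S alone can’t):

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix nhl: <http://www.nhl.com/>.

[hitBy: (?g nhl:play ?e), (?e rdf:type nhl:Hit), (?e nhl:team ?t)
        -> (?g nhl:hitBy ?t)]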

Bigotry may not be the word I’m looking for – the wiki page in fact makes an interesting suggestion that I may be suffering from RDF Purism.  Indeed, I do feel inexplicably better if I manage to express my ontology using only RDF/S.  Perhaps this is the same kind of irrational tendency that drives some people to prefer hammers to nail guns, others to write Lisp instead of Java, and a few to favor kites over model rockets – a tendency towards simplicity.

I’ve no idea how I’d write a knock-down argument for simplicity in semantic tooling, and it probably comes down to personal preference.  Lots of people like RDF/XML, TopBraid Composer, and Java, and I’d never suggest anything to be wrong with this tool kit.  Me, I like Turtle, Notepad++, Clojure, and a REPL.

I was all ready to grab some geo-data and see where I could find a good physical game of hockey.  I’d browsed dbpedia and saw how to connect NHL teams, the cities they play in, and the lat-long coordinates of each.  I was so optimistic.

Then it turns out that DBpedia wanted me to do a special dance.  I’ve been quite happy with the features of the upcoming Sparql 1.1 spec.  Since ARQ stays on top of the spec, I’ve managed to forget what Sparql 1.0 was missing.  Well, ‘if’ clauses for one, but I managed to design around that in my last post.  A real sticking point, though, was the inability to wrap a construct query around a select query, like so:

prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dbo: <http://dbpedia.org/ontology/>
prefix dbp: <http://dbpedia.org/property/>
prefix nhl: <http://www.nhl.com/>
construct { ?team nhl:latitude ?v . ?team nhl:name ?name }
{ select ?team ?name ( 1 * (?d + (((?m * 60) + ?s) / 3600.0)) as ?v)
  { ?team a dbo:HockeyTeam . ?team rdfs:label ?name .
    ?team dbp:city ?cityname . ?city rdfs:label ?cityname .
    ?city dbp:latd ?d; dbp:latm ?m; dbp:lats ?s .
    filter ( lang(?name) = 'en') }}

The reason this is critical is that you can’t inject those arithmetic expressions into a construct clause.  And since I plan on working with the resulting data using Sparql, simply using select queries isn’t going to do it.

Thus, we need to break down the steps a bit more finely.  First, I’ll pull out the basic triples I intend to work with:

prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dbo: <http://dbpedia.org/ontology/>
prefix dbp: <http://dbpedia.org/property/>
prefix nhl: <http://www.nhl.com/>
construct { ?team a dbo:HockeyTeam . ?team rdfs:label ?name .
                ?team dbp:city ?cityname . ?city rdfs:label ?cityname .
                ?city dbp:latd ?d; dbp:latm ?m; dbp:lats ?s . }
   {   ?team a dbo:HockeyTeam . ?team rdfs:label ?name .
        ?team dbp:city ?cityname . ?city rdfs:label ?cityname .
        ?city dbp:latd ?d; dbp:latm ?m; dbp:lats ?s .
        filter ( lang(?name) = 'en') }

And crap – the data doesn’t fit very well.  Looks like the city names associated with hockey teams don’t cleanly match up to the cities we’re looking for in DBpedia.  Time for a second refactor…

After a few minutes of staring at the ceiling, I realized that I could use Google’s geocoding service to do my bidding.  Since their daily limit is 2500 requests, my measly 50ish cities would be well under.  So first, I grab just the info I need out of DBpedia – hockey teams and the cities they’re associated with:

 prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
 prefix dbo: <http://dbpedia.org/ontology/>
 prefix dbp: <http://dbpedia.org/property/>
 prefix nhl: <http://www.nhl.com/>
 construct { ?team a dbo:HockeyTeam . ?team rdfs:label ?name .
                 ?team dbp:city ?cityname . ?city rdfs:label ?cityname . }
      { ?team a dbo:HockeyTeam . ?team rdfs:label ?name .
         ?team dbp:city ?cityname . ?city rdfs:label ?cityname .
         filter ( lang(?name) = 'en') }

And use this query with a bit of Clojure to pull out my geocoding facts, saving them off as an n-triples file:

prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dbo: <http://dbpedia.org/ontology/>
prefix dbp: <http://dbpedia.org/property/>
prefix nhl: <http://www.nhl.com/>
select distinct ?team ?name ?city ?cityname
{ ?team a dbo:HockeyTeam . ?team rdfs:label ?name .
  ?team dbp:city ?cityname . ?city rdfs:label ?cityname .
  filter ( lang(?name) = 'en') }

;; assumes clojure.string aliased as string and a JSON library
;; (e.g. clojure.data.json) aliased as json
(defn get-geo-fact [row]
  (let [n   (string/replace (:cityname row) " " "+")
        x   (json/read-json (slurp (str
               "http://maps.googleapis.com/maps/api/geocode/json?address="
               n
               "&sensor=false")))
        g   (:location (:geometry (first (:results x))))
        lat (str "<" (:city row) ">"
                 " <http://www.nhl.com/latitude> "
                 (:lat g) " .")
        lon (str "<" (:city row) ">"
                 " <http://www.nhl.com/longitude> "
                 (:lng g) " .")]
    [lat lon]))

(defn make-geo-facts []
  (let [a (bounce team-city dbp)
        f "files/geo-facts.nt"]
    (spit f (string/join "\n" (flatten (map get-geo-fact (:rows a)))))))

The results are created with the following two calls at the REPL:

(stash (pull team-city-constr dbp) "files/team-city.nt")
(make-geo-facts)

Now that I have geo-data, I can finish with hits-per-game as a rough cut at a physicality scale for games, and see where the action was this season.  I wonder if the Islanders and Rangers still go at it.

This post’ll be a short one tonight, since I just had to watch Adventureland (and thank god I did – that’s a great movie).

The analysis of enforcement doesn’t seem well-suited to a statistical approach, and I think it’s because there’s no pattern or distribution to be explained or predicted.  There aren’t any awards for enforcement, and top-ten lists of enforcers aren’t sufficiently … important? … to warrant prediction.

However, predicting which cities in the NHL will host the games with the most physical action is undeniably of public benefit.  As the League has cracked down on fighting and physical play over the years, it’s become increasingly important to find out where the good physical games will be played.

Although consideration for a scale of game physicality is appropriate, for now we’ll just tally the number of hits per game between each pairing of teams:

prefix nhl: <http://www.nhl.com/>
construct {
  _:z a nhl:HitsSummary . _:z nhl:participant ?x .
  _:z nhl:participant ?y . _:z nhl:value ?v }
  {  select ?x ?y (count(?h) as ?v)
     {  ?g nhl:hometeam ?x . ?g nhl:awayteam ?y .
        ?g nhl:play ?h . ?h a nhl:Hit . }
  group by ?x ?y	}

I’d like to take these results and tie them to the geo-coordinates of each team’s home city, so that I can generate a map depicting the relative physicality of each city that season.

To do this, I can use DBpedia’s endpoint to grab each team’s city’s lat/long.  On inspection, it seems that the DBpedia data provides the relevant geo-coordinates in degree-minute-second form.  I’d much rather have them in decimal form (so that I can associate a single value for each of latitude and longitude with a team), so a little arithmetic is in order.  Fortunately, Sparql 1.1 allows arithmetic expressions in select clauses.  Here’s the longitude query (the latitude query is similar):

prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dbo: <http://dbpedia.org/ontology/>
prefix dbp: <http://dbpedia.org/property/>
prefix nhl: <http://www.nhl.com/>
construct { ?team nhl:longitude ?v . ?team nhl:name ?name }
  {  select ?team ?name
            ( -1 * (?d + (((?m * 60) + ?s) / 3600.0)) as ?v)
     {  ?team a dbo:HockeyTeam . ?team rdfs:label ?name .
        ?team dbp:city ?cityname . ?city rdfs:label ?cityname .
        ?city dbp:longd ?d; dbp:longm ?m; dbp:longs ?s .
        filter ( lang(?name) = 'en') }}

The DBpedia server is a little persnickety about memory, so it complains if I try to ask for lat and long in the same query.

You might be wondering about that -1 in the front of that arithmetic.  Longitude is positive or negative depending on whether it’s east or west of the Prime Meridian.  Since all the NHL hockey teams are west of the Meridian, the decimal longitude is negative.  If the DBpedia endpoint were compliant with the very latest Sparql 1.1 spec, I could have used an IF operator to interpret the East/West part of the coordinate.  However, it seems that feature isn’t implemented yet in DBpedia’s server, so this’ll have to do.
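For the record, had IF been available, the select expression might have looked something like this, with ?ew standing in for a hypothetical east/west flag pulled from the data:

( IF(?ew = "W", -1, 1) * (?d + (((?m * 60) + ?s) / 3600.0)) as ?v )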

All that’s left is to query for each game, its home and away teams, and join those results with the relevant coordinates and hits-summary.  That query might take a few minutes for a whole season, but it’s a static result and thus safe to save off as an RDF file for quick retrieval later.

What does this have to do with enforcement?  Well, with next season’s schedule and the current rosters, we might be able to predict this physicality distribution based on the enforcement of the players on each team for each game.  Perhaps teams with higher enforcement correlate with more physical games.  Or, maybe a game tends to have higher physicality when one team has much more enforcement than the other.  And with historical data from past seasons, there should be some opportunity for verification.  And maybe even neat maps.