I was prepping my hockey data today to try out a few machine learning algorithms, and ran into a bit of a pickle.

One of the metrics I wanted to extract was the number of hits per team per game, so I tapped in this query:

(def team-hits "
prefix : <http://www.nhl.com/>
construct { _:x a :HitCount; :team ?team; :game ?g; :value ?value}
{ select ?g ?team (count(?e) as ?value)
  { ?g :team ?team .
     optional { ?g :play ?e  . ?e a :Hit;  :team ?team .}
  }
  group by ?g ?team }
")

The query took a little longer to run than expected, but I chalked that up to cosmic radiation. When I looked at my results, though, I was a bit shocked: over 300,000 answers! Sure, there are 1230 games in a season, and I was extracting 32 facts for each one (about 39,000 facts in all), but that’s a whole order of magnitude lower than what I was seeing.

After a little while full of WTFs and glances at the SPARQL spec for aggregations over optional clauses, it hit me – I had overloaded the nhl:team property. Both games and events had nhl:team facts (nhl:team is a super-property of both nhl:hometeam and nhl:awayteam), so the pattern ?g :team ?team bound ?g to every event as well as to every game:

nhl:game-1 nhl:hometeam nhl:team-10 .
nhl:game-1 nhl:awayteam nhl:team-8 .
nhl:eid-TOR51 nhl:team nhl:team-10 .

I had a choice: either I could redo my JSON parse or my SPARQL query.  Both required about the same amount of effort, but a full parse of all the game files would take over 40 minutes.  Thus, I rewrote the query:

(def team-hits "
prefix : <http://www.nhl.com/>
construct { _:x a :HitCount; :team ?team; :game ?g; :value ?value}
{ select ?g ?team (count(?e) as ?value)
  { ?g a :Game; :team ?team .
     optional { ?g :play ?e  . ?e a :Hit;  :team ?team .}
  }
  group by ?g ?team }
")
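The difference between the two queries can be sketched in plain Python, without any RDF library. This is only a toy model under stated assumptions: the `bindings` helper is hypothetical, the nhl:play and type triples are invented for illustration, and the nhl:team facts on the game are assumed to have been inferred already from nhl:hometeam and nhl:awayteam.

```python
# Toy triple set modelled on the facts in this post.
# Assumption: the game's nhl:team facts were already inferred from
# nhl:hometeam / nhl:awayteam; the nhl:play and rdf:type triples are made up.
TRIPLES = {
    ("nhl:game-1", "rdf:type", "nhl:Game"),
    ("nhl:game-1", "nhl:team", "nhl:team-10"),
    ("nhl:game-1", "nhl:team", "nhl:team-8"),
    ("nhl:game-1", "nhl:play", "nhl:eid-TOR51"),
    ("nhl:eid-TOR51", "rdf:type", "nhl:Hit"),
    ("nhl:eid-TOR51", "nhl:team", "nhl:team-10"),
}


def bindings(triples, require_game=False):
    """Return sorted (?g, ?team) pairs matching `?g :team ?team`.

    With require_game=True the subject must also match `?g a :Game`,
    mirroring the extra pattern in the rewritten query.
    """
    games = {s for (s, p, o) in triples if p == "rdf:type" and o == "nhl:Game"}
    return sorted(
        (s, o)
        for (s, p, o) in triples
        if p == "nhl:team" and (not require_game or s in games)
    )


loose = bindings(TRIPLES)                      # first query's pattern
strict = bindings(TRIPLES, require_game=True)  # rewritten query's pattern

print(len(loose))   # 3: two rows for the game plus one for the hit event
print(len(strict))  # 2: only the game's two teams remain
```

In the loose version every event with an nhl:team fact gets its own group, and since such a group has no plays of its own, it ends up as a HitCount with a value of zero. Multiply that by every event in a season and the blow-up in results is no longer surprising.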

To make this work, I had to add the fact that each game is an instance of the nhl:Game class. Fortunately, I had two properties whose domain was precisely hockey games, so those facts follow from the schema:

nhl:hometeam
  rdfs:domain nhl:Game ;
  rdfs:range nhl:Team ;
  rdfs:subPropertyOf nhl:team .

nhl:awayteam
  rdfs:domain nhl:Game ;
  rdfs:range nhl:Team ;
  rdfs:subPropertyOf nhl:team .
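As a rough illustration of what these declarations buy you, here is a minimal Python sketch of the two RDFS entailment rules involved: a property's rdfs:domain types its subject, and rdfs:subPropertyOf copies a fact up to the super-property. The `infer` function is a hypothetical helper, not how any particular triple store implements reasoning, and rdfs:range is left out to keep it short.

```python
# Schema and data triples taken from the snippets in this post.
SCHEMA = {
    ("nhl:hometeam", "rdfs:domain", "nhl:Game"),
    ("nhl:hometeam", "rdfs:subPropertyOf", "nhl:team"),
    ("nhl:awayteam", "rdfs:domain", "nhl:Game"),
    ("nhl:awayteam", "rdfs:subPropertyOf", "nhl:team"),
}
DATA = {
    ("nhl:game-1", "nhl:hometeam", "nhl:team-10"),
    ("nhl:game-1", "nhl:awayteam", "nhl:team-8"),
    ("nhl:eid-TOR51", "nhl:team", "nhl:team-10"),
}


def infer(schema, data):
    """Apply the domain and subPropertyOf rules once over the data.

    One pass is enough for this toy schema: there are no chained
    sub-properties, and nhl:team itself has no declared domain.
    """
    domains = {(s, o) for (s, p, o) in schema if p == "rdfs:domain"}
    supers = {(s, o) for (s, p, o) in schema if p == "rdfs:subPropertyOf"}
    inferred = set(data)
    for (s, p, o) in data:
        for (prop, cls) in domains:
            if p == prop:
                inferred.add((s, "rdf:type", cls))  # domain rule
        for (sub, sup) in supers:
            if p == sub:
                inferred.add((s, sup, o))           # sub-property rule
    return inferred


closure = infer(SCHEMA, DATA)
for triple in sorted(closure):
    print(triple)
```

The game picks up its nhl:Game type and both nhl:team facts, while the hit event stays untyped because nhl:team has no domain of its own. That asymmetry is exactly what lets the `?g a :Game` pattern filter the events out.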

As luck would have it, somebody at tonight’s SemWeb meetup group wondered aloud whether anybody found rdfs:domain and rdfs:range useful.  Neat-o.
