
In the late 1970s, the notion of an affordance was coined by J.J. Gibson to describe the manner in which an individual’s environment plays a part in explaining the possible actions an individual can take. For example, a dog can drink water in a forest cut by a stream in a manner that would be impossible for that dog in a desert. Likewise, a tall tree provides an opportunity for a man to see beyond a nearby hill, an opportunity that fades when the tree is cut down.

An affordance, according to [Chemero 2000], is an “immediate opportunity for behavior” by an organism. To put it another way, affordances are relationships between agents and their environments that have some causal impact on how the agent acts. Moreover, these affordances are part of the way that agents look at the world: when you see a flight of stairs going up, you see a way to climb to a higher floor. For ecological psychologists like Chemero and Gibson, perception is not restricted to sights, sounds, tastes, touches, and smells. Beyond sense perception, agents see the things around them as instruments to be manipulated.

To make things sound more odd to the unacquainted, affordances are neither objective nor subjective relationships. Because an affordance is a relationship between an agent and something else, and the removal of the agent removes the relationship altogether, an affordance is not an objective thing that can be studied in isolation of its subjects. If you take the animals out of a forest, the stream running through it no longer affords the possibility of taking a drink: there can’t be any drinks if there are no drinkers. However, the existence of an affordance does not solely reside in the mind of an agent – it’s not subjective in the sense of a thought or emotion. Regardless of whether you believe that a dog can drink water without a source of water nearby, the dog will still be thirsty.

The odd nature of affordances can be better understood by considering the claim in [Chemero 2001] that affordance descriptions are not predicates – no property is said to inhere in any object in the environment. When we describe a stream as affording the opportunity to take a drink, we can’t merely stop with a formal representation that looks like

(affords-taking-a-drink   the-stream   that-dog)

Part of the reason is that it’s not just the stream itself as an object that affords drinking for the dog. The edge of the stream, for example, can’t be a five-foot cliff that would frustrate the dog’s efforts. The stream can’t be crawling with crocodiles, nor can it be frozen over. There are lots of features in the environment that together determine whether or not the dog can quench its thirst – the stream is only a convenient target for predication, not an accurate one.

Unsurprisingly, placement matters when it comes to possible activities. Some actions possible in a New Jersey motel are impossible on the surface of the moon. Thus, a principled way of describing affordances is needed that does representational justice to place and environment while avoiding the bog of fanciful what-ifs (heights, crocodiles, and ice).


Following Bittner 2011, a spatial region can’t simply be said to have a quality like forested or polluted when the intended meaning is that the quality is distributed across the region.  That is, we expect a forested acre of land to be covered by trees; the discovery of only two small trees on that acre would smack of deceit.  So by a distributed quality of a spatial region, we understand that quality to be found throughout most, if not all, of the region’s parts.

The problem is, the parts of a region aren’t exactly enumerable – you can always specify finer and finer levels of precision by which to identify parts.  It turns out that the precision you end up using goes a long way in determining whether a (distributed) quality is actually present.  So we define the presence of a quality in a region with the notion of granularity-sensitive homogeneity: Given a granularity W, a quality Q, and a breakdown of a region R into parts, then R is homogeneous with respect to Q when the area of R is roughly the same size (based on W) as the total area of R‘s sub-regions that have Q.

For example, let’s imagine a big tract of mostly-forested land.  We break down that tract into parts (like cells in a raster), and decide for each cell whether we’d consider it forested or not (say, using remote sensing).  Turns out that we have a few small ponds on our land that each take up one of our parts/cells.  To determine whether we can attribute the quality of being forested to this region, we compare its area to the area of parts/cells that are determined to be forested.  The answer depends on the level of precision we use, which is tossed into a simple formula to provide our answer.
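To make that concrete, here's a rough Clojure sketch of the raster check.  This isn't Bittner's formalism; the cell representation and the tolerance threshold are stand-ins of my own.

;; A rough sketch of the raster check (not Bittner's formalism): cells is a seq
;; of maps like {:area 1 :has-quality? true}, and tolerance stands in for the
;; granularity W.  The region counts as having the distributed quality when the
;; qualifying area makes up at least (1 - tolerance) of the total area.
(defn homogeneous? [cells tolerance]
  (let [total      (reduce + (map :area cells))
        qualifying (reduce + (map :area (filter :has-quality? cells)))]
    (>= (/ qualifying total) (- 1 tolerance))))

;; e.g. 98 forested cells and 2 pond cells of equal size, with a 5% tolerance:
;; (homogeneous? (concat (repeat 98 {:area 1 :has-quality? true})
;;                       (repeat 2  {:area 1 :has-quality? false}))
;;               0.05)
;; => true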

Distributed qualities like forested are interesting to contrast with a lot of the commonplace holistic qualities we usually run into, like being electronic, sleepy, or profitable.  Distributed qualities depend quite a bit on the qualities of their parts in a way that’s irrelevant to holistic qualities.  However, some qualities of spatial regions are neither distributed nor holistic: the quality of being a habitat for badgers depends on processes that relate some parts of a region in a variety of ways.

And two months later, I’ve got both a new daughter and a new project.  Huzzah!

This new project builds on Tom Bittner‘s work on vagueness and granularity in geographic regions.  A recent paper of his presents a formal system for classifying geographic regions based on their qualities.

It turns out that things get tricky when you want to apply a quality (like forested) to something like a region, because we don’t expect every part of that region to harbor a tree.  A bit more practically, we don’t expect a reasonable raster of a forested region to necessarily have trees in every cell – a raster where 99 out of 100 cells are forested is probably good enough.  What Tom does is give this intuition a rigorous formal treatment.

So I started a new project on Github called kraken where I’ll write up an implementation of his system in Clojure.  For the geoinformatics portion of this code, I’m taking a look at the GeoScript libraries; it looks like someone’s even started up a Clojure GeoScript library.

The initial phase of kraken is aimed at producing a faithful implementation of Tom’s system.  After this, I want to open up the classification of geographic regions to affordances.  I’m going to take a long look at what it means to associate an affordance (such as a habitat) with a region, which means I’ll take the time to write about how affordances tie into theories of dispositions, occurrences, and qualities.  I have a feeling that I’ll end up writing kraken in more of a logic-programming way, possibly with Jena rules via seabass.  Part of the question there is whether RDF-Sparql will make sense for doing the semantic work needed.  Can’t wait to find out.

I was prepping my hockey data today to try out a few machine learning algorithms, and ran into a bit of a pickle.

One of the metrics I wanted to extract was the number of hits per team per game, so I tapped in this query:

(def team-hits "
prefix : <http://www.nhl.com/>
construct { _:x a :HitCount; :team ?team; :game ?g; :value ?value}
{ select ?g ?team (count(?e) as ?value)
  { ?g :team ?team .
     optional { ?g :play ?e  . ?e a :Hit;  :team ?team .}
  }
  group by ?g ?team }
")

The query took a little longer to run than expected, but I chalked that up to cosmic radiation. When I looked at my results, though, I was a bit shocked: over 300,000 answers! Sure, there are 1230 games in a season, and I was extracting 32 facts for each one, but that’s a whole order of magnitude lower than what I was seeing.

After a little while full of WTFs and glances at the SPARQL spec for aggregations over optional clauses, it hit me – I had overloaded the nhl:team property.  Both games and events had nhl:team facts (nhl:team is a super-property of both nhl:hometeam and nhl:awayteam), so the ?g in my query was happily binding to every event with a team, not just to the games:

nhl:game-1 nhl:hometeam nhl:team-10 .
nhl:game-1 nhl:awayteam nhl:team-8 .
nhl:eid-TOR51 nhl:team nhl:team-10 .

I had a choice: either I could redo my JSON parse or my SPARQL query.  Both required about the same amount of effort, but a full parse of all the game files would take over 40 minutes.  Thus, I rewrote the query:

(def team-hits "
prefix : <http://www.nhl.com/>
construct { _:x a :HitCount; :team ?team; :game ?g; :value ?value}
{ select ?g ?team (count(?e) as ?value)
  { ?g a :Game; :team ?team .
     optional { ?g :play ?e  . ?e a :Hit;  :team ?team .}
  }
  group by ?g ?team }
")

To make this work, I had to add the facts that each game was in fact an instance of the nhl:Game class.  Fortunately, I had two properties whose domain was precisely hockey games:

nhl:hometeam
  rdfs:domain nhl:Game ;
  rdfs:range nhl:Team ;
  rdfs:subPropertyOf nhl:team .

nhl:awayteam
  rdfs:domain nhl:Game ;
  rdfs:range nhl:Team ;
  rdfs:subPropertyOf nhl:team .
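Incidentally, the inference doing the heavy lifting here is plain RDFS domain reasoning.  As a rough sketch (the real mechanism could just as well be a stock RDFS reasoner rather than a hand-written rule), the equivalent Jena rule would look something like:

# whenever a property with an rdfs:domain is used, type its subject accordingly;
# with the axioms above, any ?g carrying an nhl:hometeam or nhl:awayteam
# ends up typed as an nhl:Game
[domainRule: (?p rdfs:domain ?c), (?s ?p ?o) -> (?s rdf:type ?c)]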

As luck would have it, somebody at tonight’s SemWeb meetup group wondered aloud whether anybody found rdfs:domain and rdfs:range useful.  Neat-o.

And by best, I mean the most physical as measured by hits.  In a previous post, I thought it would be interesting to predict the most physical games in a season based on the previous actions of the players.  So a good first step is to measure each game by its physicality and save off the data.

Unfortunately, the raw data I’m working with doesn’t provide identifiers for the games themselves.  People, teams, and game events (like hits) get id’s, but the games end up with a blank node.  Kind of ruins the whole ‘list the games by physicality’ idea, so on to plan B: pairings.

Since teams get identifiers in the dataset, I could get the home and visiting teams for each game.  Most pairs of teams will meet more than twice in a season, so a pairing for “New York Rangers at Buffalo Sabres” will include two games, as will “Buffalo Sabres at New York Rangers”.  Not all pairings will be symmetric, though – the Rangers hosted the LA Kings for the only match-up between the two teams in 2010-11.

The game data is broken up into 1230 n-triples files, each corresponding to a game.  I ended up naming them file-1.nt, file-2.nt, etc.  This is convenient for defining functions that build an RDF model from the game files numbered N to M (e.g. file-23 to file-100):

(defn get-data [n m] (map #(str "data/file-" % ".nt") (range n (+ 1 m))))
(defn get-model [n m] (apply build (get-data n m)))
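For example, asking for the first three games:

(get-data 1 3)   ;=> ("data/file-1.nt" "data/file-2.nt" "data/file-3.nt")
(get-model 1 3)  ; one model built from those three files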

Since physicality is based on the hits recorded in each game, a good place to start is a construct query I called ‘game-hits’.

(def game-hits "
  prefix nhl: <http://www.nhl.com/>
  construct { _:z a nhl:GameHits . _:z nhl:hometeam ?x .
	      _:z nhl:awayteam ?y . _:z nhl:game ?g . _:z nhl:value ?v }
	{ select ?x ?y ?g (count(?h) as ?v)
	  {  ?g nhl:hometeam ?x . ?g nhl:awayteam ?y .
	     ?g nhl:play ?h . ?h a nhl:Hit . filter(?x != ?y)}
	  group by ?x ?y ?g }
")

This construct has to be pulled from each file of game data I have for the 2010-11 season, so I next write two short functions to run through each game file and execute the construct, iteratively building a model of just the facts I need.  This way, I keep the model size down.

(defn build-game-model [s]
  ;; s is a seq of game indices: build a model from those game files,
  ;; then keep only the triples produced by the game-hits construct
  (let [n1 (get-model (first s) (last s))]
    (build (pull game-hits n1))))

(defn build-model [n m binsize]
  ;; work through games n..m in bins of binsize, merging the per-bin summaries
  (reduce build (map build-game-model (partition-all binsize (range n (+ m 1))))))
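So a call like the following would summarize games 1 through 10 in two bins of five (the bin size just keeps any single in-memory model from getting too large):

(build-model 1 10 5)   ; games 1-5 and 6-10 are summarized separately, then merged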

Finally, I’m ready to save off my first batch of summary data: the number of hits for each home-away team pairing, and the number of games that constitute that pairing.

(def teams-hits "
prefix nhl: <http://www.nhl.com/>
construct { _:z a nhl:HitsSummary . _:z nhl:hometeam ?x .
	    _:z nhl:awayteam ?y . _:z nhl:games ?games . 
        _:z nhl:value ?value }
	{ select ?x ?y (count(?g) as ?games) (sum(?v) as ?value)
	  { ?z a nhl:GameHits . ?z nhl:hometeam ?x .
	    ?z nhl:awayteam ?y . ?z nhl:game ?g . ?z nhl:value ?v . 
        filter(?x != ?y)}
	group by ?x ?y	}
")

I’ve described how I pulled geo-coordinates for each team in a previous post, so I’ll skip the details and note the final construct query that joins the DBpedia data with my raw NHL data:

(def city-coords "
prefix nhl: <http://www.nhl.com/>
prefix dbo: <http://dbpedia.org/ontology/>
prefix dbp: <http://dbpedia.org/property/>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
construct { ?x nhl:latitude ?lat . ?x nhl:longitude ?lon . 
            ?x nhl:cityname ?cityname . ?x nhl:name ?a }
{ ?x a nhl:Team . ?y a dbo:HockeyTeam .
  ?x rdfs:label ?a . ?y rdfs:label ?b . filter(str(?a) = str(?b))
  ?y nhl:latitude ?lat . ?y nhl:longitude ?lon . ?y dbp:city ?cityname}
")

Finally, I have a select query that defines the information I want for my physicality scale (with geo-coordinates thrown in for the home team for kicks):

(def hits-summary-names "
prefix nhl: <http://www.nhl.com/>
select ?visitor ?home (?v / ?g as ?avg) ?lat ?lon
{ ?z a nhl:HitsSummary . ?z nhl:hometeam ?x . 
  ?x nhl:latitude ?lat . ?x nhl:longitude ?lon .
  ?z nhl:awayteam ?y . ?z nhl:games ?g . ?z nhl:value ?v . 
  filter(?x != ?y) .
  ?x nhl:name ?home . ?y nhl:name ?visitor }
order by desc(?avg)
")

At this point I’m ready to run my program, which consists of just a few lines of code:

(stash (pull teams-hits (build-model 1 1230 5)) "files/teams-hits.nt")
(stash (pull team-city-constr dbp) "files/team-city.nt")
(make-geo-facts)
(stash (pull city-coords 
          (build "files/geo-facts.nt" "files/team-city.nt" 
                 "files/ontology.ttl" "files/nhl.rules" (get-model 1 35))) 
  "files/city-coords.nt")
(def m (build "files/city-coords.nt" "files/teams-hits.nt"))
(view (histogram :avg :data (bounce hits-summary-names m) :nbins 5))
(view (bounce hits-summary-names m))

The distribution of physicality per pairing:

The top pairings by physicality:

The top teams by physicality:

The full code can be found on GitHub.  The next step is to figure out what factors can be used to predict this physicality distribution.  Enforcement will be one of them, but it would help to have a slew of others to mix in.  At this point, I’m thinking a principal components analysis would be interesting to run.

I was all ready to grab some geo-data and see where I could find a good physical game of hockey.  I’d browsed dbpedia and saw how to connect NHL teams, the cities they play in, and the lat-long coordinates of each.  I was so optimistic.

Then it turns out that DBpedia wanted me to do a special dance.  I’ve been quite happy with the features of the upcoming Sparql 1.1 spec.  Since ARQ stays on top of the spec, I’ve managed to forget what Sparql 1.0 was missing.  Well, ‘if’ clauses for one, but I managed to design around that in my last post.  A real sticking point, though, was the inability to wrap a construct query around a select query, like so:

prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dbo: <http://dbpedia.org/ontology/>
prefix dbp: <http://dbpedia.org/property/>
prefix nhl: <http://www.nhl.com/>
construct { ?team nhl:latitude ?v . ?team nhl:name ?name }
{	select ?team ?name ( 1 * (?d + (((?m * 60) + ?s) / 3600.0)) as ?v)
	{ 	  ?team a dbo:HockeyTeam . ?team rdfs:label ?name .
		  ?team dbp:city ?cityname . ?city rdfs:label ?cityname .
		  ?city dbp:latd ?d; dbp:latm ?m; dbp:lats ?s .
		  filter ( lang(?name) = 'en') }}

The reason this is critical is that you can’t inject those arithmetic expressions into a construct clause.  And since I plan on working with the resulting data using Sparql, simply using select queries isn’t going to do it.

Thus, we need to break down the steps a bit more finely.  First, I’ll pull out the basic triples I intend to work with:

prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dbo: <http://dbpedia.org/ontology/>
prefix dbp: <http://dbpedia.org/property/>
prefix nhl: <http://www.nhl.com/>
construct { ?team a dbo:HockeyTeam . ?team rdfs:label ?name .
                ?team dbp:city ?cityname . ?city rdfs:label ?cityname .
                ?city dbp:latd ?d; dbp:latm ?m; dbp:lats ?s . }
   {   ?team a dbo:HockeyTeam . ?team rdfs:label ?name .
        ?team dbp:city ?cityname . ?city rdfs:label ?cityname .
        ?city dbp:latd ?d; dbp:latm ?m; dbp:lats ?s .
        filter ( lang(?name) = 'en') }

And crap – the data doesn’t fit very well.  Looks like the city names associated with hockey teams don’t cleanly match up to the cities we’re looking for in DBpedia.  Time for a second refactor…

After a few minutes of staring at the ceiling, I realized that I could use Google’s geocoding service to do my bidding.  Since their daily limit is 2500 requests, my measly 50ish cities would be well under.  So first, I grab just the info I need out of DBpedia – hockey teams and the cities they’re associated with:

 prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
 prefix dbo: <http://dbpedia.org/ontology/>
 prefix dbp: <http://dbpedia.org/property/>
 prefix nhl: <http://www.nhl.com/>
 construct { ?team a dbo:HockeyTeam . ?team rdfs:label ?name .
                 ?team dbp:city ?cityname . ?city rdfs:label ?cityname . }
      { ?team a dbo:HockeyTeam . ?team rdfs:label ?name .
         ?team dbp:city ?cityname . ?city rdfs:label ?cityname .
         filter ( lang(?name) = 'en') }

And use this query with a bit of Clojure to pull out my geocoding facts, saving them off as an n-triples file:

prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dbo: <http://dbpedia.org/ontology/>
prefix dbp: <http://dbpedia.org/property/>
prefix nhl: <http://www.nhl.com/>
select distinct ?team ?name ?city ?cityname
{ ?team a dbo:HockeyTeam . ?team rdfs:label ?name .
  ?team dbp:city ?cityname . ?city rdfs:label ?cityname .
  filter ( lang(?name) = 'en') }

(defn get-geo-fact [row]
  ;; look up the lat/long of a team's city via Google's geocoding service
  ;; and return two n-triples statements for that city
  (let [n   (string/replace (:cityname row) " " "+")
        x   (json/read-json (slurp (str
               "http://maps.googleapis.com/maps/api/geocode/json?address="
               n
               "&sensor=false")))
        g   (:location (:geometry (first (:results x))))
        lat (str "<" (:city row) ">"
                 " <http://www.nhl.com/latitude> "
                 (:lat g) " ." )
        lon (str "<" (:city row) ">"
                 " <http://www.nhl.com/longitude> "
                 (:lng g) " ." )]
    [lat lon]))

(defn make-geo-facts []
  ;; run the team-city select against DBpedia and write the geocoded
  ;; facts out as files/geo-facts.nt
  (let [a (bounce team-city dbp)
        f "files/geo-facts.nt"]
    (spit f (string/join "\n" (flatten (map get-geo-fact (:rows a)))))))

The results are created with the following two calls at the REPL:

(stash (pull team-city-constr dbp) "files/team-city.nt")
(make-geo-facts)

Now that I have geo-data, I can finish with hits-per-game as a rough cut at a physicality scale for games, and see where the action was this season.  I wonder if the Islanders and Rangers still go at it.

This post’ll be a short one tonight, since I just had to watch Adventureland (and thank god I did – that’s a great movie).

The analysis of enforcement doesn’t seem well-suited to a statistical approach, and I think it’s because there’s no pattern or distribution to be explained or predicted.  There aren’t any awards for enforcement, and top-ten lists of enforcers aren’t sufficiently … important? … to warrant prediction.

However, predicting which cities in the NHL will host the games with the most physical action is undeniably of public benefit.  As the League has cracked down on fighting and physical play over the years, it’s become increasingly important to find out where the good physical games will be played.

Although consideration for a scale of game physicality is appropriate, for now we’ll just tally the number of hits per game between each pairing of teams:

prefix nhl: <http://www.nhl.com/>
construct {
  _:z a nhl:HitsSummary . _:z nhl:participant ?x .
  _:z nhl:participant ?y . _:z nhl:value ?v }
  {  select ?x ?y (count(?h) as ?v)
     {  ?g nhl:hometeam ?x . ?g nhl:awayteam ?y .
        ?g nhl:play ?h . ?h a nhl:Hit . }
  group by ?x ?y	}

I’d like to take these results and tie them to the geo-coordinates of each team’s home city, so that I can generate a map depicting the relative physicality of each city that season.

To do this, I can use DBpedia’s endpoint to grab each team’s city’s lat/long.  On inspection, it seems that the DBpedia data provides the relevant geo-coordinates in degree-minute-second form.  I’d much rather have it in decimal format (so that I can associate a single value for each of latitude and longitude with a team), so a little arithmetic is in order.  Fortunately, Sparql 1.1 allows arithmetic expressions in the select clauses.  Here’s the longitude query (with the latitude query being similar):

prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dbo: <http://dbpedia.org/ontology/>
prefix dbp: <http://dbpedia.org/property/>
prefix nhl: <http://www.nhl.com/>
construct { ?team nhl:longitude ?v . ?team nhl:name ?name }
  {  select ?team ?name
            ( -1 * (?d + (((?m * 60) + ?s) / 3600.0)) as ?v)
     {  ?team a dbo:HockeyTeam . ?team rdfs:label ?name .
        ?team dbp:city ?cityname . ?city rdfs:label ?cityname .
        ?city dbp:longd ?d; dbp:longm ?m; dbp:longs ?s .
        filter ( lang(?name) = 'en') }}
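To sanity-check the arithmetic: a longitude of, say, 73° 58′ 12″ W comes out as -1 * (73 + ((58 * 60) + 12) / 3600.0) = -73.97.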

The DBpedia server is a little persnickety about memory, so it complains if I try to ask for lat and long in the same query.

You might be wondering about that -1 in the front of that arithmetic.  Longitude is positive or negative depending on whether it’s east or west of the Prime Meridian.  Since all the NHL hockey teams are west of the Meridian, the decimal longitude is negative.  If the DBpedia endpoint were compliant with the very latest Sparql 1.1 spec, I could have used an IF operator to interpret the East/West part of the coordinate.  However, it seems that feature isn’t implemented yet in DBpedia’s server, so this’ll have to do.
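Just to illustrate what I mean, the select expression would look roughly like the line below.  This is only a sketch: it assumes an ?ew binding for the east/west hemisphere (from something like dbp:longew), which I haven't confirmed is consistently populated.

( IF(str(?ew) = "W", -1, 1) * (?d + (((?m * 60) + ?s) / 3600.0)) as ?v )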

All that’s left is to query for each game, its home and away teams, and join those results with the relevant coordinates and hits-summary.  That query might take a few minutes for a whole season, but it’s a static result and thus safe to save off as an RDF file for quick retrieval later.

What does this have to do with enforcement?  Well, with next season’s schedule and the current rosters, we might be able to predict this physicality distribution based on the enforcement of the players on each team for each game.  Perhaps teams with higher enforcement correlate with more physical games.  Or, maybe a game tends to have higher physicality when one team has much more enforcement than the other.  And with historical data from past seasons, there should be some opportunity for verification.  And maybe even neat maps.

A few posts back, I wrote the following as a rule for detecting enforcement events:

An event is an enforcement if it is an Enforcer Action taking place within 2 minutes after a Violent Action against one of the teammates of the Enforcer.

Since then, a few parts of this rule have been bugging me:

  • what’s an enforcer action?
  • why two minutes?
  • what’s a violent action?

So I’m going to take a step back and look at this again.  I’m defining an enforcement event to be a manifestation of an enfoo quantity (much as a gravitational attraction event would be for a mass quantity).  If these manifestations are what make somebody an enforcer, then it may be that a pattern amongst these events can be extracted that determines an ordering of players sufficiently close to the original crude orderings of enforcers. So,

  1. hockey players can be crudely ordered according to their capabilities as an enforcer.
  2. I dub this capability enfoo.
  3. I hypothesize that enfoo manifests in certain ways that can be analyzed to yield an ordering of players according to their degree of enfoo.
    • That is, I hypothesize that enfoo can be measured via enfoo manifestations, which I call enforcements.
  4. A scale for measuring enfoo via enforcements is defined, and is called the barce scale.
  5. Barce measurements are made on NHL play-by-play data, and the resulting player orderings are compared to crude orderings.
    • If the barce orderings track the crude orderings closely, then the barce scale captures the intuitions of the creators of the crude orderings (i.e. experts).
    • If the barce and crude orderings are very different, then all is not lost.  If the semantics of the barce scale are interesting, then the scale provides a new perspective on the concept of enforcement.

This all tracks the 1968 work of Brian Ellis quite well (this would be a fundamental derived measurement in his theory).  However, not everything rests on solid ground here.  Ellis didn’t really discuss social quantities (like enfoo), instead concentrating on physical quantities like mass, pressure, and electrical resistance.  Thus, the idea of looking at events (manifestations of enfoo) brings with it some difficulties.

One difficulty is that the units of the barce scale are quite unlike grams and meters.  When defining the meter, experts talk about how far light travels in a vacuum in a very short period of time.  The kilogram, meanwhile, is based on an actual object known as the IPK (International Prototype Kilogram).

The barce scale introduces two kinds of difficulties, which I alluded to at the top of this post: subjectivity and causal influence.

The events of a hockey game, such as penalties and goals, are only identified by certain officials on the ice and in the booth.  It doesn’t matter whether you or I think a hit was a cross-check, even if the officials for that game agree with us the next day – if an official doesn’t identify the hit as a cross-check (thus penalizing the hitter) during the game, then it’s not a cross-check.  The events of a hockey game, then, cannot be determined without knowing the officials’ in-game decisions.  Fortunately, these decisions are themselves objectively verifiable due to their documentation.

A more worrisome difficulty is the detection of causal influence.  Enforcements are defined as events that are in some sense caused by a provocation by a member of the other team.  For example, we might consider the following causal influences:

  1. An enforcement would not have occurred had it not been for the provocation.
  2. A provocation increased the probability of an enforcement occurring.
  3. A provocation inclined a player to perform an enforcement.

An added difficulty builds on the problem with subjectivity mentioned above: a provocation may not be noticed by an official, but nevertheless be noticed by a player who subsequently goes after the provoker.

I think one way to resolve this is to take a page out of moral theory.  Let’s say that an enforcer is a player who goes after players on the other team who do something provocative.  Let’s further say that there are good definitions for what it means to go after somebody and to be provocative in the context of a hockey game.  Then we expect that a good enforcer should go after opponents who are being provocative.  In other words, the more a player retaliates, the better he is at enforcing.

The form in use here looks like this:


"quantity Q manifest as event E"
(Q manifests-as E)

"things with Q should perform events of type E"
(x has Q) => should(x do E)

"all things being equal, x has more Q having performed e"
Q(x|e) > Q(x|~e) c.p.

It’s pseudo-logic, for sure, but hopefully the intent is somewhat clear.  Looking at that pseudo-logic, there’s nothing that prevents a quantity from manifesting in more than one way.  And that’s a good thing, because good enforcers do more than retaliate – they also deter provocations.  If a player has a reputation for retaliating harshly, then opponents may be less likely to provoke.  This result is even better for the enforcer’s team, since an enforcer’s retaliations sometimes land him in the penalty box.  Thus, we can say that a good enforcer plays in games where the opposing side doesn’t often provoke.  In other words, the less an opposing team provokes, the better an enforcer is doing.

I updated the analysis file with a new query to reflect this new bi-valued barce scale:

prefix nhl: <http://www.nhl.com/>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select ?name (sum(?enfActs) as ?enf)  (sum(?totChances) as ?opp)
where { ?x a nhl:Enforcement . ?x nhl:actor ?player .
  ?x nhl:value ?enfActs . ?x nhl:game ?game .
  ?y a nhl:ViolentPenaltyTally . ?y nhl:game ?game .
  ?y nhl:team ?otherTeam . ?y nhl:value ?totChances .
  ?z a nhl:GameRoster . ?z nhl:game ?game . ?z nhl:player ?player .
  ?z nhl:team ?team . ?player rdfs:label ?name .
  filter( ?team != ?otherTeam) }
group by ?name

The retaliation value is ?enf and the deterrent value is ?opp.  I haven’t gotten around to changing the variable names to something more suitable, so don’t read too much into them yet.

I also haven’t changed the definitions of nhl:Enforcement, nhl:ViolentPenalty, or nhl:EnforcerAction yet, so those are still on the docket for review.  I also opted to skip the Sparql Construct I used for the other two barce formulations and just spit the measurement out into a result set (which is interpreted as an Incanter dataset in seabass).  For purposes of ranking, the two values for this barce measurement can be multiplied (the inverse of the deterrent value has to be used, since enfoo is inversely proportional to the number of provocations).
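Since bounce returns an Incanter dataset, the multiplication can be done with plain Clojure over its :rows.  This is only a sketch; I'm assuming the columns come back keyed by the query's variable names (:name, :enf, :opp):

(defn rank-enforcers [ds]
  (->> (:rows ds)
       (map (fn [row]
              {:name  (:name row)
               ;; retaliations times the inverse of provocations faced;
               ;; max guards against a zero provocation count
               :score (/ (:enf row) (max (:opp row) 1))}))
       (sort-by :score >)))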

Maybe I’ll get to some charts and analysis tomorrow, if the theory bug doesn’t bite me again.

My last post had some definitions to frame this analysis of enforcement (or, I suppose, the measurement of enfoo), so I may as well go ahead and formalize them.

Enforcement

 prefix nhl: <http://www.nhl.com/>
 construct { _:x a nhl:Enforcement . _:x nhl:game ?g .
             _:x nhl:actor ?p1 . _:x nhl:value ?value }
 where {
   select ?g ?p1 (count(?e) as ?value)
   where {
     ?g nhl:play ?e . ?g nhl:play ?e2 .
     ?e nhl:agent1 ?p1 . ?e nhl:agent2 ?p2 .
     ?e2 nhl:agent1 ?p2 . ?e2 a nhl:ViolentPenalty .
     ?e a nhl:EnforcerAction . ?e2 nhl:influences ?e }
   group by ?g ?p1 }

The enforcement facts tally up the number of retaliations performed by a player during a game.  The definition of an enforcement fact on a game-by-game basis (as opposed to period-by-period or minute-by-minute) is somewhat arbitrary.  However, there’s a lot of convenience here, since it’s easy to argue that hockey is most naturally partitioned by games.  If another temporal basis made sense, here is where the first definition adjustment would be made.

I should note that there’s been a slight departure from the previous post’s definition of Enforcer Action.  Previously, I only included a few kinds of penalties to count as enforcer actions (fighting, cross-checking, roughing, and unsportsmanlike conduct).  I went ahead and included hits, since it made sense as I was typing out the Construct queries – after all, enforcers that get away with a good retaliatory hit are doing their job better, right?

influences

[rule1: (?e1 nhl:influences ?e2) <-
    (?e1 nhl:localtime ?t1), (?e2 nhl:localtime ?t2),
     le(?t1, ?t2), diff-minute(?t1, ?t2, ?diff),
     lessThan(?diff, 2)  ]

I defined the relation influences to tie events together when one takes place two minutes before another.  I couldn’t think of much else in the dataset to represent when one event is a significant cause for another, and a two minute time frame should limit the number of retaliations against a player.  Of course, it’s still an arbitrary time limit, and it’d be interesting to explore other ways of identifying causal influence.

The rest of the details can be found here on github.  As an aside, one thing I’ve learned is to keep the repository names short and sweet; this repository is just one huge hyphenated mess.

If you take a gander at the analysis.clj file, you’ll see that I changed up the manner in which the models are built up, favoring a map-reduce strategy.  I played around with pmap to see if I could get a bump out of parallelization, but ran into threading problems with the reasoner.  I’m definitely not a parallelization kind of guy, so I stuck with what worked for me 🙂

There are two scales defined in the analysis file:

barce

prefix nhl: <http://www.nhl.com/>
construct { ?player nhl:barce ?value }
where {
  select ?player (sum(?enfActs) as ?value)
  where {
    ?x a nhl:Enforcement . ?x nhl:actor ?player .
    ?x nhl:value ?enfActs  }
  group by ?player }

barce-opp

prefix nhl: <http://www.nhl.com/>
construct { ?player nhl:barce-opp ?value }
where {
  select ?player (sum(?enfActs/ ?totChances) as ?value)
  where {
    ?x a nhl:Enforcement . ?x nhl:actor ?player .
    ?x nhl:value ?enfActs . ?x nhl:game ?game .
    ?y a nhl:ViolentPenaltyTally . ?y nhl:game ?game .
    ?y nhl:team ?otherTeam . ?y nhl:value ?totChances .
    ?z a nhl:GameRoster . ?z nhl:game ?game .
    ?z nhl:player ?player . ?z nhl:team ?team .
    filter( ?team != ?otherTeam) }
  group by ?player }

The barce scale is pretty straightforward – count up all the enforcement facts for a player over all the games in the model you’re analyzing.  The advantage is in its simplicity, both for query performance and for clarity – it’s fairly easy to explain the meaning of barce measurements to your friends.

The barce-opportunity scale is a bit more complicated.  In earlier posts, I referred to a normalization of the scale, and had this in mind.  Problem is, this isn’t normalization at all.  Rather, for any particular game and player, the barce-opportunity of that player in that game is the percentage of times that the other team did something dodgy and the enforcer retaliated.  So if the opposing team provoked the enforcer’s team four times, and the enforcer retaliated twice, his barce-opportunity for that game is one-half.

For an overall barce-opportunity measure across games, I just added them up.  The problem is that measurements on the scale can no longer be understood as percentages (since you can’t just add percentages willy-nilly).  I suppose I could average them to keep the values between zero and one, which would have the advantage of a clear interpretation for measurements across games:

prefix nhl: <http://www.nhl.com/>
construct { ?player nhl:barce-opp ?value }
where {
  select ?player (avg(?enfActs/ ?totChances) as ?value)
  where {
    ?x a nhl:Enforcement . ?x nhl:actor ?player .
    ?x nhl:value ?enfActs . ?x nhl:game ?game .
    ?y a nhl:ViolentPenaltyTally . ?y nhl:game ?game .
    ?y nhl:team ?otherTeam . ?y nhl:value ?totChances .
    ?z a nhl:GameRoster . ?z nhl:game ?game .
    ?z nhl:player ?player . ?z nhl:team ?team .
    filter( ?team != ?otherTeam) }
  group by ?player }

I’ll have to think on that a bit more.  For now, I’m fairly content with the definitions, and will next see what kinds of charts and patterns fall out of all this.

Before returning to the analysis of the NHL play-by-play data I’ve been looking at, here are some definitions to frame the project:

Enfoo:  The disposition to respond to aggressive play through retaliatory checking and fighting.
Enforcer: A person considered to have a high degree of Enfoo.
Enforcement: The kind of event that is the realization of Enfoo.

Because no good name for the (dispositional) quantity under consideration came to mind, I chose a silly one.  Despite this, I think there’s sufficient reason to think that a crude ordering could be created by a group of hockey fans.  For a physical comparison, these three definitions are similar to the following:

Weight: The gravitational force on a body.
Heavy Person: A person considered to weigh a lot.
Gravitational Attraction: The kind of event where two objects attract one another due to their mass and mutual distance.

Dispositions can be tricky things to identify, so we’ll assert the following detection rule:

Detection Rule: An event is an enforcement if it is an Enforcer Action taking place within 2 minutes after a Violent Action against one of the teammates of the Enforcer.

The following definitions round out the necessary machinery:

Enforcer Action: An action penalized as fighting, cross-checking, roughing, or unsportsmanlike conduct.
Violent Action: An action penalized as charging, high-sticking, roughing, slashing, or unsportsmanlike conduct.

The goal of this analysis is to come up with a formal definition of enfoo and an appropriate measurement scale that can aid future analyses.  It would be nice if the order generated by the detection rule above was similar to rankings of enforcers by hockey analysts, but this is not necessary.  If this were a material definition of enfoo, then the goal would be to uncover the intended semantics of a group of hockey analysts that rank enforcers.  Instead, this formal definition of the disposition only depends on the semantics of the terms in its definitions (like Violent Action, Player, and Minutes).

Finally, a scale for measuring enfoo needs to be defined:

Barce: A scale for measuring the amount of enfoo a player had during a period of time.  A player has one barce when he is the agent of an enforcement once during the set period of time.  A player has (x+1) barce when he is the agent of (x+1) events of enforcement during the set period of time.

The barce scale unsurprisingly jibes with the detection rule asserted above – to measure the enfoo of a player during a game on the barce scale, the enforcement events he is penalized for are tallied.  If a normalized barce scale is desired, then the tally for the game should be divided by the total possible number of enforcement events: the number of violent actions that the other team is penalized for.