When I first started playing around with Clojure, I was thinking a bit about orderings and how (according to Ellis’ theory of measurement) they’re fundamental to the entire notion of quantity.  Essentially, you can only have a quantity if you can say that something is bigger than something else in some way.

In RDF we represent orderings (like everything else) with triples.  Since I only cared about a single ordering relation, it was safe to use ordered pairs to do this.  So the first thing I wrote was a little function to produce a map of ordered pairs.

(defn bar [num]
  ;; map each keyword :i to its successor :(i+1), for i from 1 to num
  (let [a (map #(keyword (str %)) (range 1 (+ num 1)))
        b (map #(keyword (str %)) (range 2 (+ num 2)))]
    (zipmap a b)))

And a couple of functions to produce a vector of the elements of those ordered pairs in the proper order:

;; these use clojure.set, e.g. (require '[clojure.set :refer [difference union]])
(defn infima [m]
  ;; the element that appears as a key but never as a value, i.e. the least element
  (first (difference (set (keys m)) (set (vals m)))))

(defn reduction [m]
  ;; follow the successor map from the least element, collecting elements in order
  (let [ans [(infima m)]]
    (loop [L ans]
      (let [N (m (peek L))]
        (if (nil? N)
          (vec L)
          (recur (conj L N)))))))

For example, I could generate the ordered pairs for the numbers one through five, and then order them up.

(bar 4)
>> {:4 :5, :3 :4, :2 :3, :1 :2}
(reduction (bar 4))
>> [:1 :2 :3 :4 :5]

The performance was really good – I could recover an ordering out of 1,200,000 ordered pairs in 22 seconds.  Woo, right?
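
For the curious, that figure came from a call along these lines (my sketch of the benchmark, not the original code; I wrap the result in count so the REPL doesn't try to print a 1.2-million-element vector, and the elapsed time is the roughly 22 seconds mentioned above, which will of course vary by machine):

(time (count (reduction (bar 1200000))))
>> "Elapsed time: ... msecs"
>> 1200001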

Well, since I was doing so well, I thought about the case where I ran into a triple store that declared the ordering relation I cared about to be transitive.  For example, I might want to recover the maternal ancestry line of Amanda Seyfried.  Let’s say I found a triple store with the relevant information, with the relevant property named hasMaternalAncestor.
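
If hasMaternalAncestor is transitive, the store (or a reasoner over it) holds every ancestral pair, not just the immediate mother links.  With made-up individuals, the closure of a short maternal line would look something like:

#{[:amanda :mom] [:amanda :grandma] [:amanda :great-grandma]
  [:mom :grandma] [:mom :great-grandma]
  [:grandma :great-grandma]}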

If no reasoner is working behind the scenes, I could use my reduction function above.  However, I’ll need something else if hasMaternalAncestor is declared to be transitive: once the closure is materialized, each element has many successors, so a simple successor map no longer describes the data.  And since linear orderings are transitive by definition, it’s not beyond belief to run into this situation.  So I wrote the following functions to do my dirty work:

(defn agg
  "Given a collection of ordered pairs and an element e,
  return the elements z such that [e z] is in the collection."
  [element coll]
  (map #(get % 1) (filter #(= (get % 0) element) coll)))

(defn transitive-reduction
  "Collect the elements appearing in the ordered pairs, rank each element by
  the number of successors it has (in the ranked vector the count comes first,
  then the element), and pull the elements out in descending order of successor
  count, returning the original ordering as a vector."
  [ordered-pairs]
  (let [elements (union
                   (set (map #(get % 0) ordered-pairs))
                   (set (map #(get % 1) ordered-pairs)))
        ranked   (rseq (vec (into (sorted-map)
                              (zipmap (map #(count (agg % ordered-pairs)) elements)
                                      elements))))
        ordered  (map #(get % 1) ranked)]
    (vec ordered)))
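
As a quick sanity check (my own tiny example, not from the original post), the transitive closure of a three-element chain comes back in the right order:

(transitive-reduction #{[:1 :2] [:1 :3] [:2 :3]})
>> [:1 :2 :3]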

And a new function to generate the appropriate pairs of “integers”:

(defn foo [num]
  ;; build the full transitive closure over 1..num: every pair [:i :j] with i < j
  (let [v (vec (range 1 (+ 1 num)))]
    (loop [L #{}, V v]
      (if (empty? V)
        (vec L)
        (recur (union L (set (map #(vector (keyword (str %))
                                           (keyword (str (peek V))))
                                  (pop V))))
               (pop V))))))
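
For a small input it yields the full closure (my example; the order of the pairs inside the result vector is arbitrary, since they come out of a set):

(foo 3)
>> [[:1 :2] [:1 :3] [:2 :3]]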

So I tried out a transitive reduction of the pairs for the integers from 1 to 1200.  Heck, if I can reduce over a million in 22 seconds, 1200 should be nothing, right?

(println (time (take 10 (transitive-reduction (foo 1200)))))

Three minutes later, I got my truncated answer.  I looked at my code over and over, trying to figure out what I had done wrong.  I walked away, took a nap, and came back.  Still three minutes.  Oi!

Finally, I decided to figure out just how hard my algorithm was working.  How many pairs are in (foo 1200)?  Every pair [i j] with i < j drawn from 1,200 elements, i.e. 1200 choose 2, which comes to 719,400 pairs.  And transitive-reduction calls agg once for each of the 1,200 elements, with each call scanning all of those pairs, so the whole run does roughly 1,200 x 719,400, or about 863 million, comparisons.
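
Counting the pairs directly bears that out (my own check, not from the original post):

(count (foo 1200))
>> 719400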

Dang, that’s a lot.  No wonder it took three minutes.  So yeah, counting is important.

I ran into this last night when I was looking at my definition of enforcement.  Initially, the plan was to look at a game, measure the enforcement for each player active in that game, and then aggregate those values per team for each game.

It sounded reasonable at first blush, until my queries and model building started taking way too long.  After too much time spent staring, I finally decided to count the triples:


~30 players x 2 teams x 1230 games x 4 facts per player = 295,200 facts

Yep, way too much.  My whole dataset is 50 MB on disk, and I was producing a 30 MB N-Triples file just to work with.  Oi.

Having realized the error of my ways, I kept the same enforcement scale but only kept track of the players that actually took part in an enforcement.  So instead of 30-ish players per team per game, I only tracked 3 or 4.  On top of that, I immediately extracted the team’s enforcement value for that game, meaning I only had


4 players x 2 teams x 1230 games x 4 facts per player = 39,360 facts

Much better.  This may be one of the first lessons I teach my son when he’s old enough to start counting.
