Friday
19Feb2010

PostGIS BBOX Query Gotcha

I got stung by this one after processing quite a bit of data.  When doing a nearest neighbor search, I have been leveraging the GiST index functionality in PostGIS.  The documentation describes how to take advantage of these indexes be using the && operator to first find overlapping bounding boxes and then do the compute intensive calculation on the smaller subset of matched features.  However, there is a condition in which the overlapping bounding box operator does not return the nearest features.  Perhaps this is well know, but I got hit by it.

Consider the case of searching for the nearest road from a point on a map.  A natural way of performing this search is to expand a bounding box around the point and use the && operator to select the roads that intersect that bounding box.  Then, the distance to each of those roads in the returned subset is computed and the minimum distance is returned.  Observe the following scenario:

The tiny square in the upper right (just below and to the right of the rectangle) is the bbox of the point from which we wish to find the nearest road.  The yellow rectangle is the bbox of the nearest road.  The large transparent blue rectangle is a bbox of the next nearest road.  So the only overlapping bbox for the point is the large rectangle.  Thus, the && operator does not find the nearest road and our calculation is wrong.

The solution I have come up with so far is to use the bbox operator, compute the nearest distance, and use a new bbox expanded around the point using the just computed distance.  This operation will find any overlapping bbox within that range and will come up with the correct nearest road.  I don't like this solution as it requires two && searches and multiple distance computations - not very optimal.

Wednesday
17Feb2010

Incanter and the GLM

I read somewhere that the Generalized Linear Model is the "workhorse of statistics" though I cannot seem to find the reference anymore.  The workhorse of statistics is so called because it unifies regression for the exponential family of probability distributions which includes Gaussian, Binomial, and Poisson distributions.  Instead of modeling the mean of the response variable, GLM models a continuous, differentiable transformation of the mean as a linear model of the predictor variables.  This transformation is called the link function and is unique for each distribution in the exponential family.  Once the distribution is specified, the model coefficients are determined via maximum likelihood estimation.  In particular, iteratively reweighted least squares of the likelihood function has been shown to converge on the MLE.

To implement the GLM in Clojure/Incanter, we first need to implement the IRLS algorithm.  If we assume that we know the link function (and its inverse, derivative, and the weight function), then IRLS is implemented as follows:

(defn irls [y X B invlink dlink weight eps]
  (let [
    _irls (fn [Bnext] 
          (let [
        eta (mmult X Bnext)
        mu (invlink eta)
        z (plus eta (mult (minus y mu) (dlink mu)))
        W (diag (weight mu))]
        (mmult 
         (solve (mmult (trans X) W X)) 
         (trans X) W z)))
    ]
    (last 
     (last 
      (take-while 
       (fn [x] (> (euclidean-distance 
           (first x) 
           (last x)) eps)) 
       (partition 2 1 (iterate _irls B)))))))

In the above code, we define the update step as an internal function of the updated coefficients variable.  Then, we iterate over an infinite sequence of updates until the condition that the euclidean distance between successive iterations is less than epsilon.

Next, we need to define the link functions and other associated functions of each member of the exponential family of distributions.  I have shown Gaussian and Binomial distributions below:

(defstruct family :link :invlink :dlink :weight)
(def families
     {
     :gaussian (struct-map family
           :link (fn [x] x)
           :invlink (fn [x] x)
           :dlink (fn [x] 1)
           :weight (fn [mu] (repeat (length mu) 1)))
     :binomial (struct-map family
           :link (fn [x] (log (div x (minus 1 x))))
           :invlink (fn [x] (div (exp x) (plus 1 (exp x))))
           :dlink (fn [x] [x] (div 1 (mult x (minus 1 x))))
           :weight (fn [mu] (to-vect (mult mu (minus 1 mu)))))
      })

I have used the struct-map technique from Clojure which gives me a sort of family type.  Additional families would be specified here.  Now, similar to R, we can pass the family type to a general GLM function and have one estimation technique (the IRLS defined above) for all families.  The GLM function is shown:

(defn glm 
  ([y X & opts]
      (let [opts (when opts (apply assoc {} opts))
       family (or (families (:family opts)) 
              (:gaussian families))
       intercept (or (:intercept opts) true)
       eps (or (:eps opts) 0.01)
       bstart (:bstart opts)]
       (irls y 
         X 
         bstart (:invlink family) 
         (:dlink family) 
         (:weight family) 
         eps))))

The GLM function simply delegates to the IRLS function with the distribution specific link, inverse link, etc functions.

To test the GLM, I used the example from the Incanter linear-model documentation:

user=> (use '(incanter core stats datasets charts))
nil
user=> (def iris 
  (to-matrix (get-dataset :iris) :dummies true))
#'user/iris
user=> (def y (sel iris :cols 0))
#'user/y
user=> (def x (sel iris :cols (range 1 6)))
#'user/x
user=> (def iris-lm (linear-model y x))
#'user/iris-lm
user=> (:coefs iris-lm)
(2.171266292153149 0.4958889383890437 0.8292439122349187 -0.31515517332664444 -1.0234978144907245 -0.7235619577805039)

Now, does the GLM with the Gaussian family give the same coefficients?  First, we add an intercept column.

user=> (def x (bind-columns (repeat 150 1) x))
#'user/x
user=> (glm y x 
     :bstart (matrix [1 1 1 1 1 1]) 
     :family :gaussian)
[ 2.1713
 0.4959
 0.8292
-0.3152
-1.0235
-0.7236]

Finally, to test the binomial family, I used the "infert" dataset from R:

user=> (def sp (matrix [2 0 0 0 1 1 0 0 1 0 0 1 0 0 1 0 1 1 1 1 1 0 2 1 1 2 2 2 2 0 1 0 0 2 0 2 1 2 0 1 2 0 0 1 0 0 2 0 0 2 2 2 1 1 2 2 0 2 1 2 2 1 1 2 0 1 1 2 2 0 0 1 1 2 2 1 1 0 1 1 0 1 1 0 0 0 1 0 1 0 0 1 1 0 1 0 1 0 0 2 0 1 0 0 0 0 0 1 0 0 0 2 1 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 1 0 0 1 1 1 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 1 2 0 0 0 0 0 0 1 1 0 0 0 2 0 2 0 1 0 1 1 1 0 2 0 0 2 0 1 0 0 0 0 1 2 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 1 1 0 0 0 0 0 0 0 0 2 0 0 0 2 0 0 0 0 2 1 0 1 1 1 0 0 1 1]))
#'user/sp
user=> (def in (matrix [1 1 2 2 1 2 0 0 0 0 1 2 1 2 1 2 2 0 2 0 0 2 0 0 1 0 0 0 1 2 0 1 1 0 1 0 0 1 0 0 0 0 0 1 0 1 0 2 2 0 1 1 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 1 1 0 0 0 0 0 0 1 0 0 2 0 0 2 0 2 0 2 1 0 2 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 2 0 1 1 0 0 0 1 0 1 2 1 1 2 1 1 1 1 1 1 2 1 1 2 1 0 0 0 0 0 2 1 0 1 0 0 0 0 2 0 0 0 0 0 0 2 0 2 0 0 0 0 0 1 0 0 0 0 0 1 0 1 1 1 1 0 0 2 0 0 0 0 0 2 1 0 0 0 2 0 0 0 2 0 0 0 0 0 0 0 0 1 2 1 1 2 2 2 0 1 0 2 1 0 1 1 1 0 1 0 1 0 2 0 1 0 1 0 0 1 1 0 0 0 0 2 0 0]))
#'user/in
user=> (def case (matrix [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]))
#'user/case
user=> (def X (bind-columns (repeat (length sp) 1) sp in))
#'user/X
user=> (glm case X 
  :bstart (matrix [0 1 1]) 
  :family :binomial 
  :eps 0.001)
[-1.7078
 1.1972
 0.4182]

 

Wednesday
27Jan2010

Monte Carlo Pi calc

What is the first app that you code up in a new language that you are learning?  I imagine most people start with the canonical "Hello World" and then move on to their own specific app.  A colleague of mine always codes up the Mandelbrot set which typically involves implementing a complex number class with its associated operations - good for OO languages.  For mathematical and statistical languages and APIs, I always start with Monte Carlo PI calc, a simple variant of Buffon's Needle problem.  The algorithm samples n points from a unit square and then computes the ratio of points that fall within an inscribed circle of radius .5 to the total number of samples.  This ratio should approach the area of the inscribed circle.  Therefore, PI can be computed as the simulated ratio divided by 0.25.

To continue learning Clojure and Incanter, I implemented the algorithm as follows:

(defn mc-pi-calc [n] (/ (count (filter #(<= %1 0.5)
 (map #(euclidean-distance (vec %1) [0.5 0.5]) 
   (partition 2 (sample-uniform (* 2 n)))))) 
     (* n 0.5 0.5)))
user=> (mc-pi-calc 10000)
3.1432

The function takes the number of samples as input and uses the sample-uniform function to generate n points in the unit square.  Then, it counts the number of points that fall within a circle inscribed in the unit square using the euclidean distance function and divides this count by the total number of samples to get the area of the circle to area of the square ratio.  From this, the value of PI is easily calculated.

The simulation can be visualized using a scatter plot from Incanter's charts API.

;; define the sample points
(def data (trans (map vec (partition 2 
    (sample-uniform (* 2 10000))))))

;; plot the sample points
(def p (scatter-plot (first data) (second data)))

;; overlay the points in the circle
(def data2 (trans 
   (filter #(>= 0.5 
      (euclidean-distance %1 [0.5 0.5])) 
   (trans data))))
(add-points p (first data2) (rest data2))

;; view the resulting plot
(view p)

 The code above produces the following chart.

 

The Monte Carlo Pi calc algorithm can also nicely illustrate the weak law of large numbers.  Clojure's parallel processing pmap function came in handy for this task.  The weak law of large numbers states that the probability that the sample average approaches the actual within some error approaches one as the number of samples approaches infinity. So, to demonstrate that, we define one sample as one computation of Pi fixing the number of random points at 100. Then, to obtain an estimate of the probability of the sample average being within an error fixed at 0.01, we take the average of 10 estimates of Pi 100 times and count the number that fell within the error. We repeat this raising the number of samples each time. The following code snippet implements the algorithm:

 

(def data (pmap (fn [nsamples] 
    (take 100 (repeatedly (fn [] 
       (take nsamples 
           (repeatedly #(mc-pi-calc 100))))))) 
    (range 10 100 1)))

(pmap (fn [exp] (let 
   [d (map #(/ (sum %1) (count %1)) exp)] 
   (/ (count (filter #(<= (abs (- %1 3.14159)) 0.01) d))
      (double (count d))))) 
   data)

 

The pmap function automatically threads the computationally intensive function over the input list using all available processors.  This meant my four core laptop churned for a while on this function.  The nice part of pmap is that I did not have to do anything special to get this multi-threaded functionality.  And there is no reason why pmap couldn't distribute the processing across a map-reduce cluster.

Admittedly, the MC pi calc algorithm does not converge very fast but it does illustrate the simulation capabilities of Clojure and Incanter well.

Saturday
23Jan2010

Functional programming and root finding

I recently discovered Incanter which looks really promising for statistical computing on the JVM.  Incanter is written in Clojure, a lisp like functional programming language for the JVM.  We have been using Scala, a hybrid OO/functional programming language for the JVM, in one of our applications but I have yet to find a robust statistics API for Scala.  We also use R in the same application. It would be nice to stay within the JVM for statistical procedures rather than communicate between the JVM and an R session.

I wanted to investigate Incanter, but first I needed to wrap my head around Clojure.  Folding and nesting are common procedures in functional programming languages and in numerical methods.  You can find an excellent discussion of folding here.  Many algorithms follow the iterate and accumulate procedure that naturally maps to the folding and nesting paradigms.  Newton's Method for polynomial root finding follows this paradigm as do many optimization algorithms that converge on an extrema.  As an avid Mathematica enthusiast, I often used NestList to implement and visualize the steps of nesting algorithms, but I haven't found the equivalent built into the libraries of any of the other three languages.  So, to dive in to Clojure and Incanter, I decided to implement the root finding in Clojure, Scala, and R for comparison.  First, we need a NestList equivalent.  Then, we can use the NestList function to implement the root finding algorithm which we'll test out on the trivial polynomial x^2-5 which obviously has its root at ~±2.23606797749979.

The root finding algorithm using NestList in Mathematica can be viewed here.

In Clojure, the implementation and usage is as follows:

(defn nestlist [fn iv n] (take n (iterate fn iv)))
(defn findroot [f df iv n] 
     (nestlist #(- %1 (/ (f %1) (df %1))) (double iv) n))
user=> (findroot #(- (* %1 %1) 5) #(* 2 %1) 1.0 20)
(1.0 3.0 2.3333333333333335 2.238095238095238 2.2360688956433634 2.236067977499978 2.23606797749979 2.23606797749979 2.23606797749979 2.23606797749979)

The iterate function is exactly what I was looking for.  The lazy evaluation of the sequence makes it easy to work with and the definition of the nestlist function is just syntactic sugar on the iterate function.

In Scala, the implementation of nestlist uses the foldLeft function and accumulates its results 

def nestlist(f:(Double)=>Double, iv: Double, n: Int): List[Double]={
   (0 until n).foldLeft(List(f(iv)))
        ((xs,i) => xs ++ List(f(xs.head)))
}
def findroot(f:(Double)=>Double,
     df:(Double)=>Double,iv:Double,n:Int):List[Double]={
   nestlist((x)=>x-f(x)/df(x),iv,n)
}

 The accumulating list is a bit awkward.  I'm sure there are cleaner methods for implementing nestlist in Scala.

 Usage of the method:

findroot((x)=>x*x-5,(x)=>2*x,1,10)
res1: List[Double] = List(3.0, 2.3333333333333335, 2.238095238095238, 2.2360688956433634, 2.236067977499978, 2.23606797749979, 2.23606797749979, 2.23606797749979, 2.23606797749979, 2.23606797749979, 2.23606797749979)

In R, I had to resort to a very non-FP for loop. I looked at the apply family of functions and replicate but couldn't come up with a good algorithm quickly so here is the result.

nestlist=function(f, iv, n) { 
   acc=as.vector(iv)
   for(e in 1:n) acc=append(acc, f(acc[e]))
   acc
}
findroot=function(f, df, iv, n) 
   nestlist(function(x) x-f(x)/df(x), iv, n)

And its usage:

> findroot(function(x) x^2-5, function(x) 2*x, 1, 10)
 [1] 1.000000 3.000000 2.333333 2.238095 2.236069 2.236068 2.236068 2.236068
 [9] 2.236068 2.236068 2.236068

Next steps are to start playing around with Incanter and implement some statistical procedures that utilize the root finding algorithm.

Thursday
21Jan2010

Python Static Dictionaries in Nearest Neighbor Queries

A standard query on geospatial data is the nearest neighbor query, i.e. Select the five closest police stations from a given point.  The brute force approach to this problem is joining the two tables spatially and sorting by distance limiting the result to the number requested.  Of course, for very large tables, this is extremely costly.  That's where spatial indexes come in.  PostGIS implements GiST indexes which are a general form of index that is capable of handling any kind of data with user defined keys.  In the case of a nearest neighbor query, the index is used to narrow down the number of items to perform a distance calculation against.  This vastly improves the performance of a nearest neighbor query.  A very effective algorithm for nearest neighbor queries can be found here.  This algorithm effectively grows the search area in a smart fashion until the right number of features are captured.  This reduces the overall number of distance calculations required.  The user still needs to provide an initial box in which to search, the smaller the better since it will grow.

There are situations in which one would like to know all the nearest neighbors between one feature and another feature.  This could be useful in diverse areas such as real estate planning (what is the best place to develop given proximity requirements to schools and grocery stores?) to disaster management (what is the most accessible and safest point between these hardest hit areas?).  The nearest neighbor query now has to proceed sequentially along all geometries of one feature computing nearest neighbors to all geometries along another feature.  This becomes computationally intractable as the size of the feature tables grow.

An easy way to improve this type of query is to exploit the spatial proximity of adjacent features in a table.  Using a smart sequential scan, the algorithm "remembers" the nearest neighbor distance from the last feature and uses that as the initial box size for the next feature.  The Python language extension provides a static dictionary that a stored procedure has access to to facilitate this kind of operation.  This may be quite obvious to everyone and of course it is plainly listed in the Postgres documentation, but somehow I still managed to overlook it for the longest time while trying to solve this problem.  The nearest neighbor distance can be stored and retrieved as in the following code block:

 

create or replace function nn(geom geometry, 
   featuretable text, 
   geomcol text, 
   initdist double precision, n int) 
   returns double precision
AS $$
    curdist = initdist
    key = featuretable
    if SD.has_key(key):
            curdist = int(SD[key])
    else:
            SD[key] = initdist

    // perform nearest neighbor query using curdist

$$ LANGUAGE plpythonu;

Keep in mind that this is not very threadsafe.  Two simultaneous nearest neighbor queries on the same table will interfere with each others' stored distance.  I'd imagine that one could use some data about the query plan to store the distance uniquely, but that is beyond my Postgres skills for the moment.