Saturday
Mar122011

Stochastic Gradient Descent

Most machine learning algorithms and statistical inference techniques operate on the entire dataset.  Think of ordinary least squares regression or estimating generalized linear models.  The minimization step of these algorithms is either performed in place in the case of OLS or on the global likelihood function in the case of GLM.  This property prevents the algorithms from being easily scaled - the statistical models have to be re-estimated every time a new data point arrives.  For very large datasets or for datasets with high throughput, this re-estimation step quickly becomes computationally prohibitive.  The result is out-of-date models or complex down-sampling logic in the analytical code.

To address this situation in some of the projects we are working on, we've been looking at algorithms that operate on streams of data.  One such algorithm for estimating GLMs is stochastic gradient descent.  SGD allows one to implement "online" statistical learning algorithms in the sense that as new data points arrive, the model parameters are updated in real-time.  This has some tremendous advantages - it is highly scalable (memory and computation time) as it operates on one data point at a time, it is real-time and the current optimal model is always available, etc.  In order to evaluate the effectiveness of this optimization technique, I put together a demonstration of using SGD to learn an ordinary least squares regression model of a single variable.

The demonstration shows the global likelihood surface of the regression model.  The blue point is the maximum likelihood estimate - equivalent to the least squares solution.  The red point is the current values of the SGD algorithm.  If you click through to the interactive demonstration, you can observe how the model parameters update as each data point is passed through the SGD algorithm.  As more data points are fed to the SGD algorithm, the SGD parameter estimation converges on the maximum likelihood estimation.  It is important to note that while we are showing how the algorithm converges in the global likelihood surface, it is in fact not using the surface at all - rather it is operating on the local gradient attributable to each new data point.  This is what makes the algorithm scalable and on-line.  

The image on the right shows the SGD regression line and how it approaches the MLE regression line.  

There are some potential drawbacks to the SGD algorithm.  For one, it may not converge on the optimal model parameters.  Second, it is hard to imagine how feature selection could be performed online without a significant ramp up in computational time and memory usage.  Third, complex logic needs to be incorporated into the algorithm to decay the impact that old data has on the real-time model and to accentuate the significance of the most recent data.  Nevertheless, the overall technique provides significant boosts to efficiency which may outweigh the issues.

 

 

Saturday
Jul312010

Destructuring in Mathematica

A technique that I have particularily useful in Lisp-like languages like Mathematica and Clojure is destructuring.  Destructuring is a mechanism for extracting parts of an expression.  The Lisp "code as data" paradigm lends itself to destructuring techniques.  I recently leveraged destructuring to programmatically modify some graphics I was developing to visualize recursive partitioning techniques. 

The graphics object that represents a recursive partitioning on a dataset is a dendrogram.  Mathematica provides a Dendrogram function that will visualize nested clusters, but it does not provide a way to label the branches on the Dendrogram.  The original dendrogram looks like the following.

 

In Mathematica, this object can also be manipulated in its original form.  The original form is an expression.

 

Graphics[List[List[RGBColor[0, 1, 0], List[]], 
  List[List[], 
   List[List[
     Line[List[List[1, 0], List[1, 2], List[2, 2], List[2, 0]]], 
     Line[List[List[3, 0], List[3, 2], List[4, 2], List[4, 0]]], 
     Line[List[List[1.5`, 2], List[1.5`, 4], List[3.5`, 4], 
       List[3.5`, 2]]]]]], 
  List[Text[Text[Style["B", RGBColor[0, 0, 1]]], 
    Offset[List[0, -4], List[1, 0]], List[0, 1]], 
   Text[Text[Style["A", RGBColor[1, 0, 0]]], 
    Offset[List[0, -4], List[2, 0]], List[0, 1]], 
   Text[Text[Style["A", RGBColor[1, 0, 0]]], 
    Offset[List[0, -4], List[3, 0]], List[0, 1]], 
   Text[Text[Style["B", RGBColor[0, 0, 1]]], 
    Offset[List[0, -4], List[4, 0]], List[0, 1]]]], 
 List[Rule[PlotRange, All], 
  Rule[AspectRatio, Power[GoldenRatio, -1]]]]

In the dendrogram above, each path represents a region in space that would be classified with the label at the leaf node.  Each branch node represents a rule that decides which branch of the tree should be followed to classify new data points.  (See the demonstration at the end of the post for details).  In order to get the appropriate text onto the branch nodes, I had to destructure the graphics object and construct a parallel graphics object with the textual elements of the tree.  Mathematica has Position function that takes any Mathematica expression and destructures it using a rich pattern matching library.  Getting the positions of the branch nodes is done as follows:

Extract[FullForm[d],
  Position[FullForm[d], Line[__]]] /. {Line[{_, h_, i_, _}] -> {h, i}}

{{{1, 2}, {2, 2}}, {{3, 2}, {4, 2}}, {{1.5, 4}, {3.5, 4}}}

In the expression above, the call to Position passes in the Line[___] pattern to match at any nested level the Line objects in the graphics object.  Then, the positions are extracted from the full form of the dendrogram graphic object and the center two vertices of the line are pulled out as well.  These center two vertices refer to the horizontal line of each branch node.  We can use these vertices' positions to construct the textual objects appropriately.  The following tree is the final result.

Friday
Apr022010

Latent Semantic Analysis in Solr using Clojure

I recently pushed a very alpha Solr plugin to GitHub that does unsupervised clustering on unstructured text documents.  The plugin is written in Clojure and utilizes the Incanter and associated Parallel Colt libraries.  Solr/Lucene builds an inverted index of term to document mappings.  This inverted index is exploited to perform Latent Semantic Analysis.  In a nutshell, LSA attempts to extract concepts from a term-document matrix.  A term-document matrix contains elements that indicate the frequency or some weighting of the frequency of terms in a document.  The key to LSA is rank reduction which is performed by extracting the Singular Value Decomposition of the term-document matrix.  The k highest singular values are selected from the SVD and the document-concept and term-concept matrices are reduced to rank k.  This has the effect of reducing noise due to extraneous words which in turn leads to better clustering.  In a subsequent post, I will discuss how to measure the performance of this algorithm.

I have tested the algorithm on 20 Newsgroups data set.  I started with only two newsgroups to see how well the algorithm performed.  The following chart shows the two sets of documents projected into two dimensions of the concept space.

The blue points represent documents from the sci.space newsgroup and the red points  from the rec.sports.baseball newsgroup.  One can see that the algorithm has effectively separated these two groups in the concept space.  There is some overlap in the center as well as some outliers.  As a result of the overlap, there was some mis-classification.  However, the actual clustering implemented so far is not very sophisticated.  It simply selects the most similar centroid based on cosine similarity.  A more effective clustering implementation would involve agglomerative clustering or some form of k-means clustering.

Another nice effect of SVD is the ability to extract the concept vectors.  These serve to characterize the clusters.  One can use these concept vectors to induce labels or to profile clusters.  Some of the concept vectors for the above example are:

  • us mission abort firm pegasus data pacastro system communic m contract ventur servic probe commerci  market space satellit launch
  • homer win astro saturday eighth friday sunday hit doublehead klein cub second third home game run score inning doubl

These are just two of the concept vectors.  There are k concept vectors where k is the specified reduced rank supplied to the LSA algorithm.  The next step is to map the cluster centroids to the concept vectors.

Currently, the LSA algorithm uses Parallel Colt's SVD so the matrix algebra is done in-memory.  This means that it will only work for small numbers (300-500) of documents.  The next step is to investigate moving to Apache Mahout's distributed matrix library.

Friday
Feb192010

PostGIS BBOX Query Gotcha

I got stung by this one after processing quite a bit of data.  When doing a nearest neighbor search, I have been leveraging the GiST index functionality in PostGIS.  The documentation describes how to take advantage of these indexes be using the && operator to first find overlapping bounding boxes and then do the compute intensive calculation on the smaller subset of matched features.  However, there is a condition in which the overlapping bounding box operator does not return the nearest features.  Perhaps this is well know, but I got hit by it.

Consider the case of searching for the nearest road from a point on a map.  A natural way of performing this search is to expand a bounding box around the point and use the && operator to select the roads that intersect that bounding box.  Then, the distance to each of those roads in the returned subset is computed and the minimum distance is returned.  Observe the following scenario:

The tiny square in the upper right (just below and to the right of the rectangle) is the bbox of the point from which we wish to find the nearest road.  The yellow rectangle is the bbox of the nearest road.  The large transparent blue rectangle is a bbox of the next nearest road.  So the only overlapping bbox for the point is the large rectangle.  Thus, the && operator does not find the nearest road and our calculation is wrong.

The solution I have come up with so far is to use the bbox operator, compute the nearest distance, and use a new bbox expanded around the point using the just computed distance.  This operation will find any overlapping bbox within that range and will come up with the correct nearest road.  I don't like this solution as it requires two && searches and multiple distance computations - not very optimal.

Wednesday
Feb172010

Incanter and the GLM

I read somewhere that the Generalized Linear Model is the "workhorse of statistics" though I cannot seem to find the reference anymore.  The workhorse of statistics is so called because it unifies regression for the exponential family of probability distributions which includes Gaussian, Binomial, and Poisson distributions.  Instead of modeling the mean of the response variable, GLM models a continuous, differentiable transformation of the mean as a linear model of the predictor variables.  This transformation is called the link function and is unique for each distribution in the exponential family.  Once the distribution is specified, the model coefficients are determined via maximum likelihood estimation.  In particular, iteratively reweighted least squares of the likelihood function has been shown to converge on the MLE.

To implement the GLM in Clojure/Incanter, we first need to implement the IRLS algorithm.  If we assume that we know the link function (and its inverse, derivative, and the weight function), then IRLS is implemented as follows:

(defn irls [y X B invlink dlink weight eps]
  (let [
    _irls (fn [Bnext] 
          (let [
        eta (mmult X Bnext)
        mu (invlink eta)
        z (plus eta (mult (minus y mu) (dlink mu)))
        W (diag (weight mu))]
        (mmult 
         (solve (mmult (trans X) W X)) 
         (trans X) W z)))
    ]
    (last 
     (last 
      (take-while 
       (fn [x] (> (euclidean-distance 
           (first x) 
           (last x)) eps)) 
       (partition 2 1 (iterate _irls B)))))))

In the above code, we define the update step as an internal function of the updated coefficients variable.  Then, we iterate over an infinite sequence of updates until the condition that the euclidean distance between successive iterations is less than epsilon.

Next, we need to define the link functions and other associated functions of each member of the exponential family of distributions.  I have shown Gaussian and Binomial distributions below:

(defstruct family :link :invlink :dlink :weight)
(def families
     {
     :gaussian (struct-map family
           :link (fn [x] x)
           :invlink (fn [x] x)
           :dlink (fn [x] 1)
           :weight (fn [mu] (repeat (length mu) 1)))
     :binomial (struct-map family
           :link (fn [x] (log (div x (minus 1 x))))
           :invlink (fn [x] (div (exp x) (plus 1 (exp x))))
           :dlink (fn [x] [x] (div 1 (mult x (minus 1 x))))
           :weight (fn [mu] (to-vect (mult mu (minus 1 mu)))))
      })

I have used the struct-map technique from Clojure which gives me a sort of family type.  Additional families would be specified here.  Now, similar to R, we can pass the family type to a general GLM function and have one estimation technique (the IRLS defined above) for all families.  The GLM function is shown:

(defn glm 
  ([y X & opts]
      (let [opts (when opts (apply assoc {} opts))
       family (or (families (:family opts)) 
              (:gaussian families))
       intercept (or (:intercept opts) true)
       eps (or (:eps opts) 0.01)
       bstart (:bstart opts)]
       (irls y 
         X 
         bstart (:invlink family) 
         (:dlink family) 
         (:weight family) 
         eps))))

The GLM function simply delegates to the IRLS function with the distribution specific link, inverse link, etc functions.

To test the GLM, I used the example from the Incanter linear-model documentation:

user=> (use '(incanter core stats datasets charts))
nil
user=> (def iris 
  (to-matrix (get-dataset :iris) :dummies true))
#'user/iris
user=> (def y (sel iris :cols 0))
#'user/y
user=> (def x (sel iris :cols (range 1 6)))
#'user/x
user=> (def iris-lm (linear-model y x))
#'user/iris-lm
user=> (:coefs iris-lm)
(2.171266292153149 0.4958889383890437 0.8292439122349187 -0.31515517332664444 -1.0234978144907245 -0.7235619577805039)

Now, does the GLM with the Gaussian family give the same coefficients?  First, we add an intercept column.

user=> (def x (bind-columns (repeat 150 1) x))
#'user/x
user=> (glm y x 
     :bstart (matrix [1 1 1 1 1 1]) 
     :family :gaussian)
[ 2.1713
 0.4959
 0.8292
-0.3152
-1.0235
-0.7236]

Finally, to test the binomial family, I used the "infert" dataset from R:

user=> (def sp (matrix [2 0 0 0 1 1 0 0 1 0 0 1 0 0 1 0 1 1 1 1 1 0 2 1 1 2 2 2 2 0 1 0 0 2 0 2 1 2 0 1 2 0 0 1 0 0 2 0 0 2 2 2 1 1 2 2 0 2 1 2 2 1 1 2 0 1 1 2 2 0 0 1 1 2 2 1 1 0 1 1 0 1 1 0 0 0 1 0 1 0 0 1 1 0 1 0 1 0 0 2 0 1 0 0 0 0 0 1 0 0 0 2 1 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 1 0 0 1 1 1 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 1 2 0 0 0 0 0 0 1 1 0 0 0 2 0 2 0 1 0 1 1 1 0 2 0 0 2 0 1 0 0 0 0 1 2 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 1 1 0 0 0 0 0 0 0 0 2 0 0 0 2 0 0 0 0 2 1 0 1 1 1 0 0 1 1]))
#'user/sp
user=> (def in (matrix [1 1 2 2 1 2 0 0 0 0 1 2 1 2 1 2 2 0 2 0 0 2 0 0 1 0 0 0 1 2 0 1 1 0 1 0 0 1 0 0 0 0 0 1 0 1 0 2 2 0 1 1 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 1 1 0 0 0 0 0 0 1 0 0 2 0 0 2 0 2 0 2 1 0 2 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 2 0 1 1 0 0 0 1 0 1 2 1 1 2 1 1 1 1 1 1 2 1 1 2 1 0 0 0 0 0 2 1 0 1 0 0 0 0 2 0 0 0 0 0 0 2 0 2 0 0 0 0 0 1 0 0 0 0 0 1 0 1 1 1 1 0 0 2 0 0 0 0 0 2 1 0 0 0 2 0 0 0 2 0 0 0 0 0 0 0 0 1 2 1 1 2 2 2 0 1 0 2 1 0 1 1 1 0 1 0 1 0 2 0 1 0 1 0 0 1 1 0 0 0 0 2 0 0]))
#'user/in
user=> (def case (matrix [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]))
#'user/case
user=> (def X (bind-columns (repeat (length sp) 1) sp in))
#'user/X
user=> (glm case X 
  :bstart (matrix [0 1 1]) 
  :family :binomial 
  :eps 0.001)
[-1.7078
 1.1972
 0.4182]