<?xml version="1.0" encoding="UTF-8"?>
<!--Generated by Squarespace Site Server v5.11.81 (http://www.squarespace.com/) on Mon, 06 Feb 2012 16:41:33 GMT--><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0"><channel><title>Blog</title><link>http://www.ccri.com/blog/</link><description></description><lastBuildDate>Sat, 12 Mar 2011 22:32:20 +0000</lastBuildDate><copyright></copyright><language>en-US</language><generator>Squarespace Site Server v5.11.81 (http://www.squarespace.com/)</generator><item><title>Stochastic Gradient Descent</title><dc:creator>CCRI</dc:creator><pubDate>Sat, 12 Mar 2011 20:59:20 +0000</pubDate><link>http://www.ccri.com/blog/2011/3/12/stochastic-gradient-descent.html</link><guid isPermaLink="false">385495:5586438:10764300</guid><description><![CDATA[<p>Most machine learning algorithms and statistical inference techniques operate on the entire dataset. &nbsp;Think of ordinary least squares regression or estimating generalized linear models. &nbsp;The minimization step of these algorithms is either performed in place in the case of OLS or on the global likelihood function in the case of GLM. &nbsp;This property prevents the algorithms from being easily scaled - the statistical models have to be re-estimated every time a new data point arrives. &nbsp;For very large datasets or for datasets with high throughput, this re-estimation step quickly becomes computationally prohibitive. &nbsp;The result is out-of-date models or complex down-sampling logic in the analytical code.</p>
<p>To address this situation in some of the projects we are working on, we've been looking at algorithms that operate on streams of data. &nbsp;One such algorithm for estimating GLMs is <a href="http://en.wikipedia.org/wiki/Stochastic_gradient_descent">stochastic gradient descent</a>. &nbsp;SGD allows one to implement "online" statistical learning algorithms in the sense that as new data points arrive, the model parameters are updated in real-time. &nbsp;This has some tremendous advantages - it is highly scalable (memory and computation time) as it operates on one data point at a time, it is real-time and the current optimal model is always available, etc. &nbsp;In order to evaluate the effectiveness of this optimization technique, I put together a demonstration of using SGD to learn an ordinary least squares regression model of a single variable.</p>
<p><span class="full-image-float-left ssNonEditable"><span><a href="http://demonstrations.wolfram.com/StochasticGradientDescent/"><img style="width: 240px;" src="http://www.ccri.com/storage/popup_1.jpg?__SQUARESPACE_CACHEVERSION=1299967937570" alt="" /></a></span></span></p>
<p>The demonstration shows the global likelihood surface of the regression model. &nbsp;The blue point is the maximum likelihood estimate - equivalent to the least squares solution. &nbsp;The red point marks the current parameter values of the SGD algorithm. &nbsp;If you click through to the interactive demonstration, you can observe how the model parameters update as each data point is passed through the SGD algorithm. &nbsp;As more data points are fed to the SGD algorithm, the SGD parameter estimate converges on the maximum likelihood estimate. &nbsp;It is important to note that while we are showing how the algorithm converges in the global likelihood surface, it is in fact not using the surface at all - rather it is operating on the local gradient attributable to each new data point. &nbsp;This is what makes the algorithm scalable and online.</p>
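<p>The per-point update described above can be sketched in a few lines of Python. &nbsp;This is only an illustration under stated assumptions, not the code behind the demonstration: the synthetic data (true intercept 1, slope 2), the constant learning rate <code>lr</code>, and the helper names <code>sgd_step</code> and <code>ols_fit</code> are all made up for the example.</p>

```python
import random

def sgd_step(a, b, x, y, lr):
    """One online update of the model y ~ a + b*x using only the point (x, y)."""
    err = (a + b * x) - y                     # residual of the current model at this point
    # gradient of 0.5 * err**2 with respect to (a, b) is (err, err * x)
    return a - lr * err, b - lr * err * x

def ols_fit(xs, ys):
    """Closed-form least squares fit over the whole dataset, for comparison."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope

# Synthetic stream (assumption): y = 1 + 2x + Gaussian noise
rng = random.Random(0)
xs = [rng.random() for _ in range(20000)]
ys = [1.0 + 2.0 * x + rng.gauss(0, 0.1) for x in xs]

a, b = 0.0, 0.0
for x, y in zip(xs, ys):                      # each point is seen once, then discarded
    a, b = sgd_step(a, b, x, y, lr=0.02)

a_ols, b_ols = ols_fit(xs, ys)
print("SGD:", a, b)
print("OLS:", a_ols, b_ols)
```

<p>With a constant learning rate the SGD estimate keeps jittering around the least squares solution rather than landing on it exactly; a decaying learning rate is one common way to tighten that.</p>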
<p><span class="full-image-float-right ssNonEditable"><span><a href="http://demonstrations.wolfram.com/StochasticGradientDescent/"><img src="http://www.ccri.com/storage/popup_3.jpg?__SQUARESPACE_CACHEVERSION=1299968622662" alt="" /></a></span></span></p>
<p>The image on the right shows the SGD regression line and how it approaches the MLE regression line. &nbsp;</p>
<p>There are some potential drawbacks to the SGD algorithm. &nbsp;For one, it may not converge on the optimal model parameters. &nbsp;Second, it is hard to imagine how feature selection could be performed online without a significant&nbsp;ramp up in computational time and memory usage. &nbsp;Third, complex logic needs to be incorporated into the algorithm to decay the impact that old data has on the real-time model and to accentuate the significance of the most recent data. &nbsp;Nevertheless, the overall technique provides significant boosts to efficiency which may outweigh the issues.</p>
<p>&nbsp;</p>
<p>&nbsp;</p>]]></description><wfw:commentRss>http://www.ccri.com/blog/rss-comments-entry-10764300.xml</wfw:commentRss></item><item><title>Destructuring in Mathematica</title><dc:creator>CCRI</dc:creator><pubDate>Sat, 31 Jul 2010 15:17:05 +0000</pubDate><link>http://www.ccri.com/blog/2010/7/31/destructuring-in-mathematica.html</link><guid isPermaLink="false">385495:5586438:8416192</guid><description><![CDATA[<p>A technique that I have found particularly useful in Lisp-like languages such as Mathematica and Clojure is destructuring.&nbsp; Destructuring is a mechanism for extracting parts of an expression.&nbsp; The Lisp "code as data" paradigm lends itself to destructuring techniques.&nbsp; I recently leveraged destructuring to programmatically modify some graphics I was developing to visualize <a href="http://bit.ly/9HKzom">recursive partitioning</a> techniques.&nbsp;</p>
<p><span class="full-image-float-right ssNonEditable"><span><a href="http://demonstrations.wolfram.com/RecursivePartitioningForSupervisedLearning/"><img src="http://demonstrations.wolfram.com/RecursivePartitioningForSupervisedLearning/thumbnail_174.jpg?__SQUARESPACE_CACHEVERSION=1283021028862" alt="" /></a></span></span></p>
<p>The graphics object that represents a recursive partitioning on a dataset is a dendrogram.&nbsp; Mathematica provides a Dendrogram function that will visualize nested clusters, but it does not provide a way to label the branches on the Dendrogram.&nbsp; The original dendrogram looks like the following.</p>
<p><span class="full-image-block ssNonEditable"><span><img src="http://www.ccri.com/storage/rpart.png?__SQUARESPACE_CACHEVERSION=1283013390687" alt="" /></span></span></p>
<p>&nbsp;</p>
<p>In Mathematica, this object can also be manipulated in its original form.&nbsp; The original form is an expression.</p>
<p>&nbsp;</p>
<pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee; font-size: 12px; border: 1px dashed #999999; line-height: 14px; padding: 5px; overflow: auto; width: 100%;"><code>Graphics[List[List[RGBColor[0, 1, 0], List[]], 
  List[List[], 
   List[List[
     Line[List[List[1, 0], List[1, 2], List[2, 2], List[2, 0]]], 
     Line[List[List[3, 0], List[3, 2], List[4, 2], List[4, 0]]], 
     Line[List[List[1.5`, 2], List[1.5`, 4], List[3.5`, 4], 
       List[3.5`, 2]]]]]], 
  List[Text[Text[Style["B", RGBColor[0, 0, 1]]], 
    Offset[List[0, -4], List[1, 0]], List[0, 1]], 
   Text[Text[Style["A", RGBColor[1, 0, 0]]], 
    Offset[List[0, -4], List[2, 0]], List[0, 1]], 
   Text[Text[Style["A", RGBColor[1, 0, 0]]], 
    Offset[List[0, -4], List[3, 0]], List[0, 1]], 
   Text[Text[Style["B", RGBColor[0, 0, 1]]], 
    Offset[List[0, -4], List[4, 0]], List[0, 1]]]], 
 List[Rule[PlotRange, All], 
  Rule[AspectRatio, Power[GoldenRatio, -1]]]]
</code></pre>
<p>In the dendrogram above, each path represents a region in space that would be classified with the label at the leaf node. &nbsp;Each branch node represents a rule that decides which branch of the tree should be followed to classify new data points. &nbsp;(See the&nbsp;demonstration&nbsp;at the end of the post for details). &nbsp;In order to get the appropriate text onto the branch nodes, I had to destructure the graphics object and construct a parallel graphics object with the textual elements of the tree. &nbsp;Mathematica has a Position function that takes any Mathematica expression and destructures it using a rich pattern matching library. &nbsp;Getting the positions of the branch nodes is done as follows:</p>
<pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee; font-size: 12px; border: 1px dashed #999999; line-height: 14px; padding: 5px; overflow: auto; width: 100%;"><code><p>Extract[FullForm[d], <br /> &nbsp; Position[FullForm[d], Line[__]]] /. {Line[{_, h_, i_, _}] -&gt; {h, i}}</p><p>{{{1, 2}, {2, 2}}, {{3, 2}, {4, 2}}, {{1.5, 4}, {3.5, 4}}}</p></code></pre>
<p>In the expression above, the call to Position passes in the Line[__] pattern to match the Line objects at any nesting level of the graphics object. &nbsp;Then, the matching expressions are extracted from the full form of the dendrogram graphics object and the center two vertices of each line are pulled out. &nbsp;These center two vertices are the endpoints of the horizontal line of each branch node. &nbsp;We can use these vertices' positions to construct the textual objects appropriately. &nbsp;The following tree is the final result.</p>
<p><span class="full-image-block ssNonEditable"><span><img src="http://www.ccri.com/storage/finalrparttree.png?__SQUARESPACE_CACHEVERSION=1283014344937" alt="" /></span></span></p>]]></description><wfw:commentRss>http://www.ccri.com/blog/rss-comments-entry-8416192.xml</wfw:commentRss></item><item><title>Latent Semantic Analysis in Solr using Clojure</title><dc:creator>CCRI</dc:creator><pubDate>Fri, 02 Apr 2010 15:14:07 +0000</pubDate><link>http://www.ccri.com/blog/2010/4/2/latent-semantic-analysis-in-solr-using-clojure.html</link><guid isPermaLink="false">385495:5586438:7211797</guid><description><![CDATA[<p>I recently pushed a very alpha <a href="http://github.com/algoriffic/lsa4solr">Solr plugin</a>&nbsp;to GitHub that does unsupervised clustering on unstructured text documents. &nbsp;The plugin is written in Clojure and utilizes the Incanter and associated Parallel Colt libraries. &nbsp;Solr/Lucene builds an inverted index of term to document mappings. &nbsp;This inverted index is exploited to perform <a href="http://en.wikipedia.org/wiki/Latent_semantic_analysis">Latent Semantic Analysis</a>. &nbsp;In a nutshell, LSA attempts to extract concepts from a term-document matrix. &nbsp;A term-document matrix contains elements that indicate the frequency or some weighting of the frequency of terms in a document. &nbsp;The key to LSA is rank reduction which is performed by extracting the <a href="http://en.wikipedia.org/wiki/Singular_value_decomposition">Singular Value Decomposition</a>&nbsp;of the term-document matrix. &nbsp;The k highest singular values are selected from the SVD and the document-concept and term-concept matrices are reduced to rank k. &nbsp;This has the effect of reducing noise due to extraneous words which in turn leads to better clustering. &nbsp;In a subsequent post, I will discuss how to measure the performance of this algorithm.</p>
<p>I have tested the algorithm on the <a href="http://people.csail.mit.edu/jrennie/20Newsgroups/">20 Newsgroups</a>&nbsp;data set. &nbsp;I started with only two newsgroups to see how well the algorithm performed. &nbsp;The following chart shows the two sets of documents projected into two dimensions of the concept space.</p>
<p><span class="full-image-block ssNonEditable"><span><img style="width: 400px;" src="http://www.ccri.com/storage/science_baseball.png?__SQUARESPACE_CACHEVERSION=1270223265849" alt="" /></span></span></p>
<p>The blue points represent documents from the sci.space newsgroup and the red points documents from the rec.sport.baseball newsgroup. &nbsp;One can see that the algorithm has effectively separated these two groups in the concept space. &nbsp;There is some overlap in the center as well as some outliers. &nbsp;As a result of the overlap, there was some misclassification. &nbsp;However, the actual clustering implemented so far is not very sophisticated. &nbsp;It simply assigns each document to the most similar centroid based on cosine similarity. &nbsp;A more effective clustering implementation would involve agglomerative clustering or some form of k-means clustering.</p>
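<p>The assignment rule mentioned above, picking the centroid with the highest cosine similarity, can be sketched as follows. &nbsp;This is a Python illustration rather than the plugin's Clojure code; the two-dimensional centroids, the labels, and the function names are toy values assumed for the example.</p>

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def nearest_centroid(doc, centroids):
    """Return the label of the centroid most similar to the document vector."""
    return max(centroids, key=lambda label: cosine_similarity(doc, centroids[label]))

# Hypothetical centroids in a 2-dimensional concept space (made-up numbers)
centroids = {"space": [0.9, 0.1], "baseball": [0.1, 0.8]}

for doc in ([0.7, 0.2], [0.05, 0.6]):
    print(doc, nearest_centroid(doc, centroids))
```

<p>Because cosine similarity ignores vector length, long and short documents about the same topic end up with the same label, which is usually what one wants for text.</p>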
<p>Another nice effect of SVD is the ability to extract the concept vectors. &nbsp;These serve to characterize the clusters. &nbsp;One can use these concept vectors to induce labels or to profile clusters. &nbsp;Some of the concept vectors for the above example are:</p>
<ul>
<li>us mission abort firm pegasus data pacastro system communic m contract ventur servic probe commerci market space satellit launch</li>
<li>homer win astro saturday eighth friday sunday hit doublehead klein cub second third home game run score inning doubl</li>
</ul>
<p>These are just two of the concept vectors. &nbsp;There are k concept vectors where k is the specified reduced rank supplied to the LSA algorithm. &nbsp;The next step is to map the cluster centroids to the concept vectors.</p>
<p>Currently, the LSA algorithm uses Parallel Colt's SVD so the matrix algebra is done in-memory. &nbsp;This means that it will only work for small numbers (300-500) of documents. &nbsp;The next step is to investigate moving to Apache Mahout's distributed matrix library.</p>]]></description><wfw:commentRss>http://www.ccri.com/blog/rss-comments-entry-7211797.xml</wfw:commentRss></item><item><title>PostGIS BBOX Query Gotcha</title><dc:creator>CCRI</dc:creator><pubDate>Fri, 19 Feb 2010 14:34:32 +0000</pubDate><link>http://www.ccri.com/blog/2010/2/19/postgis-bbox-query-gotcha.html</link><guid isPermaLink="false">385495:5586438:6755740</guid><description><![CDATA[<p>I got stung by this one after processing quite a bit of data. &nbsp;When doing a nearest neighbor search, I have been leveraging the GiST index functionality in PostGIS. &nbsp;The documentation describes how to <a href="http://postgis.refractions.net/docs/ch04.html#id2806074">take advantage</a> of these indexes by using the &amp;&amp; operator to first find overlapping bounding boxes and then do the compute-intensive calculation on the smaller subset of matched features. &nbsp;However, there is a condition in which the overlapping bounding box operator does not return the nearest features. &nbsp;Perhaps this is well known, but I got hit by it.</p>
<p>Consider the case of searching for the nearest road from a point on a map. &nbsp;A natural way of performing this search is to expand a bounding box around the point and use the &amp;&amp; operator to select the roads that intersect that bounding box. &nbsp;Then, the distance to each of those roads in the returned subset is computed and the minimum distance is returned. &nbsp;Observe the following scenario:</p>
<p><span class="full-image-block ssNonEditable"><span><img style="width: 400px;" src="http://www.ccri.com/storage/gotcha.png?__SQUARESPACE_CACHEVERSION=1266592927783" alt="" /></span></span></p>
<p>The tiny square in the upper right (just below and to the right of the rectangle) is the bbox of the point from which we wish to find the nearest road. &nbsp;The yellow rectangle is the bbox of the nearest road. &nbsp;The large transparent blue rectangle is a bbox of the next nearest road. &nbsp;So the only overlapping bbox for the point is the large rectangle. &nbsp;Thus, the &amp;&amp; operator does not find the nearest road and our calculation is wrong.</p>
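<p>The scenario above can be reproduced outside the database with a small geometric sketch. &nbsp;The Python below is only an illustration: the coordinates, the road names, and the helper functions are invented for the example, and <code>bboxes_overlap</code> stands in for the bounding box test that the &amp;&amp; operator performs.</p>

```python
import math

def bbox_of(segment):
    """Axis-aligned bounding box (xmin, ymin, xmax, ymax) of a line segment."""
    (x1, y1), (x2, y2) = segment
    return (min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2))

def bboxes_overlap(a, b):
    """True when two bounding boxes intersect (edges touching counts as overlap)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def point_segment_distance(p, segment):
    """Exact distance from point p to a non-degenerate line segment."""
    (x1, y1), (x2, y2) = segment
    dx, dy = x2 - x1, y2 - y1
    t = ((p[0] - x1) * dx + (p[1] - y1) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))                 # clamp the projection onto the segment
    return math.hypot(p[0] - (x1 + t * dx), p[1] - (y1 + t * dy))

point = (0.0, 0.0)
search_box = (-1.0, -1.0, 1.0, 1.0)           # small box expanded around the point

# Made-up roads: a short one just outside the box, and a long diagonal one
# that is far away but whose large bounding box reaches into the search box.
roads = {
    "near": ((1.5, -1.0), (1.5, 1.0)),
    "far": ((0.5, 20.0), (-20.0, 0.5)),
}

candidates = [name for name, seg in roads.items()
              if bboxes_overlap(search_box, bbox_of(seg))]
true_nearest = min(roads, key=lambda name: point_segment_distance(point, roads[name]))

print("bbox candidates:", candidates)
print("true nearest:", true_nearest)
```

<p>The bounding box filter keeps only the far road, so taking the minimum distance over the filtered candidates silently returns the wrong answer.</p>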
<p>The solution I have come up with so far is to use the bbox operator, compute the nearest distance among the matches, and then search again with a new bbox expanded around the point by the just-computed distance. &nbsp;This second search will find any overlapping bbox within that range and will come up with the correct nearest road. &nbsp;I don't like this solution as it requires two &amp;&amp; searches and multiple distance computations - far from optimal.</p>]]></description><wfw:commentRss>http://www.ccri.com/blog/rss-comments-entry-6755740.xml</wfw:commentRss></item><item><title>Incanter and the GLM</title><dc:creator>CCRI</dc:creator><pubDate>Wed, 17 Feb 2010 15:54:06 +0000</pubDate><link>http://www.ccri.com/blog/2010/2/17/incanter-and-the-glm.html</link><guid isPermaLink="false">385495:5586438:6724641</guid><description><![CDATA[<p>I read somewhere that the <a href="http://en.wikipedia.org/wiki/Generalized_linear_model">Generalized Linear Model</a> is the "workhorse of statistics" though I cannot seem to find the reference anymore. &nbsp;The workhorse of statistics is so called because it unifies regression for the exponential family of probability distributions which includes Gaussian, Binomial, and Poisson distributions. &nbsp;Instead of modeling the mean of the response variable, GLM models a continuous, differentiable transformation of the mean as a linear model of the predictor variables. &nbsp;This transformation is called the link function, and each distribution in the exponential family has its own canonical choice. &nbsp;Once the distribution is specified, the model coefficients are determined via maximum likelihood estimation. &nbsp;In particular, <a href="http://en.wikipedia.org/wiki/Iteratively_reweighted_least_squares">iteratively reweighted least squares</a> (IRLS) has been shown to converge on the MLE.</p>
<p>To implement the GLM in Clojure/Incanter, we first need to implement the IRLS algorithm. &nbsp;If we assume that we know the link function (and its inverse, derivative, and the weight function), then IRLS is implemented as follows:</p>
<pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee; font-size: 12px; border: 1px dashed #999999; line-height: 14px; padding: 5px; overflow: auto; width: 100%;"><code>(defn irls [y X B invlink dlink weight eps]
  (let [
    _irls (fn [Bnext] 
          (let [
        eta (mmult X Bnext)
        mu (invlink eta)
        z (plus eta (mult (minus y mu) (dlink mu)))
        W (diag (weight mu))]
        (mmult 
         (solve (mmult (trans X) W X)) 
         (trans X) W z)))
    ]
    (last 
     (last 
      (take-while 
       (fn [x] (&gt; (euclidean-distance 
           (first x) 
           (last x)) eps)) 
       (partition 2 1 (iterate _irls B)))))))</code></pre>
<p>In the above code, the update step is defined as an internal function of the current coefficient vector. &nbsp;We then iterate over the infinite sequence of updates, stopping once the Euclidean distance between successive iterates is no greater than epsilon.</p>
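<p>Written out, the update that the code above performs on each pass looks like the following. &nbsp;The symbols are a notational assumption chosen to mirror the code: g is the link function (so <code>invlink</code> is its inverse and <code>dlink</code> its derivative), w is the <code>weight</code> function, and the superscript counts iterations.</p>

```latex
\begin{aligned}
\eta &= X\beta^{(t)}, \qquad \mu = g^{-1}(\eta), \\
z &= \eta + (y - \mu)\, g'(\mu), \qquad W = \operatorname{diag}\bigl(w(\mu)\bigr), \\
\beta^{(t+1)} &= \left(X^{\top} W X\right)^{-1} X^{\top} W z .
\end{aligned}
```

<p>Each pass is therefore a weighted least squares fit of the working response z on X, with weights recomputed from the current mean estimate.</p>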
<p>Next, we need to define the link functions and other associated functions of each member of the exponential family of distributions. &nbsp;I have shown Gaussian and Binomial&nbsp;distributions below:</p>
<pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee; font-size: 12px; border: 1px dashed #999999; line-height: 14px; padding: 5px; overflow: auto; width: 100%;"><code>(defstruct family :link :invlink :dlink :weight)
(def families
     {
     :gaussian (struct-map family
           :link (fn [x] x)
           :invlink (fn [x] x)
           :dlink (fn [x] 1)
           :weight (fn [mu] (repeat (length mu) 1)))
     :binomial (struct-map family
           :link (fn [x] (log (div x (minus 1 x))))
           :invlink (fn [x] (div (exp x) (plus 1 (exp x))))
           :dlink (fn [x] (div 1 (mult x (minus 1 x))))
           :weight (fn [mu] (to-vect (mult mu (minus 1 mu)))))
      })
</code></pre>
<p>I have used the struct-map technique from Clojure which gives me a sort of family type. &nbsp;Additional families would be specified here. &nbsp;Now, similar to R, we can pass the family type to a general GLM function and have one estimation technique (the IRLS defined above) for all families. &nbsp;The GLM function is shown:</p>
<pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee; font-size: 12px; border: 1px dashed #999999; line-height: 14px; padding: 5px; overflow: auto; width: 100%;"><code>(defn glm 
  ([y X &amp; opts]
      (let [opts (when opts (apply assoc {} opts))
       family (or (families (:family opts)) 
              (:gaussian families))
       intercept (or (:intercept opts) true)
       eps (or (:eps opts) 0.01)
       bstart (:bstart opts)]
       (irls y 
         X 
         bstart (:invlink family) 
         (:dlink family) 
         (:weight family) 
         eps))))
</code></pre>
<p>The GLM function simply delegates to the IRLS function, passing along the distribution-specific inverse link, link derivative, and weight functions.</p>
<p>To test the GLM, I used the example from the Incanter linear-model documentation:</p>
<pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee; font-size: 12px; border: 1px dashed #999999; line-height: 14px; padding: 5px; overflow: auto; width: 100%;"><code>user=&gt; (use '(incanter core stats datasets charts))
nil
user=&gt; (def iris 
  (to-matrix (get-dataset :iris) :dummies true))
#'user/iris
user=&gt; (def y (sel iris :cols 0))
#'user/y
user=&gt; (def x (sel iris :cols (range 1 6)))
#'user/x
user=&gt; (def iris-lm (linear-model y x))
#'user/iris-lm
user=&gt; (:coefs iris-lm)
(2.171266292153149 0.4958889383890437 0.8292439122349187 -0.31515517332664444 -1.0234978144907245 -0.7235619577805039)
</code></pre>
<p>Now, does the GLM with the Gaussian family give the same coefficients? &nbsp;First, we add an intercept column.</p>
<pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee; font-size: 12px; border: 1px dashed #999999; line-height: 14px; padding: 5px; overflow: auto; width: 100%;"><code>user=&gt; (def x (bind-columns (repeat 150 1) x))
#'user/x
user=&gt; (glm y x 
     :bstart (matrix [1 1 1 1 1 1]) 
     :family :gaussian)
[ 2.1713
 0.4959
 0.8292
-0.3152
-1.0235
-0.7236]
</code></pre>
<p>Finally, to test the binomial family, I used the "infert" dataset from R:</p>
<pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee; font-size: 12px; border: 1px dashed #999999; line-height: 14px; padding: 5px; overflow: auto; width: 100%;"><code>user=&gt; (def sp (matrix [2 0 0 0 1 1 0 0 1 0 0 1 0 0 1 0 1 1 1 1 1 0 2 1 1 2 2 2 2 0 1 0 0 2 0 2 1 2 0 1 2 0 0 1 0 0 2 0 0 2 2 2 1 1 2 2 0 2 1 2 2 1 1 2 0 1 1 2 2 0 0 1 1 2 2 1 1 0 1 1 0 1 1 0 0 0 1 0 1 0 0 1 1 0 1 0 1 0 0 2 0 1 0 0 0 0 0 1 0 0 0 2 1 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 1 0 0 1 1 1 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 1 2 0 0 0 0 0 0 1 1 0 0 0 2 0 2 0 1 0 1 1 1 0 2 0 0 2 0 1 0 0 0 0 1 2 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 1 1 0 0 0 0 0 0 0 0 2 0 0 0 2 0 0 0 0 2 1 0 1 1 1 0 0 1 1]))
#'user/sp
user=&gt; (def in (matrix [1 1 2 2 1 2 0 0 0 0 1 2 1 2 1 2 2 0 2 0 0 2 0 0 1 0 0 0 1 2 0 1 1 0 1 0 0 1 0 0 0 0 0 1 0 1 0 2 2 0 1 1 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 1 1 0 0 0 0 0 0 1 0 0 2 0 0 2 0 2 0 2 1 0 2 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 2 0 1 1 0 0 0 1 0 1 2 1 1 2 1 1 1 1 1 1 2 1 1 2 1 0 0 0 0 0 2 1 0 1 0 0 0 0 2 0 0 0 0 0 0 2 0 2 0 0 0 0 0 1 0 0 0 0 0 1 0 1 1 1 1 0 0 2 0 0 0 0 0 2 1 0 0 0 2 0 0 0 2 0 0 0 0 0 0 0 0 1 2 1 1 2 2 2 0 1 0 2 1 0 1 1 1 0 1 0 1 0 2 0 1 0 1 0 0 1 1 0 0 0 0 2 0 0]))
#'user/in
user=&gt; (def case (matrix [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]))
#'user/case
user=&gt; (def X (bind-columns (repeat (length sp) 1) sp in))
#'user/X
user=&gt; (glm case X 
  :bstart (matrix [0 1 1]) 
  :family :binomial 
  :eps 0.001)
[-1.7078
 1.1972
 0.4182]

</code></pre>
<p>&nbsp;</p>]]></description><wfw:commentRss>http://www.ccri.com/blog/rss-comments-entry-6724641.xml</wfw:commentRss></item><item><title>Monte Carlo Pi calc</title><dc:creator>CCRI</dc:creator><pubDate>Wed, 27 Jan 2010 17:26:45 +0000</pubDate><link>http://www.ccri.com/blog/2010/1/27/monte-carlo-pi-calc.html</link><guid isPermaLink="false">385495:5586438:6443570</guid><description><![CDATA[<p>What is the first app that you code up in a new language that you are learning? &nbsp;I imagine most people start with the canonical "Hello World" and then move on to their own specific app. &nbsp;A colleague of mine always codes up the Mandelbrot set which typically involves implementing a complex number class with its associated operations - good for OO languages. &nbsp;For mathematical and statistical languages and APIs, I always start with Monte Carlo PI calc, a simple variant of <a href="http://en.wikipedia.org/wiki/Buffon's_needle">Buffon's Needle</a> problem. &nbsp;The algorithm samples n points from a unit square and then computes the ratio of points that fall within an inscribed circle of radius .5 to the total number of samples. &nbsp;This ratio should approach the area of the inscribed circle. &nbsp;Therefore, PI can be computed as the simulated ratio divided by 0.25.</p>
<p>To continue learning Clojure and Incanter, I implemented the algorithm as follows:</p>
<pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee; font-size: 12px; border: 1px dashed #999999; line-height: 14px; padding: 5px; overflow: auto; width: 100%;"><code>(defn mc-pi-calc [n] (/ (count (filter #(&lt;= %1 0.5)
 (map #(euclidean-distance (vec %1) [0.5 0.5]) 
   (partition 2 (sample-uniform (* 2 n)))))) 
     (* n 0.5 0.5)))</code></pre>
<pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee; font-size: 12px; border: 1px dashed #999999; line-height: 14px; padding: 5px; overflow: auto; width: 100%;"><code>user=&gt; (mc-pi-calc 10000)
3.1432</code></pre>
<p>The function takes the number of samples as input and uses the sample-uniform function to generate n points in the unit square. &nbsp;Then, it counts the number of points that fall within a circle inscribed in the unit square using the euclidean distance function and divides this count by the total number of samples to get the area of the circle to area of the square ratio. &nbsp;From this, the value of PI is easily calculated.</p>
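<p>For readers who do not have Incanter at hand, the same steps can be sketched in plain Python. &nbsp;This is an illustrative translation, not the code used for the post; the seeded generator passed in as <code>rng</code> is an assumption added so that runs are repeatable.</p>

```python
import math
import random

def mc_pi_calc(n, rng):
    """Estimate pi from n uniform points in the unit square.

    Counts points within distance 0.5 of the centre (0.5, 0.5), then divides
    the hit ratio by the circle's area factor r^2 = 0.25.
    """
    inside = sum(
        1 for _ in range(n)
        if math.hypot(rng.random() - 0.5, rng.random() - 0.5) <= 0.5
    )
    return inside / (n * 0.25)

print(mc_pi_calc(100000, random.Random(1)))
```

<p>The estimate is always a multiple of 4/n, which is one reason small sample sizes give such coarse values.</p>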
<div>
<p>The simulation can be visualized using a scatter plot from Incanter's charts API.</p>
<pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee; font-size: 12px; border: 1px dashed #999999; line-height: 14px; padding: 5px; overflow: auto; width: 100%;"><code>;; define the sample points
(def data (trans (map vec (partition 2 
    (sample-uniform (* 2 10000))))))

;; plot the sample points
(def p (scatter-plot (first data) (second data)))

;; overlay the points in the circle
(def data2 (trans 
   (filter #(&gt;= 0.5 
      (euclidean-distance %1 [0.5 0.5])) 
   (trans data))))
(add-points p (first data2) (second data2))

;; view the resulting plot
(view p)
</code></pre>
<p>&nbsp;The code above produces the following chart.</p>
<p>&nbsp;<span class="full-image-block ssNonEditable"><span><img style="width: 400px;" src="http://www.ccri.com/storage/mc-pi-calc.png?__SQUARESPACE_CACHEVERSION=1264635969547" alt="" /></span></span></p>
<p>The Monte Carlo Pi calc algorithm can also nicely illustrate the weak law of large numbers. &nbsp;Clojure's parallel processing pmap function came in handy for this task. &nbsp;The weak law of large numbers states that, for any fixed error tolerance, the probability that the sample average falls within that tolerance of the true mean approaches one as the number of samples approaches infinity. &nbsp;To demonstrate this, we define one sample as one computation of Pi with the number of random points fixed at 100. &nbsp;Then, to estimate the probability that the sample average is within an error of 0.01, we take the average of 10 estimates of Pi, repeat that 100 times, and compute the fraction of averages that fell within the error. &nbsp;We repeat this, raising the number of samples from 10 up to 99. &nbsp;The following code snippet implements the algorithm:</p>
<p>&nbsp;</p>
<pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee; font-size: 12px; border: 1px dashed #999999; line-height: 14px; padding: 5px; overflow: auto; width: 100%;"><code>(def data (pmap (fn [nsamples] 
    (take 100 (repeatedly (fn [] 
       (take nsamples 
           (repeatedly #(mc-pi-calc 100))))))) 
    (range 10 100 1)))

(pmap (fn [exp] (let 
   [d (map #(/ (sum %1) (count %1)) exp)] 
   (/ (count (filter #(&lt;= (abs (- %1 3.14159)) 0.01) d))
      (double (count d))))) 
   data)
</code></pre>
<p>&nbsp;</p>
<p>The pmap function automatically threads the computationally intensive function over the input list using all available processors. &nbsp;This meant my four core laptop churned for a while on this function. &nbsp;The nice part of pmap is that I did not have to do anything special to get this multi-threaded functionality. &nbsp;And there is no reason why pmap couldn't distribute the processing across a map-reduce cluster.</p>
<p><span class="full-image-block ssNonEditable"><span><img style="width: 400px;" src="http://www.ccri.com/storage/lln.png?__SQUARESPACE_CACHEVERSION=1264686781424" alt="" /></span></span></p>
<p>Admittedly, the MC pi calc algorithm does not converge very fast but it does illustrate the simulation capabilities of Clojure and Incanter well.</p>
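<p>The law of large numbers experiment described above can also be sketched sequentially in Python. &nbsp;This is a rough single-threaded illustration, not the pmap version from the post; the function names, the seeds, and the choice of sample counts in the loop are assumptions made for the example.</p>

```python
import math
import random

def mc_pi_calc(n, rng):
    """Estimate pi from n uniform points in the unit square."""
    inside = sum(
        1 for _ in range(n)
        if math.hypot(rng.random() - 0.5, rng.random() - 0.5) <= 0.5
    )
    return inside / (n * 0.25)

def prob_within(nsamples, trials, tol, rng):
    """Fraction of trials whose average of nsamples pi estimates lies within tol of pi."""
    hits = 0
    for _ in range(trials):
        avg = sum(mc_pi_calc(100, rng) for _ in range(nsamples)) / nsamples
        if abs(avg - math.pi) <= tol:
            hits += 1
    return hits / trials

rng = random.Random(42)
for nsamples in (10, 50):
    # 100 trials and a tolerance of 0.01, as in the text above
    print(nsamples, prob_within(nsamples, trials=100, tol=0.01, rng=rng))
```

<p>Averaging more estimates per trial should push the printed fraction toward one, though slowly, which matches the remark about convergence speed.</p>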
</div>]]></description><wfw:commentRss>http://www.ccri.com/blog/rss-comments-entry-6443570.xml</wfw:commentRss></item><item><title>Functional programming and root finding</title><dc:creator>CCRI</dc:creator><pubDate>Sat, 23 Jan 2010 21:05:21 +0000</pubDate><link>http://www.ccri.com/blog/2010/1/23/functional-programming-and-root-finding.html</link><guid isPermaLink="false">385495:5586438:6409489</guid><description><![CDATA[<p>I recently discovered <a href="http://incanter.org/">Incanter</a>&nbsp;which looks really promising for statistical computing on the JVM. &nbsp;Incanter is written in Clojure, a Lisp-like functional programming language for the JVM. &nbsp;We have been using Scala, a hybrid OO/functional programming language for the JVM, in one of our applications but I have yet to find a robust statistics API for Scala. &nbsp;We also use R in the same application. &nbsp;It would be nice to stay within the JVM for statistical procedures rather than communicate between the JVM and an R session.</p>
<p>I wanted to investigate Incanter, but first I needed to wrap my head around Clojure. &nbsp;Folding and nesting are common procedures in functional programming languages and in numerical methods. &nbsp;You can find an excellent discussion of folding <a href="http://alan.dipert.org/post/307586762/polyglot-folding-ruby-clojure-scala">here</a>. &nbsp;Many algorithms follow the iterate and accumulate procedure that naturally maps to the folding and nesting paradigms. &nbsp;<a href="http://en.wikipedia.org/wiki/Newton's_method">Newton's Method</a>&nbsp;for polynomial root finding follows this paradigm as do many optimization algorithms that converge on an extremum. &nbsp;As an avid Mathematica enthusiast, I often used NestList to implement and visualize the steps of nesting algorithms, but I haven't found the equivalent built into the libraries of any of the other three languages. &nbsp;So, to dive in to Clojure and Incanter, I decided to implement the&nbsp;root finding in Clojure, Scala, and R for comparison. &nbsp;First, we need a NestList equivalent. &nbsp;Then, we can use the NestList function to implement the root finding algorithm which we'll test out on the trivial polynomial x^2-5 which obviously has its roots at ~<span style="color: #444444; font-size: 9px; line-height: 14px; white-space: pre-wrap; -webkit-text-size-adjust: none;">&plusmn;2.23606797749979.</span></p>
<p>The root finding algorithm using NestList in Mathematica can be viewed <a href="http://www.wolframalpha.com/input/?i=NestList[(%23+-+((%23^2-5)/(2+%23)))%26,+1,+10]">here</a>.</p>
<p>In Clojure, the implementation and usage is as follows:</p>
<pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee; font-size: 12px; border: 1px dashed #999999; line-height: 14px; padding: 5px; overflow: auto; width: 100%;"><code>(defn nestlist [fn iv n] (take n (iterate fn iv)))
(defn findroot [f df iv n] 
     (nestlist #(- %1 (/ (f %1) (df %1))) (double iv) n))</code></pre>
<pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee; font-size: 12px; border: 1px dashed #999999; line-height: 14px; padding: 5px; overflow: auto; width: 100%;"><code>user=&gt; (findroot #(- (* %1 %1) 5) #(* 2 %1) 1.0 10)
(1.0 3.0 2.3333333333333335 2.238095238095238 2.2360688956433634 2.236067977499978 2.23606797749979 2.23606797749979 2.23606797749979 2.23606797749979)
</code></pre>
<p>The iterate function is exactly what I was looking for. &nbsp;The lazy evaluation of the sequence makes it easy to work with and the definition of the nestlist function is just syntactic sugar on the iterate function.</p>
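<p>For comparison, here is a rough Python sketch of the same idea (Python is not one of the three languages compared in this post). &nbsp;The <code>iterate</code> generator is a hypothetical helper written to mimic Clojure's function of the same name; it is not part of the Python standard library. &nbsp;Laziness comes from the generator, and <code>islice</code> plays the role of <code>take</code>.</p>

```python
from itertools import islice


def iterate(f, x):
    """Lazily yield x, f(x), f(f(x)), ... in the spirit of Clojure's iterate."""
    while True:
        yield x
        x = f(x)


def nestlist(f, iv, n):
    """Return the first n values of the iteration, starting with iv."""
    return list(islice(iterate(f, iv), n))


def findroot(f, df, iv, n):
    """Newton's method: repeatedly apply x - f(x)/df(x)."""
    return nestlist(lambda x: x - f(x) / df(x), float(iv), n)


# Same test problem as the post: x^2 - 5 with derivative 2x, starting at 1.
steps = findroot(lambda x: x * x - 5, lambda x: 2 * x, 1, 10)
print(steps)
```

<p>As in the Clojure version, <code>n</code> counts the values returned, including the initial value.</p>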
<p>In Scala, the implementation of nestlist uses the foldLeft function and accumulates its results in a list:</p>
<pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee; font-size: 12px; border: 1px dashed #999999; line-height: 14px; padding: 5px; overflow: auto; width: 100%;"><code>def nestlist(f:(Double)=&gt;Double, iv: Double, n: Int): List[Double]={
   // start from f(iv), then append f of the most recent value n times
   (0 until n).foldLeft(List(f(iv)))((xs,i) =&gt;
        xs ++ List(f(xs.last)))
}
def findroot(f:(Double)=&gt;Double,
     df:(Double)=&gt;Double,iv:Double,n:Int):List[Double]={
   nestlist((x)=&gt;x-f(x)/df(x),iv,n)
}
</code></pre>
<p>The accumulating list is a bit awkward. &nbsp;I'm sure there are cleaner methods for implementing nestlist in Scala.</p>
<p>Usage of the method (note that this version leaves the initial value out of the result):</p>
<pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee; font-size: 12px; border: 1px dashed #999999; line-height: 14px; padding: 5px; overflow: auto; width: 100%;"><code>findroot((x)=&gt;x*x-5,(x)=&gt;2*x,1,10)
res1: List[Double] = List(3.0, 2.3333333333333335, 2.238095238095238, 2.2360688956433634, 2.236067977499978, 2.23606797749979, 2.23606797749979, 2.23606797749979, 2.23606797749979, 2.23606797749979, 2.23606797749979)</code></pre>
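<p>One possible way around the accumulating list is a scan, that is, a fold that keeps its intermediate results. &nbsp;Here is a minimal sketch in Python rather than Scala, assuming Python 3.8 or later for the <code>initial</code> argument of <code>itertools.accumulate</code>. &nbsp;Unlike the Scala version above, this sketch includes the initial value in the result.</p>

```python
from itertools import accumulate


def nestlist(f, iv, n):
    """Apply f n times, returning iv and every intermediate value (n + 1 items)."""
    # accumulate is a fold that emits each running result; the step index
    # from range(n) is ignored, only the previous value is fed to f.
    return list(accumulate(range(n), lambda x, _: f(x), initial=iv))


def findroot(f, df, iv, n):
    """Newton's method expressed as n nested applications of the update step."""
    return nestlist(lambda x: x - f(x) / df(x), float(iv), n)


steps = findroot(lambda x: x * x - 5, lambda x: 2 * x, 1, 10)
print(steps)
```

<p>Because the scan hands each new value straight to the next step, nothing has to read back from the list being built.</p>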
<p>In R, I had to resort to a very non-FP for loop. &nbsp;I looked at the apply family of functions and replicate but couldn't come up with a good algorithm quickly, so here is the result.</p>
<div id="_mcePaste">
<pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee; font-size: 12px; border: 1px dashed #999999; line-height: 14px; padding: 5px; overflow: auto; width: 100%;"><code>nestlist=function(f, iv, n) { 
   acc=as.vector(iv)
   for(e in 1:n) acc=append(acc, f(acc[e]))
   acc
}
findroot=function(f, df, iv, n) 
   nestlist(function(x) x-f(x)/df(x), iv, n)</code></pre>
</div>
<p>And its usage:</p>
<div id="_mcePaste">
<pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee; font-size: 12px; border: 1px dashed #999999; line-height: 14px; padding: 5px; overflow: auto; width: 100%;"><code>&gt; findroot(function(x) x^2-5, function(x) 2*x, 1, 10)
 [1] 1.000000 3.000000 2.333333 2.238095 2.236069 2.236068 2.236068 2.236068
 [9] 2.236068 2.236068 2.236068
</code></pre>
</div>
<p>Next steps are to start playing around with Incanter and implement some statistical procedures that utilize the root finding algorithm.</p>]]></description><wfw:commentRss>http://www.ccri.com/blog/rss-comments-entry-6409489.xml</wfw:commentRss></item><item><title>Python Static Dictionaries in Nearest Neighbor Queries</title><dc:creator>CCRI</dc:creator><pubDate>Thu, 21 Jan 2010 14:19:33 +0000</pubDate><link>http://www.ccri.com/blog/2010/1/21/python-static-dictionaries-in-nearest-neighbor-queries.html</link><guid isPermaLink="false">385495:5586438:6388976</guid><description><![CDATA[<p>A standard query on geospatial data is the nearest neighbor query, e.g. select the five closest police stations to a given point. &nbsp;The brute force approach to this problem is to join the two tables spatially and sort by distance, limiting the result to the number requested. &nbsp;Of course, for very large tables, this is extremely costly. &nbsp;That's where spatial indexes come in. &nbsp;PostGIS implements <a href="http://postgis.refractions.net/docs/ch04.html#id2717711">GiST indexes</a>, which are a general form of index capable of handling any kind of data with user-defined keys. &nbsp;In the case of a nearest neighbor query, the index is used to narrow down the number of items to perform a distance calculation against. &nbsp;This vastly improves the performance of a nearest neighbor query. &nbsp;A very effective algorithm for nearest neighbor queries can be found <a href="http://www.bostongis.org/?content_name=postgis_nearest_neighbor_generic">here</a>. &nbsp;This algorithm grows the search area in a smart fashion until the right number of features is captured. &nbsp;This reduces the overall number of distance calculations required. &nbsp;The user still needs to provide an initial box in which to search; the smaller the better, since it will grow.</p>
<p>There are situations in which one would like to know all the nearest neighbors between one feature and another feature. &nbsp;This could be useful in areas as diverse as real estate planning (what is the best place to develop given proximity requirements to schools and grocery stores?) and disaster management (what is the most accessible and safest point between these hardest hit areas?). &nbsp;The nearest neighbor query now has to proceed sequentially along all geometries of one feature, computing nearest neighbors to all geometries of another feature. &nbsp;This becomes computationally intractable as the size of the feature tables grows.</p>
<p>An easy way to improve this type of query is to exploit the spatial proximity of adjacent features in a table. &nbsp;Using a smart sequential scan, the algorithm "remembers" the nearest neighbor distance from the last feature and uses that as the initial box size for the next feature. &nbsp;The PL/Python language extension provides a static dictionary, SD, that a stored procedure can use to keep data between calls, which facilitates this kind of operation. &nbsp;This may be quite obvious to everyone and of course it is plainly listed in the Postgres <a href="http://www.postgresql.org/docs/8.4/static/plpython-funcs.html">documentation</a>, but somehow I still managed to overlook it for the longest time while trying to solve this problem. &nbsp;The nearest neighbor distance can be stored and retrieved as in the following code block:</p>
<pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee; font-size: 12px; border: 1px dashed #999999; line-height: 14px; padding: 5px; overflow: auto; width: 100%;"><code>create or replace function nn(geom geometry, 
   featuretable text, 
   geomcol text, 
   initdist double precision, n int) 
   returns double precision
AS $$
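    # SD is the per-function static dictionary provided by PL/Python
    curdist = initdist
    key = featuretable
    if key in SD:
            # reuse the distance remembered from the previous call
            curdist = float(SD[key])
    else:
            SD[key] = initdist

    # perform nearest neighbor query using curdist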

$$ LANGUAGE plpythonu;
</code></pre>
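<p>Outside the database, the remembered-distance idea can be sketched in plain Python. &nbsp;Everything here is an assumption for illustration: the module-level <code>SD</code> dict stands in for PL/Python's static dictionary, <code>nearest</code> is a hypothetical helper, and a brute-force distance scan replaces the indexed spatial query.</p>

```python
import math

# Stand-in for PL/Python's SD dictionary (an assumption for illustration).
SD = {}


def nearest(points, query, n, key, initdist):
    """Return the n points closest to query, growing a search radius as needed.

    The radius starts at the distance remembered under key (or initdist on
    the first call) and doubles until at least n points fall inside it.
    """
    if not initdist > 0:
        raise ValueError("initdist must be positive")
    radius = SD.get(key, initdist)
    while True:
        # Brute-force stand-in for the index lookup: keep candidates in range.
        hits = sorted((math.dist(query, p), p) for p in points
                      if radius >= math.dist(query, p))
        if len(hits) >= n or len(hits) == len(points):
            break
        # Too few candidates: grow the search area and try again.
        radius = max(radius * 2, initdist)
    result = hits[:n]
    if result:
        # Remember the distance to the farthest returned neighbor.
        SD[key] = result[-1][0]
    return [p for _, p in result]


stations = [(0, 0), (1, 0), (5, 5), (10, 10)]
first = nearest(stations, (0.2, 0), 2, "stations", 0.5)
second = nearest(stations, (0.3, 0), 2, "stations", 0.5)
print(first, second, SD)
```

<p>The second call starts from the distance stored by the first, so for a nearby query point it finds enough candidates without having to grow the search area.</p>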
<p>Keep in mind that this is not very thread-safe. &nbsp;Two simultaneous nearest neighbor queries on the same table will interfere with each other's stored distance. &nbsp;I'd imagine that one could use some data about the query plan to store the distance uniquely, but that is beyond my Postgres skills for the moment.</p>]]></description><wfw:commentRss>http://www.ccri.com/blog/rss-comments-entry-6388976.xml</wfw:commentRss></item><item><title>Median Age as Predictor Variable</title><dc:creator>CCRI</dc:creator><pubDate>Mon, 18 Jan 2010 22:12:10 +0000</pubDate><link>http://www.ccri.com/blog/2010/1/18/median-age-as-predictor-variable.html</link><guid isPermaLink="false">385495:5586438:6362312</guid><description><![CDATA[<p>There is a ton of information in the TIGER Census files at the U.S. Census Bureau <a href="http://www.census.gov/geo/www/tiger/tgrshp2009/tgrshp2009.html">site</a>. &nbsp;Unfortunately, it is not easily mapped to geolocations. &nbsp;I had to get the tract level shapefiles and then transform the variables in the data files so that the variables lined up with the tracts. &nbsp;Once I clean up the scripts that I used to do this transformation, I will post them.</p>
<p>The following map shows a section of Philadelphia with zip codes labelled. &nbsp;The median age is shown color coded where lighter green indicates a younger median age and blue means an older median age. &nbsp;I wanted to determine if median age is correlated with homicides. &nbsp;If it turns out that median age is correlated, then law enforcement could use this information to update deployment allocations when a new census comes out. &nbsp;Homicides are marked using a star symbol and are shown for June 2009 to December 2009. &nbsp;</p>
<p><br /><span class="thumbnail-image-block ssNonEditable"><span><a href="javascript:showFullImage('/display/ShowImage?imageUrl=%2Fstorage%2Fwithoutpred.png%3F__SQUARESPACE_CACHEVERSION%3D1263852778659',976,1656);"><img src="http://www.ccri.com/storage/thumbnails/4165020-5424179-thumbnail.jpg?__SQUARESPACE_CACHEVERSION=1263852778660" alt="" /></a></span></span></p>
<p>When I included median age in my model, it came out as a significant predictor. &nbsp;I generated a prediction for the following week using the model that includes median age.</p>
<p><span class="thumbnail-image-block ssNonEditable"><span><a href="javascript:showFullImage('/display/ShowImage?imageUrl=%2Fstorage%2Fheatmap.png%3F__SQUARESPACE_CACHEVERSION%3D1263853313454',976,1656);"><img src="http://www.ccri.com/storage/thumbnails/4165020-5424262-thumbnail.jpg?__SQUARESPACE_CACHEVERSION=1263853313455" alt="" /></a></span></span></p>
<p>A visual inspection suggests that incidents cluster on lighter green tracts (lower median age), and the prediction follows the same pattern because the model treats median age as a significant predictor variable. &nbsp;This analysis is a bit quick and dirty since I spent so much time transforming the census data. &nbsp;I will post a more rigorous analysis of median age and other census variables as time allows.</p>]]></description><wfw:commentRss>http://www.ccri.com/blog/rss-comments-entry-6362312.xml</wfw:commentRss></item><item><title>Converting Lat/Lon to Zip Code</title><dc:creator>CCRI</dc:creator><pubDate>Tue, 12 Jan 2010 21:07:52 +0000</pubDate><link>http://www.ccri.com/blog/2010/1/12/converting-latlon-to-zip-code.html</link><guid isPermaLink="false">385495:5586438:6305031</guid><description><![CDATA[<p>I noticed a question on the Analytics X Prize forum about how to determine the zip code for homicides with latitude and longitude values. &nbsp;While there are a plethora of online tools (Google Maps, etc.) that will do this for you, I thought I'd describe a simple way to do it using PostgreSQL/PostGIS, as it illustrates one aspect of the multitude of open source tools that aid in spatial analysis. &nbsp;Also, the described method can be easily automated in combination with a shell script and some db insert triggers.</p>
<p>First, I retrieved the incident data from the resource described in an earlier post. &nbsp;After some awk and sed wrangling, I got the data into a format where it could be imported into a PostgreSQL table with the following structure:</p>
<pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee; font-size: 12px; border: 1px dashed #999999; line-height: 14px; padding: 5px; overflow: auto; width: 100%;"><code>philly=# \d incidents 
        Table "public.incidents"
  Column   |           Type           | Modifiers 
-----------+--------------------------+-----------
 id        | bigint                   | 
 date      | timestamp with time zone | 
 geom      | geometry                 | 
 zip       | integer                  | 
Indexes:
    "inc_gist_idx" gist (geom)</code></pre>
<p>Notice that there is a geometry column which specifies the geocoded location of the homicide. &nbsp;The data came down projected using SRID 26918 (NAD83 / UTM zone 18N). &nbsp;I had to reproject the zip code geometries as they came as unprojected lat/lon. &nbsp;The zip code table (which I retrieved from the source listed in an earlier post) had the following structure:</p>
<pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee; font-size: 12px; border: 1px dashed #999999; line-height: 14px; padding: 5px; overflow: auto; width: 100%;"><code>philly=# \d philly
             Table "public.philly"
   Column   |         Type          | Modifiers 
------------+-----------------------+-----------
 gid        | integer               | 
 area       | numeric               | 
 perimeter  | numeric               | 
 zt42_d00_  | bigint                | 
 zt42_d00_i | bigint                | 
 zcta       | character varying(5)  | 
 name       | character varying(90) | 
 lsad       | character varying(2)  | 
 lsad_trans | character varying(50) | 
 the_geom   | geometry              | 
Indexes:
    "philly_gist_idx" gist (the_geom)


</code></pre>
<p>Now, the zip code column of the incidents table is empty. &nbsp;I used the following update statement to populate it with the zip code that each incident falls in:</p>
<pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee; font-size: 12px; border: 1px dashed #999999; line-height: 14px; padding: 5px; overflow: auto; width: 100%;"><code>update incidents set zip=
    (select cast(name as integer) from philly 
     where contains(transform(the_geom,26918),
           geom));
</code></pre>
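<p>To make the containment test concrete, here is a minimal Python sketch of a point-in-polygon check using ray casting. &nbsp;It is an illustration only, not how PostGIS implements <code>contains</code>; the zone names and square polygons are made up, and it assumes the points and polygons already share one coordinate system, so the reprojection step is skipped.</p>

```python
def contains(polygon, point):
    """Ray casting: is point inside polygon, given as a list of (x, y) vertices?"""
    px, py = point
    inside = False
    # Walk each edge, pairing every vertex with the next and wrapping around.
    for (x1, y1), (x2, y2) in zip(polygon, polygon[1:] + polygon[:1]):
        # Does the edge straddle the horizontal line through the point?
        if (y1 > py) != (y2 > py):
            # x coordinate where the edge crosses that line
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            # Count only crossings to the right of the point.
            if x_cross > px:
                inside = not inside
    return inside


def assign_zone(zones, point):
    """Return the name of the first zone whose polygon contains point, else None."""
    for name, polygon in zones.items():
        if contains(polygon, point):
            return name
    return None


# Hypothetical zones: two adjacent squares, not real zip code boundaries.
zones = {
    "A": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "B": [(2, 0), (4, 0), (4, 2), (2, 2)],
}
print(assign_zone(zones, (1, 1)), assign_zone(zones, (3, 1)))
```

<p>A point that falls in no polygon yields <code>None</code>, much as the subquery in the SQL leaves the zip column null when no polygon matches.</p>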
<p>The update statement above selects, for each incident, the zip code name from the zip code table whose polygon contains the incident point.</p>]]></description><wfw:commentRss>http://www.ccri.com/blog/rss-comments-entry-6305031.xml</wfw:commentRss></item></channel></rss>
