<?xml version="1.0" encoding="UTF-8"?>
<!--Generated by Squarespace Site Server v5.11.5 (http://www.squarespace.com/) on Fri, 30 Jul 2010 11:22:10 GMT--><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0"><channel><title>Blog</title><link>http://www.ccri.com/blog/</link><description></description><lastBuildDate>Fri, 02 Apr 2010 16:07:41 +0000</lastBuildDate><copyright></copyright><language>en-US</language><generator>Squarespace Site Server v5.11.5 (http://www.squarespace.com/)</generator><item><title>Latent Semantic Analysis in Solr using Clojure</title><dc:creator>CCRI</dc:creator><pubDate>Fri, 02 Apr 2010 15:14:07 +0000</pubDate><link>http://www.ccri.com/blog/2010/4/2/latent-semantic-analysis-in-solr-using-clojure.html</link><guid isPermaLink="false">385495:5586438:7211797</guid><description><![CDATA[<p>I recently pushed a very alpha <a href="http://github.com/algoriffic/lsa4solr">Solr plugin</a>&nbsp;to GitHub that does unsupervised clustering on unstructured text documents. &nbsp;The plugin is written in Clojure and utilizes the Incanter and associated Parallel Colt libraries. &nbsp;Solr/Lucene builds an inverted index of term to document mappings. &nbsp;This inverted index is exploited to perform <a href="http://en.wikipedia.org/wiki/Latent_semantic_analysis">Latent Semantic Analysis</a>. &nbsp;In a nutshell, LSA attempts to extract concepts from a term-document matrix. &nbsp;A term-document matrix contains elements that indicate the frequency or some weighting of the frequency of terms in a document. &nbsp;The key to LSA is rank reduction which is performed by extracting the <a href="http://en.wikipedia.org/wiki/Singular_value_decomposition">Singular Value Decomposition</a>&nbsp;of the term-document matrix. 
&nbsp;The k highest singular values are selected from the SVD and the document-concept and term-concept matrices are reduced to rank k. &nbsp;This has the effect of reducing noise due to extraneous words which in turn leads to better clustering. &nbsp;In a subsequent post, I will discuss how to measure the performance of this algorithm.</p>
<p>I have tested the algorithm on the <a href="http://people.csail.mit.edu/jrennie/20Newsgroups/">20 Newsgroups</a>&nbsp;data set. &nbsp;I started with only two newsgroups to see how well the algorithm performed. &nbsp;The following chart shows the two sets of documents projected into two dimensions of the concept space.</p>
<p><span class="full-image-block ssNonEditable"><span><img style="width: 400px;" src="http://www.ccri.com/storage/science_baseball.png?__SQUARESPACE_CACHEVERSION=1270223265849" alt="" /></span></span></p>
<p>The blue points represent documents from the sci.space newsgroup and the red points from the rec.sport.baseball newsgroup. &nbsp;One can see that the algorithm has effectively separated these two groups in the concept space. &nbsp;There is some overlap in the center as well as some outliers. &nbsp;As a result of the overlap, there was some misclassification. &nbsp;However, the actual clustering implemented so far is not very sophisticated. &nbsp;It simply assigns each document to the most similar centroid based on cosine similarity. &nbsp;A more effective clustering implementation would involve agglomerative clustering or some form of k-means clustering.</p>
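<p>The centroid assignment described above can be sketched in a few lines of Python. &nbsp;This is only an illustration, not the plugin's code: the two-dimensional vectors, the centroid values, and the names <code>cosine</code> and <code>assign</code> are made up for the example.</p>

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def assign(doc, centroids):
    """Return the name of the centroid most similar to the document vector."""
    return max(centroids, key=lambda name: cosine(doc, centroids[name]))

# Hypothetical centroids and documents in a 2-D concept space.
centroids = {"space": [0.9, 0.1], "baseball": [0.1, 0.8]}
docs = [[0.7, 0.2], [0.05, 0.6], [0.4, 0.5]]
labels = [assign(d, centroids) for d in docs]
print(labels)
```

<p>A document that lies between the two centroids, like the third one here, is simply given to whichever centroid is closer in angle, which is where the misclassification in the overlap region comes from.</p>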
<p>Another nice effect of SVD is the ability to extract the concept vectors. &nbsp;These serve to characterize the clusters. &nbsp;One can use these concept vectors to induce labels or to profile clusters. &nbsp;Some of the concept vectors for the above example are:</p>
<ul>
<li>us mission abort firm pegasus data pacastro system communic m contract ventur servic probe commerci market space satellit launch</li>
<li>homer win astro saturday eighth friday sunday hit doublehead klein cub second third home game run score inning doubl</li>
</ul>
<p>These are just two of the concept vectors. &nbsp;There are k concept vectors where k is the specified reduced rank supplied to the LSA algorithm. &nbsp;The next step is to map the cluster centroids to the concept vectors.</p>
<p>Currently, the LSA algorithm uses Parallel Colt's SVD so the matrix algebra is done in-memory. &nbsp;This means that it will only work for small numbers (300-500) of documents. &nbsp;The next step is to investigate moving to Apache Mahout's distributed matrix library.</p>]]></description><wfw:commentRss>http://www.ccri.com/blog/rss-comments-entry-7211797.xml</wfw:commentRss></item><item><title>PostGIS BBOX Query Gotcha</title><dc:creator>CCRI</dc:creator><pubDate>Fri, 19 Feb 2010 14:34:32 +0000</pubDate><link>http://www.ccri.com/blog/2010/2/19/postgis-bbox-query-gotcha.html</link><guid isPermaLink="false">385495:5586438:6755740</guid><description><![CDATA[<p>I got stung by this one after processing quite a bit of data. &nbsp;When doing a nearest neighbor search, I have been leveraging the GiST index functionality in PostGIS. &nbsp;The documentation describes how to <a href="http://postgis.refractions.net/docs/ch04.html#id2806074">take advantage</a> of these indexes by using the &amp;&amp; operator to first find overlapping bounding boxes and then do the compute-intensive calculation on the smaller subset of matched features. &nbsp;However, there is a condition in which the overlapping bounding box operator does not return the nearest features. &nbsp;Perhaps this is well known, but I got hit by it.</p>
<p>Consider the case of searching for the nearest road from a point on a map. &nbsp;A natural way of performing this search is to expand a bounding box around the point and use the &amp;&amp; operator to select the roads that intersect that bounding box. &nbsp;Then, the distance to each of those roads in the returned subset is computed and the minimum distance is returned. &nbsp;Observe the following scenario:</p>
<p><span class="full-image-block ssNonEditable"><span><img style="width: 400px;" src="http://www.ccri.com/storage/gotcha.png?__SQUARESPACE_CACHEVERSION=1266592927783" alt="" /></span></span></p>
<p>The tiny square in the upper right (just below and to the right of the rectangle) is the bbox of the point from which we wish to find the nearest road. &nbsp;The yellow rectangle is the bbox of the nearest road. &nbsp;The large transparent blue rectangle is a bbox of the next nearest road. &nbsp;So the only overlapping bbox for the point is the large rectangle. &nbsp;Thus, the &amp;&amp; operator does not find the nearest road and our calculation is wrong.</p>
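<p>The same situation can be reproduced outside the database. &nbsp;The Python sketch below is an assumption-laden illustration rather than PostGIS code: <code>bbox_overlaps</code> only mimics the &amp;&amp; operator, roads are single line segments, and the coordinates and road names are invented. &nbsp;It also shows a second pass with the box widened to the first distance found.</p>

```python
import math

def bbox_overlaps(a, b):
    """True when two (xmin, ymin, xmax, ymax) boxes intersect, mimicking &&."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def segment_bbox(seg):
    (x1, y1), (x2, y2) = seg
    return (min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2))

def point_segment_distance(p, seg):
    """Distance from point p to the closest point on the segment."""
    (x1, y1), (x2, y2) = seg
    dx, dy = x2 - x1, y2 - y1
    length_sq = dx * dx + dy * dy
    t = 0.0 if length_sq == 0 else ((p[0] - x1) * dx + (p[1] - y1) * dy) / length_sq
    t = max(0.0, min(1.0, t))  # clamp the projection onto the segment
    return math.hypot(p[0] - (x1 + t * dx), p[1] - (y1 + t * dy))

def nearest_in_box(p, half_width, roads):
    """Filter roads by bbox overlap first, then take the minimum true distance."""
    box = (p[0] - half_width, p[1] - half_width, p[0] + half_width, p[1] + half_width)
    hits = {n: s for n, s in roads.items() if bbox_overlaps(box, segment_bbox(s))}
    if not hits:
        return None
    return min((point_segment_distance(p, s), n) for n, s in hits.items())

# Hypothetical roads: a long diagonal with a huge bbox and a short nearby road.
roads = {"diagonal": ((0, 12), (12, 0)), "short": ((12, 10), (14, 10))}
point = (10, 10)

first_pass = nearest_in_box(point, 1.0, roads)             # only "diagonal" overlaps
second_pass = nearest_in_box(point, first_pass[0], roads)  # box widened to that distance
print(first_pass, second_pass)
```

<p>The first pass only sees the diagonal road because its bounding box covers the point, even though the line itself is far away. &nbsp;The second pass uses that distance as the new box size and picks up the short road.</p>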
<p>The solution I have come up with so far is to use the bbox operator, compute the nearest distance, and use a new bbox expanded around the point using the just computed distance. &nbsp;This operation will find any overlapping bbox within that range and will come up with the correct nearest road. &nbsp;I don't like this solution as it requires two &amp;&amp; searches and multiple distance computations - not very optimal.</p>]]></description><wfw:commentRss>http://www.ccri.com/blog/rss-comments-entry-6755740.xml</wfw:commentRss></item><item><title>Incanter and the GLM</title><dc:creator>CCRI</dc:creator><pubDate>Wed, 17 Feb 2010 15:54:06 +0000</pubDate><link>http://www.ccri.com/blog/2010/2/17/incanter-and-the-glm.html</link><guid isPermaLink="false">385495:5586438:6724641</guid><description><![CDATA[<p>I read somewhere that the <a href="http://en.wikipedia.org/wiki/Generalized_linear_model">Generalized Linear Model</a> is the "workhorse of statistics" though I cannot seem to find the reference anymore. &nbsp;The workhorse of statistics is so called because it unifies regression for the exponential family of probability distributions which includes Gaussian, Binomial, and Poisson distributions. &nbsp;Instead of modeling the mean of the response variable, GLM models a continuous, differentiable transformation of the mean as a linear model of the predictor variables. &nbsp;This transformation is called the link function and is unique for each distribution in the exponential family. &nbsp;Once the distribution is specified, the model coefficients are determined via maximum likelihood estimation. &nbsp;In particular, <a href="http://en.wikipedia.org/wiki/Iteratively_reweighted_least_squares">iteratively reweighted least squares</a> of the likelihood function has been shown to converge on the MLE.</p>
<p>To implement the GLM in Clojure/Incanter, we first need to implement the IRLS algorithm. &nbsp;If we assume that we know the link function (and its inverse, derivative, and the weight function), then IRLS is implemented as follows:</p>
<pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee; font-size: 12px; border: 1px dashed #999999; line-height: 14px; padding: 5px; overflow: auto; width: 100%;"><code>(defn irls [y X B invlink dlink weight eps]
  (let [
    _irls (fn [Bnext] 
          (let [
        eta (mmult X Bnext)
        mu (invlink eta)
        z (plus eta (mult (minus y mu) (dlink mu)))
        W (diag (weight mu))]
        (mmult 
         (solve (mmult (trans X) W X)) 
         (trans X) W z)))
    ]
    (last 
     (last 
      (take-while 
       (fn [x] (&gt; (euclidean-distance 
           (first x) 
           (last x)) eps)) 
       (partition 2 1 (iterate _irls B)))))))</code></pre>
<p>In the above code, we define the update step as an internal function of the current coefficient vector. &nbsp;Then, we iterate over an infinite sequence of updates and stop once the Euclidean distance between successive iterates falls below epsilon.</p>
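<p>As a cross-check on the update step, here is a rough pure-Python sketch of the same iteration for the binomial family with the logit link. &nbsp;It is not a translation of the Incanter code: the helper <code>solve</code>, the function name <code>irls_logistic</code>, the <code>max_iter</code> cap, and the toy data are all assumptions made for the example. &nbsp;At convergence the estimate should satisfy the likelihood equations, which is what the final lines inspect.</p>

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def irls_logistic(y, X, beta, eps=1e-8, max_iter=50):
    """IRLS for logistic regression; X is a list of rows including the intercept."""
    p = len(beta)
    for _ in range(max_iter):
        eta = [sum(row[j] * beta[j] for j in range(p)) for row in X]
        mu = [1.0 / (1.0 + math.exp(-e)) for e in eta]
        w = [m * (1.0 - m) for m in mu]  # binomial weight function
        # Working response: eta + (y - mu) * dlink(mu), with dlink(mu) = 1 / w.
        z = [e + (yi - m) / wi for e, yi, m, wi in zip(eta, y, mu, w)]
        xtwx = [[sum(wi * r[j] * r[k] for wi, r in zip(w, X)) for k in range(p)]
                for j in range(p)]
        xtwz = [sum(wi * r[j] * zi for wi, r, zi in zip(w, X, z)) for j in range(p)]
        new_beta = solve(xtwx, xtwz)
        step = math.sqrt(sum((nb - b) ** 2 for nb, b in zip(new_beta, beta)))
        beta = new_beta
        if step < eps:
            break
    return beta

# Toy, non-separable data invented for the example.
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0, 0, 1, 0, 1, 0, 1, 1]
beta_hat = irls_logistic(ys, [[1.0, float(x)] for x in xs], [0.0, 0.0])

# Score equations: both sums should be close to zero at the MLE.
mu_hat = [1.0 / (1.0 + math.exp(-(beta_hat[0] + beta_hat[1] * x))) for x in xs]
print(sum(y - m for y, m in zip(ys, mu_hat)),
      sum(x * (y - m) for x, y, m in zip(xs, ys, mu_hat)))
```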
<p>Next, we need to define the link functions and other associated functions of each member of the exponential family of distributions. &nbsp;I have shown Gaussian and Binomial&nbsp;distributions below:</p>
<pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee; font-size: 12px; border: 1px dashed #999999; line-height: 14px; padding: 5px; overflow: auto; width: 100%;"><code>(defstruct family :link :invlink :dlink :weight)
(def families
     {
     :gaussian (struct-map family
           :link (fn [x] x)
           :invlink (fn [x] x)
           :dlink (fn [x] 1)
           :weight (fn [mu] (repeat (length mu) 1)))
     :binomial (struct-map family
           :link (fn [x] (log (div x (minus 1 x))))
           :invlink (fn [x] (div (exp x) (plus 1 (exp x))))
           :dlink (fn [x] (div 1 (mult x (minus 1 x))))
           :weight (fn [mu] (to-vect (mult mu (minus 1 mu)))))
      })
</code></pre>
<p>I have used the struct-map technique from Clojure which gives me a sort of family type. &nbsp;Additional families would be specified here. &nbsp;Now, similar to R, we can pass the family type to a general GLM function and have one estimation technique (the IRLS defined above) for all families. &nbsp;The GLM function is shown:</p>
<pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee; font-size: 12px; border: 1px dashed #999999; line-height: 14px; padding: 5px; overflow: auto; width: 100%;"><code>(defn glm 
  ([y X &amp; opts]
      (let [opts (when opts (apply assoc {} opts))
       family (or (families (:family opts)) 
              (:gaussian families))
       intercept (or (:intercept opts) true)
       eps (or (:eps opts) 0.01)
       bstart (:bstart opts)]
       (irls y 
         X 
         bstart (:invlink family) 
         (:dlink family) 
         (:weight family) 
         eps))))
</code></pre>
<p>The GLM function simply delegates to the IRLS function with the distribution specific link, inverse link, etc functions.</p>
<p>To test the GLM, I used the example from the Incanter linear-model documentation:</p>
<pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee; font-size: 12px; border: 1px dashed #999999; line-height: 14px; padding: 5px; overflow: auto; width: 100%;"><code>user=&gt; (use '(incanter core stats datasets charts))
nil
user=&gt; (def iris 
  (to-matrix (get-dataset :iris) :dummies true))
#'user/iris
user=&gt; (def y (sel iris :cols 0))
#'user/y
user=&gt; (def x (sel iris :cols (range 1 6)))
#'user/x
user=&gt; (def iris-lm (linear-model y x))
#'user/iris-lm
user=&gt; (:coefs iris-lm)
(2.171266292153149 0.4958889383890437 0.8292439122349187 -0.31515517332664444 -1.0234978144907245 -0.7235619577805039)
</code></pre>
<p>Now, does the GLM with the Gaussian family give the same coefficients? &nbsp;First, we add an intercept column.</p>
<pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee; font-size: 12px; border: 1px dashed #999999; line-height: 14px; padding: 5px; overflow: auto; width: 100%;"><code>user=&gt; (def x (bind-columns (repeat 150 1) x))
#'user/x
user=&gt; (glm y x 
     :bstart (matrix [1 1 1 1 1 1]) 
     :family :gaussian)
[ 2.1713
 0.4959
 0.8292
-0.3152
-1.0235
-0.7236]
</code></pre>
<p>Finally, to test the binomial family, I used the "infert" dataset from R:</p>
<pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee; font-size: 12px; border: 1px dashed #999999; line-height: 14px; padding: 5px; overflow: auto; width: 100%;"><code>user=&gt; (def sp (matrix [2 0 0 0 1 1 0 0 1 0 0 1 0 0 1 0 1 1 1 1 1 0 2 1 1 2 2 2 2 0 1 0 0 2 0 2 1 2 0 1 2 0 0 1 0 0 2 0 0 2 2 2 1 1 2 2 0 2 1 2 2 1 1 2 0 1 1 2 2 0 0 1 1 2 2 1 1 0 1 1 0 1 1 0 0 0 1 0 1 0 0 1 1 0 1 0 1 0 0 2 0 1 0 0 0 0 0 1 0 0 0 2 1 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 1 0 0 1 1 1 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 1 2 0 0 0 0 0 0 1 1 0 0 0 2 0 2 0 1 0 1 1 1 0 2 0 0 2 0 1 0 0 0 0 1 2 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 1 1 0 0 0 0 0 0 0 0 2 0 0 0 2 0 0 0 0 2 1 0 1 1 1 0 0 1 1]))
#'user/sp
user=&gt; (def in (matrix [1 1 2 2 1 2 0 0 0 0 1 2 1 2 1 2 2 0 2 0 0 2 0 0 1 0 0 0 1 2 0 1 1 0 1 0 0 1 0 0 0 0 0 1 0 1 0 2 2 0 1 1 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 1 1 0 0 0 0 0 0 1 0 0 2 0 0 2 0 2 0 2 1 0 2 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 2 0 1 1 0 0 0 1 0 1 2 1 1 2 1 1 1 1 1 1 2 1 1 2 1 0 0 0 0 0 2 1 0 1 0 0 0 0 2 0 0 0 0 0 0 2 0 2 0 0 0 0 0 1 0 0 0 0 0 1 0 1 1 1 1 0 0 2 0 0 0 0 0 2 1 0 0 0 2 0 0 0 2 0 0 0 0 0 0 0 0 1 2 1 1 2 2 2 0 1 0 2 1 0 1 1 1 0 1 0 1 0 2 0 1 0 1 0 0 1 1 0 0 0 0 2 0 0]))
#'user/in
user=&gt; (def case (matrix [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]))
#'user/case
user=&gt; (def X (bind-columns (repeat (length sp) 1) sp in))
#'user/X
user=&gt; (glm case X 
  :bstart (matrix [0 1 1]) 
  :family :binomial 
  :eps 0.001)
[-1.7078
 1.1972
 0.4182]

</code></pre>
<p>&nbsp;</p>]]></description><wfw:commentRss>http://www.ccri.com/blog/rss-comments-entry-6724641.xml</wfw:commentRss></item><item><title>Monte Carlo Pi calc</title><dc:creator>CCRI</dc:creator><pubDate>Wed, 27 Jan 2010 17:26:45 +0000</pubDate><link>http://www.ccri.com/blog/2010/1/27/monte-carlo-pi-calc.html</link><guid isPermaLink="false">385495:5586438:6443570</guid><description><![CDATA[<p>What is the first app that you code up in a new language that you are learning? &nbsp;I imagine most people start with the canonical "Hello World" and then move on to their own specific app. &nbsp;A colleague of mine always codes up the Mandelbrot set which typically involves implementing a complex number class with its associated operations - good for OO languages. &nbsp;For mathematical and statistical languages and APIs, I always start with Monte Carlo PI calc, a simple variant of <a href="http://en.wikipedia.org/wiki/Buffon's_needle">Buffon's Needle</a> problem. &nbsp;The algorithm samples n points from a unit square and then computes the ratio of points that fall within an inscribed circle of radius .5 to the total number of samples. &nbsp;This ratio should approach the area of the inscribed circle. &nbsp;Therefore, PI can be computed as the simulated ratio divided by 0.25.</p>
<p>To continue learning Clojure and Incanter, I implemented the algorithm as follows:</p>
<pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee; font-size: 12px; border: 1px dashed #999999; line-height: 14px; padding: 5px; overflow: auto; width: 100%;"><code>(defn mc-pi-calc [n] (/ (count (filter #(&lt;= %1 0.5)
 (map #(euclidean-distance (vec %1) [0.5 0.5]) 
   (partition 2 (sample-uniform (* 2 n)))))) 
     (* n 0.5 0.5)))</code></pre>
<pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee; font-size: 12px; border: 1px dashed #999999; line-height: 14px; padding: 5px; overflow: auto; width: 100%;"><code>user=&gt; (mc-pi-calc 10000)
3.1432</code></pre>
<p>The function takes the number of samples as input and uses the sample-uniform function to generate n points in the unit square. &nbsp;Then, it counts the number of points that fall within a circle inscribed in the unit square using the euclidean distance function and divides this count by the total number of samples to get the area of the circle to area of the square ratio. &nbsp;From this, the value of PI is easily calculated.</p>
<div>
<p>The simulation can be visualized using a scatter plot from Incanter's charts API.</p>
<pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee; font-size: 12px; border: 1px dashed #999999; line-height: 14px; padding: 5px; overflow: auto; width: 100%;"><code>;; define the sample points
(def data (trans (map vec (partition 2 
    (sample-uniform (* 2 10000))))))

;; plot the sample points
(def p (scatter-plot (first data) (second data)))

;; overlay the points in the circle
(def data2 (trans 
   (filter #(&gt;= 0.5 
      (euclidean-distance %1 [0.5 0.5])) 
   (trans data))))
(add-points p (first data2) (second data2))

;; view the resulting plot
(view p)
</code></pre>
<p>&nbsp;The code above produces the following chart.</p>
<p>&nbsp;<span class="full-image-block ssNonEditable"><span><img style="width: 400px;" src="http://www.ccri.com/storage/mc-pi-calc.png?__SQUARESPACE_CACHEVERSION=1264635969547" alt="" /></span></span></p>
<p>The Monte Carlo Pi calc algorithm can also nicely illustrate the weak law of large numbers. &nbsp;Clojure's parallel processing pmap function came in handy for this task. &nbsp;The weak law of large numbers states that, for any fixed error, the probability that the sample average lies within that error of the true mean approaches one as the number of samples approaches infinity. &nbsp;To demonstrate this, we define one sample as one computation of Pi with the number of random points fixed at 100. &nbsp;Then, to estimate the probability that the sample average is within an error of 0.01, we take the average of 10 estimates of Pi, repeat that 100 times, and compute the fraction of averages that fall within the error. &nbsp;We repeat this, raising the number of samples per average from 10 up to 99. &nbsp;The following code snippet implements the algorithm:</p>
<p>&nbsp;</p>
<pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee; font-size: 12px; border: 1px dashed #999999; line-height: 14px; padding: 5px; overflow: auto; width: 100%;"><code>(def data (pmap (fn [nsamples] 
    (take 100 (repeatedly (fn [] 
       (take nsamples 
           (repeatedly #(mc-pi-calc 100))))))) 
    (range 10 100 1)))

(pmap (fn [exp] (let 
   [d (map #(/ (sum %1) (count %1)) exp)] 
   (/ (count (filter #(&lt;= (abs (- %1 3.14159)) 0.01) d))
      (double (count d))))) 
   data)
</code></pre>
<p>&nbsp;</p>
<p>The pmap function automatically threads the computationally intensive function over the input list using all available processors. &nbsp;This meant my four core laptop churned for a while on this function. &nbsp;The nice part of pmap is that I did not have to do anything special to get this multi-threaded functionality. &nbsp;And there is no reason why pmap couldn't distribute the processing across a map-reduce cluster.</p>
<p><span class="full-image-block ssNonEditable"><span><img style="width: 400px;" src="http://www.ccri.com/storage/lln.png?__SQUARESPACE_CACHEVERSION=1264686781424" alt="" /></span></span></p>
<p>Admittedly, the MC pi calc algorithm does not converge very fast but it does illustrate the simulation capabilities of Clojure and Incanter well.</p>
</div>]]></description><wfw:commentRss>http://www.ccri.com/blog/rss-comments-entry-6443570.xml</wfw:commentRss></item><item><title>Functional programming and root finding</title><dc:creator>CCRI</dc:creator><pubDate>Sat, 23 Jan 2010 21:05:21 +0000</pubDate><link>http://www.ccri.com/blog/2010/1/23/functional-programming-and-root-finding.html</link><guid isPermaLink="false">385495:5586438:6409489</guid><description><![CDATA[<p>I recently discovered <a href="http://incanter.org/">Incanter</a>&nbsp;which looks really promising for statistical computing on the JVM. &nbsp;Incanter is written in Clojure, a lisp like functional programming language for the JVM. &nbsp;We have been using Scala, a hybrid OO/functional programming language for the JVM, in one of our applications but I have yet to find a robust statistics API for Scala. &nbsp;We also use R in the same application.  It would be nice to stay within the JVM for statistical procedures rather than communicate between the JVM and an R session.</p>
<p>I wanted to investigate Incanter, but first I needed to wrap my head around Clojure. &nbsp;Folding and nesting are common procedures in functional programming languages and in numerical methods. &nbsp;You can find an excellent discussion of folding <a href="http://alan.dipert.org/post/307586762/polyglot-folding-ruby-clojure-scala">here</a>. &nbsp;Many algorithms follow the iterate and accumulate procedure that naturally maps to the folding and nesting paradigms. &nbsp;<a href="http://en.wikipedia.org/wiki/Newton's_method">Newton's Method</a>&nbsp;for polynomial root finding follows this paradigm as do many optimization algorithms that converge on an extremum. &nbsp;As an avid Mathematica enthusiast, I often used NestList to implement and visualize the steps of nesting algorithms, but I haven't found the equivalent built into the libraries of any of the other three languages. &nbsp;So, to dive into Clojure and Incanter, I decided to implement the&nbsp;root finding in Clojure, Scala, and R for comparison. &nbsp;First, we need a NestList equivalent. &nbsp;Then, we can use the NestList function to implement the root finding algorithm, which we'll test out on the trivial polynomial x^2-5, which has its roots at approximately <span style="color: #444444; font-size: 9px; line-height: 14px; white-space: pre-wrap; -webkit-text-size-adjust: none;">&plusmn;2.23606797749979.</span></p>
<p>The root finding algorithm using NestList in Mathematica can be viewed <a href="http://www.wolframalpha.com/input/?i=NestList[(%23+-+((%23^2-5)/(2+%23)))%26,+1,+10]">here</a>.</p>
<p>In Clojure, the implementation and usage is as follows:</p>
<pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee; font-size: 12px; border: 1px dashed #999999; line-height: 14px; padding: 5px; overflow: auto; width: 100%;"><code>(defn nestlist [fn iv n] (take n (iterate fn iv)))
(defn findroot [f df iv n] 
     (nestlist #(- %1 (/ (f %1) (df %1))) (double iv) n))</code></pre>
<pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee; font-size: 12px; border: 1px dashed #999999; line-height: 14px; padding: 5px; overflow: auto; width: 100%;"><code>user=&gt; (findroot #(- (* %1 %1) 5) #(* 2 %1) 1.0 10)
(1.0 3.0 2.3333333333333335 2.238095238095238 2.2360688956433634 2.236067977499978 2.23606797749979 2.23606797749979 2.23606797749979 2.23606797749979)
</code></pre>
<p>The iterate function is exactly what I was looking for. &nbsp;The lazy evaluation of the sequence makes it easy to work with and the definition of the nestlist function is just syntactic sugar on the iterate function.</p>
<p>In Scala, the implementation of nestlist uses the foldLeft function and accumulates its results in a list:</p>
<pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee; font-size: 12px; border: 1px dashed #999999; line-height: 14px; padding: 5px; overflow: auto; width: 100%;"><code>def nestlist(f:(Double)=&gt;Double, iv: Double, n: Int): List[Double]={
   // apply f to the most recent value (the last element), not the head
   (0 until n).foldLeft(List(f(iv))) { (xs, i) =&gt;
        xs :+ f(xs.last) }
}
def findroot(f:(Double)=&gt;Double,
     df:(Double)=&gt;Double,iv:Double,n:Int):List[Double]={
   nestlist((x)=&gt;x-f(x)/df(x),iv,n)
}
</code></pre>
<p>&nbsp;The accumulating list is a bit awkward. &nbsp;I'm sure there are cleaner methods for implementing nestlist in Scala.</p>
<p>&nbsp;Usage of the method:</p>
<pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee; font-size: 12px; border: 1px dashed #999999; line-height: 14px; padding: 5px; overflow: auto; width: 100%;"><code>findroot((x)=&gt;x*x-5,(x)=&gt;2*x,1,10)
res1: List[Double] = List(3.0, 2.3333333333333335, 2.238095238095238, 2.2360688956433634, 2.236067977499978, 2.23606797749979, 2.23606797749979, 2.23606797749979, 2.23606797749979, 2.23606797749979, 2.23606797749979)</code></pre>
<p>In R, I had to resort to a very non-FP for loop.  I looked at the apply family of functions and replicate but couldn't come up with a good algorithm quickly so here is the result.</p>
<div id="_mcePaste">
<pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee; font-size: 12px; border: 1px dashed #999999; line-height: 14px; padding: 5px; overflow: auto; width: 100%;"><code>nestlist=function(f, iv, n) { 
   acc=as.vector(iv)
   for(e in 1:n) acc=append(acc, f(acc[e]))
   acc
}
findroot=function(f, df, iv, n) 
   nestlist(function(x) x-f(x)/df(x), iv, n)</code></pre>
</div>
<p>And its usage:</p>
<div id="_mcePaste">
<pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee; font-size: 12px; border: 1px dashed #999999; line-height: 14px; padding: 5px; overflow: auto; width: 100%;"><code>&gt; findroot(function(x) x^2-5, function(x) 2*x, 1, 10)
 [1] 1.000000 3.000000 2.333333 2.238095 2.236069 2.236068 2.236068 2.236068
 [9] 2.236068 2.236068 2.236068
</code></pre>
</div>
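<p>For one more point of comparison, a NestList-style helper can be sketched in Python with a generator. &nbsp;Python is not one of the three languages compared above, and the names <code>iterate</code>, <code>nest_list</code>, and <code>find_root</code> are just illustrative. &nbsp;Like Mathematica's NestList, this version returns the starting value followed by n applications of the function.</p>

```python
from itertools import islice

def iterate(f, x):
    """Lazily yield x, f(x), f(f(x)), ... in the spirit of Clojure's iterate."""
    while True:
        yield x
        x = f(x)

def nest_list(f, x0, n):
    """Return [x0, f(x0), ..., f applied n times], so n + 1 values in total."""
    return list(islice(iterate(f, x0), n + 1))

def find_root(f, df, x0, n):
    """Newton's method steps for f with derivative df, starting from x0."""
    return nest_list(lambda x: x - f(x) / df(x), float(x0), n)

steps = find_root(lambda x: x * x - 5, lambda x: 2 * x, 1, 10)
print(steps)
```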
<p>Next steps are to start playing around with Incanter and implement some statistical procedures that utilize the root finding algorithm.</p>]]></description><wfw:commentRss>http://www.ccri.com/blog/rss-comments-entry-6409489.xml</wfw:commentRss></item><item><title>Python Static Dictionaries in Nearest Neighbor Queries</title><dc:creator>CCRI</dc:creator><pubDate>Thu, 21 Jan 2010 14:19:33 +0000</pubDate><link>http://www.ccri.com/blog/2010/1/21/python-static-dictionaries-in-nearest-neighbor-queries.html</link><guid isPermaLink="false">385495:5586438:6388976</guid><description><![CDATA[<p>A standard query on geospatial data is the nearest neighbor query, i.e. Select the five closest police stations from a given point. &nbsp;The brute force approach to this problem is joining the two tables spatially and sorting by distance limiting the result to the number requested. &nbsp;Of course, for very large tables, this is extremely costly. &nbsp;That's where spatial indexes come in. &nbsp;PostGIS implements <a href="http://postgis.refractions.net/docs/ch04.html#id2717711">GiST indexes</a>&nbsp;which are a general form of index that is capable of handling any kind of data with user defined keys. &nbsp;In the case of a nearest neighbor query, the index is used to narrow down the number of items to perform a distance calculation against. &nbsp;This vastly improves the performance of a nearest neighbor query. &nbsp;A very effective algorithm for nearest neighbor queries can be found <a href="http://www.bostongis.org/?content_name=postgis_nearest_neighbor_generic">here</a>. &nbsp;This algorithm effectively grows the search area in a smart fashion until the right number of features are captured. &nbsp;This reduces the overall number of distance calculations required. &nbsp;The user still needs to provide an initial box in which to search, the smaller the better since it will grow.</p>
<p>There are situations in which one would like to know all the nearest neighbors between one feature and another feature. &nbsp;This could be useful in areas as diverse as real estate planning (what is the best place to develop given proximity requirements to schools and grocery stores?) and disaster management (what is the most accessible and safest point between these hardest hit areas?). &nbsp;The nearest neighbor query now has to proceed sequentially along all geometries of one feature computing nearest neighbors to all geometries along another feature. &nbsp;This becomes prohibitively expensive as the size of the feature tables grows.</p>
<p>An easy way to improve this type of query is to exploit the spatial proximity of adjacent features in a table. &nbsp;Using a smart sequential scan, the algorithm "remembers" the nearest neighbor distance from the last feature and uses that as the initial box size for the next feature. &nbsp;The PL/Python language extension provides a static dictionary, SD, that a stored procedure can use to keep data between calls, which makes this kind of operation straightforward. &nbsp;This may be quite obvious to everyone and of course it is plainly listed in the Postgres <a href="http://www.postgresql.org/docs/8.4/static/plpython-funcs.html">documentation</a>, but somehow I still managed to overlook it for the longest time while trying to solve this problem. &nbsp;The nearest neighbor distance can be stored and retrieved as in the following code block:</p>
<p>&nbsp;</p>
<pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee; font-size: 12px; border: 1px dashed #999999; line-height: 14px; padding: 5px; overflow: auto; width: 100%;"><code>create or replace function nn(geom geometry, 
   featuretable text, 
   geomcol text, 
   initdist double precision, n int) 
   returns double precision
AS $$
    curdist = initdist
    key = featuretable
    if key in SD:
            # reuse the distance remembered from the previous call
            curdist = float(SD[key])
    else:
            SD[key] = initdist

    # perform nearest neighbor query using curdist

$$ LANGUAGE plpythonu;
</code></pre>
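<p>To make the "remember the last distance" idea concrete outside the database, here is a small plain-Python sketch. &nbsp;It rests on several assumptions: an ordinary dict stands in for PL/Python's SD, features are bare (x, y) points, and the box-growing rule (doubling) and the names <code>in_box</code> and <code>nearest_distance</code> are invented for the example rather than taken from the stored procedure.</p>

```python
import math

SD = {}  # plain dict standing in for PL/Python's static dictionary

def in_box(point, radius, features):
    """Features whose x and y both lie within radius of the point (a square box)."""
    return [f for f in features
            if abs(f[0] - point[0]) <= radius and abs(f[1] - point[1]) <= radius]

def distance(point, f):
    return math.hypot(f[0] - point[0], f[1] - point[1])

def nearest_distance(point, features, table, init_dist):
    """Nearest-feature distance, seeding the search box from the last answer."""
    if not features or init_dist <= 0:
        return None
    radius = SD.get(table, init_dist)  # reuse the remembered distance if present
    hits = in_box(point, radius, features)
    while not hits:                    # nothing captured yet: grow the box
        radius *= 2
        hits = in_box(point, radius, features)
    best = min(distance(point, f) for f in hits)
    # A closer feature can sit just outside the square box, so re-check at best.
    best = min(distance(point, f) for f in in_box(point, best, features))
    SD[table] = best if best > 0 else init_dist
    return best

# Hypothetical features and query points.
stations = [(0, 0), (5, 5), (9, 1)]
d1 = nearest_distance((1, 1), stations, "stations", 0.5)  # box grows from 0.5
d2 = nearest_distance((4, 4), stations, "stations", 0.5)  # box starts at d1
print(d1, d2, SD)
```

<p>Because neighboring query points tend to have similar nearest-neighbor distances, the second call usually starts with a box that is already about the right size.</p>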
<p>Keep in mind that this is not very thread-safe. &nbsp;Two simultaneous nearest neighbor queries on the same table will interfere with each other's stored distance. &nbsp;I'd imagine that one could use some data about the query plan to store the distance uniquely, but that is beyond my Postgres skills for the moment.</p>]]></description><wfw:commentRss>http://www.ccri.com/blog/rss-comments-entry-6388976.xml</wfw:commentRss></item><item><title>Median Age as Predictor Variable</title><dc:creator>CCRI</dc:creator><pubDate>Mon, 18 Jan 2010 22:12:10 +0000</pubDate><link>http://www.ccri.com/blog/2010/1/18/median-age-as-predictor-variable.html</link><guid isPermaLink="false">385495:5586438:6362312</guid><description><![CDATA[<p>There is a ton of information in the TIGER Census files at the U.S. Gov Census <a href="http://www.census.gov/geo/www/tiger/tgrshp2009/tgrshp2009.html">site</a>. &nbsp;Unfortunately, it is not easily mapped to geolocations. &nbsp;I had to get the tract level shapefiles and then transform the variables in the data files so that the variables lined up with the tracts. &nbsp;Once I clean up the scripts that I used to do this transformation, I will post them.</p>
<p>The following map shows a section of Philadelphia with zip codes labelled. &nbsp;The median age is shown color coded where lighter green indicates a younger median age and blue means an older median age. &nbsp;I wanted to determine if median age is correlated with homicides. &nbsp;If it turns out that median age is correlated, then law enforcement could use this information to update deployment allocations when a new census comes out. &nbsp;Homicides are marked using a star symbol and are shown for June 2009 to December 2009. &nbsp;</p>
<p><span class="thumbnail-image-block ssNonEditable"><span><a href="http://www.ccri.com/storage/withoutpred.png?__SQUARESPACE_CACHEVERSION=1263852778659"><img src="http://www.ccri.com/storage/thumbnails/4165020-5424179-thumbnail.jpg?__SQUARESPACE_CACHEVERSION=1263852778660" alt="" /></a></span></span></p>
<p>When I included median age in my model, it came out as a significant predictor. &nbsp;I generated a prediction for the following week using the model that includes median age.</p>
<p><span class="thumbnail-image-block ssNonEditable"><span><a href="http://www.ccri.com/storage/heatmap.png?__SQUARESPACE_CACHEVERSION=1263853313454"><img src="http://www.ccri.com/storage/thumbnails/4165020-5424262-thumbnail.jpg?__SQUARESPACE_CACHEVERSION=1263853313455" alt="" /></a></span></span></p>
<p>A visual inspection suggests that incidents cluster on the lighter green tracts (lower median age), and the prediction follows the same pattern because the model treats median age as a significant predictor variable. &nbsp;This analysis is a bit quick and dirty since I spent so much time transforming the census data. &nbsp;I will post a more rigorous analysis of median age and other census variables as time allows.</p>]]></description><wfw:commentRss>http://www.ccri.com/blog/rss-comments-entry-6362312.xml</wfw:commentRss></item><item><title>Converting Lat/Lon to Zip Code</title><dc:creator>CCRI</dc:creator><pubDate>Tue, 12 Jan 2010 21:07:52 +0000</pubDate><link>http://www.ccri.com/blog/2010/1/12/converting-latlon-to-zip-code.html</link><guid isPermaLink="false">385495:5586438:6305031</guid><description><![CDATA[<p>I noticed a question on the Analytics X Prize forum about how to determine the zip code for homicides with latitude and longitude values. &nbsp;While there is a plethora of online tools (Google Maps, etc.) that will do this for you, I thought I'd describe a simple way to do it using PostgreSQL/PostGIS, as it illustrates one aspect of the multitude of open source tools that aid in spatial analysis. &nbsp;Also, the described method can be easily automated in combination with a shell script and some db insert triggers.</p>
<p>First, I retrieved the incident data from the resource described in an earlier post. &nbsp;After some awk and sed wrangling, I got the data into a format where it could be imported into a PostgreSQL table with the following structure:</p>
<pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee; font-size: 12px; border: 1px dashed #999999; line-height: 14px; padding: 5px; overflow: auto; width: 100%;"><code>philly=# \d incidents 
        Table "public.incidents"
  Column   |           Type           | Modifiers 
-----------+--------------------------+-----------
 id        | bigint                   | 
 date      | timestamp with time zone | 
 geom      | geometry                 | 
 zip       | integer                  | 
Indexes:
    "inc_gist_idx" gist (geom)</code></pre>
<p>Notice that there is a geometry column that specifies the geocoded location of the homicide. &nbsp;The data came down projected using SRID 26918 (NAD83 / UTM zone 18N). &nbsp;I had to reproject the zip code geometries, as they came as unprojected lat/lon. &nbsp;The zip code table (which I retrieved from the source listed in an earlier post) had the following structure:</p>
<pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee; font-size: 12px; border: 1px dashed #999999; line-height: 14px; padding: 5px; overflow: auto; width: 100%;"><code>philly=# \d philly
             Table "public.philly"
   Column   |         Type          | Modifiers 
------------+-----------------------+-----------
 gid        | integer               | 
 area       | numeric               | 
 perimeter  | numeric               | 
 zt42_d00_  | bigint                | 
 zt42_d00_i | bigint                | 
 zcta       | character varying(5)  | 
 name       | character varying(90) | 
 lsad       | character varying(2)  | 
 lsad_trans | character varying(50) | 
 the_geom   | geometry              | 
Indexes:
    "philly_gist_idx" gist (the_geom)


</code></pre>
<p>At this point, the zip code column of the incidents table is empty. &nbsp;I used the following update statement to populate it with the zip code that each incident falls in:</p>
<pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee; font-size: 12px; border: 1px dashed #999999; line-height: 14px; padding: 5px; overflow: auto; width: 100%;"><code>update incidents set zip=
    (select cast(name as integer) from philly 
     where ST_Contains(ST_Transform(the_geom,26918),
           geom));
</code></pre>
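<p>For readers curious what the containment test in the update statement above does conceptually, the following is a rough Python sketch of a point-in-polygon check using ray casting, followed by a lookup over a dictionary of polygons. &nbsp;It is not how PostGIS implements the function. &nbsp;The names point_in_polygon and assign_zip, and the representation of each zip code as a single ring of (x, y) vertices already in the same projection as the point, are assumptions made for illustration.</p>
<pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee; font-size: 12px; border: 1px dashed #999999; line-height: 14px; padding: 5px; overflow: auto; width: 100%;"><code>
```python
def point_in_polygon(x, y, ring):
    """Ray casting: count edge crossings of a ray going right from (x, y)."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > x:
                # The crossing lies to the right of the point.
                inside = not inside
    return inside


def assign_zip(point, zip_polygons):
    """Return the zip code (as an int) of the first polygon containing point.

    zip_polygons maps a zip code name to a ring of (x, y) vertices
    (hypothetical in-memory stand-in for the zip code table).
    """
    for name, ring in zip_polygons.items():
        if point_in_polygon(point[0], point[1], ring):
            return int(name)
    return None
```
</code></pre>
<p>Points that fall outside every polygon come back as None, which mirrors the null that the subquery leaves in the zip column when no polygon contains the incident.</p>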
<p>The update statement selects, for each incident, the name of the zip code polygon that contains the incident point.</p>]]></description><wfw:commentRss>http://www.ccri.com/blog/rss-comments-entry-6305031.xml</wfw:commentRss></item><item><title>Second Pass at Analytics X Prize</title><dc:creator>CCRI</dc:creator><pubDate>Mon, 11 Jan 2010 18:49:41 +0000</pubDate><link>http://www.ccri.com/blog/2010/1/11/second-pass-at-analytics-x-prize.html</link><guid isPermaLink="false">385495:5586438:6292837</guid><description><![CDATA[<p>For my second attempt at predicting homicides in Philadelphia, I included roads in the model. &nbsp;I got the roads data from the census link in the last post, imported the roads into my PostgreSQL/PostGIS database, and visualized the resulting prediction using Quantum GIS connected to my PostGIS store. &nbsp;The following image was exported from QGIS:</p>
<p>&nbsp;<span class="full-image-block ssNonEditable"><span><img style="width: 400px;" src="http://www.ccri.com/storage/run2.png?__SQUARESPACE_CACHEVERSION=1263243820633" alt="" /></span></span>&nbsp;</p>
<p>The image shows the Philadelphia zip codes, the local roads network of Philadelphia, and the new prediction as a red gradient where the deeper the red the more likely a crime will occur at that location according to the prediction.</p>]]></description><wfw:commentRss>http://www.ccri.com/blog/rss-comments-entry-6292837.xml</wfw:commentRss></item><item><title>Philadelphia Data Resources</title><dc:creator>CCRI</dc:creator><pubDate>Mon, 11 Jan 2010 14:26:11 +0000</pubDate><link>http://www.ccri.com/blog/2010/1/11/philadelphia-data-resources.html</link><guid isPermaLink="false">385495:5586438:6291101</guid><description><![CDATA[<p>The following online sources have proved useful in retrieving spatial data for the Philadelphia area:</p>
<ul>
<li><a href="http://www.pasda.psu.edu/default.asp">http://www.pasda.psu.edu/default.asp</a></li>
<li><a href="http://www2.census.gov/cgi-bin/shapefiles/state-files?state=42">http://www2.census.gov/cgi-bin/shapefiles/state-files?state=42</a></li>
<li><a href="http://citymaps.phila.gov/CrimeMap/MappingServices.asmx/GetCrimePtsByCriteria">http://citymaps.phila.gov/CrimeMap/MappingServices.asmx/GetCrimePtsByCriteria</a></li>
</ul>
<p>I used the following curl command to pull geocoded incident data from the last link above:</p>
<pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee; font-size: 12px; border: 1px dashed #999999; line-height: 14px; padding: 5px; overflow: auto; width: 100%;"><code>curl -H 'Content-type:application/json' -H 'Accept:text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' -b cookie \
  -d '{"crimeclass":"10","from":'"$s"',"to":'"$e"',"maxBounds":"2568448,138838,2850283,362526","curExtentArea":19468800000}' \
  http://citymaps.phila.gov/CrimeMap/MappingServices.asmx/GetCrimePtsByCriteria
</code></pre>
<p>I had to capture a session cookie using Firebug and store it in the file 'cookie'. &nbsp;Also, $s and $e are shell variables holding the start and end dates for the query.</p>
<p>In a followup post, I will discuss how I am using PostgreSQL/PostGIS and Quantum GIS to view and transform the data found on the aforementioned sites into a format useful for analysis.</p>
<p>&nbsp;</p>]]></description><wfw:commentRss>http://www.ccri.com/blog/rss-comments-entry-6291101.xml</wfw:commentRss></item></channel></rss>