Incanter and the GLM


I read somewhere that the Generalized Linear Model (GLM) is the “workhorse of statistics,” though I cannot seem to find the reference anymore.  The GLM earns that title because it unifies regression for the exponential family of probability distributions, which includes the Gaussian, Binomial, and Poisson distributions.  Instead of modeling the mean of the response variable directly, the GLM models a continuous, differentiable transformation of the mean as a linear combination of the predictor variables.  This transformation is called the link function, and each distribution in the exponential family has its own.  Once the distribution is specified, the model coefficients are determined via maximum likelihood estimation; in particular, iteratively reweighted least squares (IRLS) applied to the likelihood has been shown to converge to the MLE.
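Concretely, in standard notation (a sketch added here for reference, not a quotation from any particular text), the GLM links the mean of the response to a linear predictor, and IRLS repeatedly solves a weighted least-squares problem:

g(\mu) = \eta = X\beta, \qquad \mu = \mathrm{E}[y]

z = \eta + (y - \mu)\,g'(\mu), \qquad W = \mathrm{diag}\big(w(\mu)\big), \qquad \beta_{\mathrm{new}} = (X^{\top} W X)^{-1} X^{\top} W z

The IRLS code below computes exactly these quantities: eta, mu, z, and W.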

To implement the GLM in Clojure/Incanter, we first need to implement the IRLS algorithm.  If we assume that we know the link function (along with its inverse, its derivative, and the associated weight function), then IRLS can be implemented as follows:

(defn irls
  "Iteratively reweighted least squares: starting from the coefficient
   vector B, repeat the weighted least-squares update until successive
   estimates are within eps of each other (Euclidean distance)."
  [y X B invlink dlink weight eps]
  (let [_irls (fn [Bnext]
                (let [eta (mmult X Bnext)                           ; linear predictor
                      mu  (invlink eta)                             ; fitted means
                      z   (plus eta (mult (minus y mu) (dlink mu))) ; working response
                      W   (diag (weight mu))]                       ; working weights
                  ;; weighted least-squares update: (X'WX)^-1 X'Wz
                  (mmult (solve (mmult (trans X) W X))
                         (trans X) W z)))]
    (last
      (last
        (take-while
          (fn [x] (> (euclidean-distance (first x) (last x)) eps))
          (partition 2 1 (iterate _irls B)))))))

In the above code, we define the update step as an internal function that maps the current coefficient estimate to the next one.  Then we iterate over an infinite lazy sequence of updates until the Euclidean distance between successive estimates falls below epsilon.
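As a minimal sketch of that convergence idiom on its own (using a hypothetical scalar update function in place of the IRLS step), the pattern looks like this:

(defn converge [step x0 eps]
  (->> (iterate step x0)                                     ; infinite lazy seq of iterates
       (partition 2 1)                                       ; successive pairs
       (take-while (fn [[a b]] (> (Math/abs (- a b)) eps)))  ; keep pairs still far apart
       last                                                  ; the last such pair
       last))                                                ; its newer element

user=> (converge #(/ % 2.0) 1.0 0.01)
0.0078125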

Next, we need to define the link function and the other associated functions for each member of the exponential family of distributions.  The Gaussian and Binomial families are shown below (with a hypothetical Poisson family sketched after them):

(defstruct family :link :invlink :dlink :weight)

(def families
  {:gaussian (struct-map family
               :link    (fn [x] x)                               ; identity link
               :invlink (fn [x] x)
               :dlink   (fn [x] 1)
               :weight  (fn [mu] (repeat (length mu) 1)))
   :binomial (struct-map family
               :link    (fn [x] (log (div x (minus 1 x))))       ; logit link
               :invlink (fn [x] (div (exp x) (plus 1 (exp x))))  ; logistic
               :dlink   (fn [x] (div 1 (mult x (minus 1 x))))
               :weight  (fn [mu] (to-vect (mult mu (minus 1 mu)))))})
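For illustration, a Poisson family (canonical log link) could be added in the same style.  This entry is not part of the original code; it is a sketch based on the standard Poisson GLM quantities, where g(mu) = log(mu), g'(mu) = 1/mu, and the variance equals mu:

(def families
  (assoc families
    :poisson (struct-map family
               :link    (fn [x] (log x))            ; g(mu) = log(mu)
               :invlink (fn [x] (exp x))            ; inverse link: exp(eta)
               :dlink   (fn [x] (div 1 x))          ; g'(mu) = 1/mu
               :weight  (fn [mu] (to-vect mu)))))   ; weight = mu, since Var(mu) = mu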

I have used Clojure's struct-map, which gives me a sort of family type; additional families would be specified here in the same way.  Now, similar to R, we can pass the family to a general GLM function and use one estimation technique (the IRLS defined above) for all families.  The GLM function is shown below:

(defn glm
  "Fit a generalized linear model by delegating to irls with the
   family-specific inverse link, link derivative, and weight functions."
  [y X & opts]
  (let [opts      (when opts (apply assoc {} opts))
        family    (or (families (:family opts))
                      (:gaussian families))
        intercept (get opts :intercept true)   ; currently unused
        eps       (or (:eps opts) 0.01)
        bstart    (:bstart opts)]
    (irls y
          X
          bstart
          (:invlink family)
          (:dlink family)
          (:weight family)
          eps)))

The GLM function simply delegates to the IRLS function with the family-specific inverse link, link derivative, and weight functions.

To test the GLM, I used the example from the Incanter linear-model documentation:

user=> (use '(incanter core stats datasets charts))
nil
user=> (def iris
(to-matrix (get-dataset :iris) :dummies true))
#'user/iris
user=> (def y (sel iris :cols 0))
#'user/y
user=> (def x (sel iris :cols (range 1 6)))
#'user/x
user=> (def iris-lm (linear-model y x))
#'user/iris-lm
user=> (:coefs iris-lm)
(2.171266292153149 0.4958889383890437 0.8292439122349187 -0.31515517332664444 -1.0234978144907245 -0.7235619577805039)

Now, does the GLM with the Gaussian family give the same coefficients?  First, we add an intercept column, since the glm function above does not add one itself.

user=> (def x (bind-columns (repeat 150 1) x))
#'user/x
user=> (glm y x
:bstart (matrix [1 1 1 1 1 1])
:family :gaussian)
[ 2.1713
0.4959
0.8292
-0.3152
-1.0235
-0.7236]
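They do.  This is expected: with the identity link and unit weights, g'(\mu) = 1 and W = I, so the working response is z = \eta + (y - \mu) = y and the IRLS update reduces to ordinary least squares in a single step,

\beta = (X^{\top} X)^{-1} X^{\top} y

which is exactly what linear-model computes.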

Finally, to test the binomial family, I used the “infert” dataset from R:

user=> (def sp (matrix [2 0 0 0 1 1 0 0 1 0 0 1 0 0 1 0 1 1 1 1 1 0 2 1 1 2 2 2 2 0 1 0 0 2 0 2 1 2 0 1 2 0 0 1 0 0 2 0 0 2 2 2 1 1 2 2 0 2 1 2 2 1 1 2 0 1 1 2 2 0 0 1 1 2 2 1 1 0 1 1 0 1 1 0 0 0 1 0 1 0 0 1 1 0 1 0 1 0 0 2 0 1 0 0 0 0 0 1 0 0 0 2 1 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 1 0 0 1 1 1 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 1 2 0 0 0 0 0 0 1 1 0 0 0 2 0 2 0 1 0 1 1 1 0 2 0 0 2 0 1 0 0 0 0 1 2 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 1 1 0 0 0 0 0 0 0 0 2 0 0 0 2 0 0 0 0 2 1 0 1 1 1 0 0 1 1]))
#'user/sp
user=> (def in (matrix [1 1 2 2 1 2 0 0 0 0 1 2 1 2 1 2 2 0 2 0 0 2 0 0 1 0 0 0 1 2 0 1 1 0 1 0 0 1 0 0 0 0 0 1 0 1 0 2 2 0 1 1 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 1 1 0 0 0 0 0 0 1 0 0 2 0 0 2 0 2 0 2 1 0 2 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 2 0 1 1 0 0 0 1 0 1 2 1 1 2 1 1 1 1 1 1 2 1 1 2 1 0 0 0 0 0 2 1 0 1 0 0 0 0 2 0 0 0 0 0 0 2 0 2 0 0 0 0 0 1 0 0 0 0 0 1 0 1 1 1 1 0 0 2 0 0 0 0 0 2 1 0 0 0 2 0 0 0 2 0 0 0 0 0 0 0 0 1 2 1 1 2 2 2 0 1 0 2 1 0 1 1 1 0 1 0 1 0 2 0 1 0 1 0 0 1 1 0 0 0 0 2 0 0]))
#'user/in
user=> (def case (matrix [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]))
#'user/case
user=> (def X (bind-columns (repeat (length sp) 1) sp in))
#'user/X
user=> (glm case X
:bstart (matrix [0 1 1])
:family :binomial
:eps 0.001)
[-1.7078
1.1972
0.4182]

 


2 Responses

  1. sozlaris

    Insanely cool.

    We just need to work out how to implement the algorithm so that datasets too big to fit in memory can be used (like SAS). This is where R falls down, of course.

    Less of an issue these days if you have a 64-bit machine with a lot of RAM, but still an issue.

    Then the next challenge will be to implement Dynamic GLMs, Vector GLMs, GLMMs, HGLMs, and GAMLSS, most of which are available for R.

    But still an awesomely cool illustration of clojure/incanter. Such little code for GLMs.

  2. Gregory Burd

    Clojure data structures (list, map, set) are all backed by Java Collections under the hood. There is one fairly simple way that any Java Collection can be extended so that it can be arbitrarily large: using the Berkeley DB Java Edition Collections API. Berkeley DB implements a persistent B-Tree with transactions and a basic key/value API, but it also includes a full implementation of the Collections API on top of the B-Tree. Using that API, your collection lives in the B-Tree, stored in files on the local disk, and only portions of the collection are in memory at any given time. Berkeley DB Java Edition maintains an LRU cache of the data you are operating on: whatever fits into a pre-determined cache size (a percentage of the Java heap) is in memory, and the rest is stored on disk. If Clojure/Incanter incorporated an option to use Berkeley DB Java Edition in this way, you could operate on as large a collection as you have storage space for.

    Disclosure: My day-job is Product Manager at Oracle for Berkeley DB. But, don't let that bias you. The license for Berkeley DB Java Edition is liberal enough to use in conjunction with Clojure.
