I recently pushed a very alpha Solr plugin to GitHub that performs unsupervised clustering of unstructured text documents. The plugin is written in Clojure and uses Incanter and the associated Parallel Colt libraries. Solr/Lucene builds an inverted index that maps terms to documents, and the plugin exploits this index to perform Latent Semantic Analysis (LSA).

In a nutshell, LSA attempts to extract concepts from a term-document matrix, whose elements hold the frequency, or some weighting of the frequency, of each term in each document. The key to LSA is rank reduction, which is performed by computing the Singular Value Decomposition (SVD) of the term-document matrix. The k largest singular values are kept, and the document-concept and term-concept matrices are truncated to rank k. This reduces noise due to extraneous words, which in turn leads to better clustering. In a subsequent post, I will discuss how to measure the performance of this algorithm.
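As a rough illustration of the rank-reduction step, here is a minimal sketch in Python with NumPy. It is not the plugin's code, which is written in Clojure on top of Incanter and Parallel Colt. The function name `lsa`, the helper `cosine`, and the toy term counts are all made up for the example.

```python
import numpy as np

def lsa(term_doc, k):
    """Rank-k LSA of a (terms x documents) matrix.

    Returns (term_concept, singular_values, doc_concept), each truncated
    to the k largest singular values.
    """
    # Thin SVD: term_doc = U @ diag(s) @ Vt, with s sorted in descending order.
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :].T

def cosine(a, b):
    """Cosine similarity between two 1-d vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy data: rows are terms, columns are documents.
# Documents 0-1 are "space"-like, documents 2-3 are "baseball"-like.
terms = ["launch", "satellit", "orbit", "inning", "homer", "game"]
A = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
    [0, 1, 1, 1],   # "game" also shows up in one space document (noise)
], dtype=float)

term_concept, singular_values, doc_concept = lsa(A, k=2)

# Document coordinates in the concept space: scale each concept axis
# by its singular value.
doc_coords = doc_concept * singular_values

# Rank-k approximation of the original matrix.
A_k = term_concept @ np.diag(singular_values) @ doc_concept.T
```

In this sketch, `doc_coords` plays the role of the points in the concept space, and `term_concept` holds one column per concept.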

I have tested the algorithm on the 20 Newsgroups data set. I started with only two newsgroups to see how well the algorithm performed. The following chart shows the two sets of documents projected into two dimensions of the concept space.

The blue points represent documents from the sci.space newsgroup and the red points documents from the rec.sport.baseball newsgroup. One can see that the algorithm has effectively separated these two groups in the concept space. There is some overlap in the center, as well as some outliers, and the overlap led to some misclassification. However, the clustering implemented so far is not very sophisticated: it simply assigns each document to the most similar centroid based on cosine similarity. A more effective clustering implementation would involve agglomerative clustering or some form of k-means clustering.
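The nearest-centroid assignment described above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the post does not say how the centroids are obtained, so the centroids and document vectors below are made-up two-dimensional values, and the function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)

def assign_to_centroids(doc_vectors, centroids):
    """For each document, return the index of its most similar centroid."""
    return [
        max(range(len(centroids)),
            key=lambda c: cosine_similarity(doc, centroids[c]))
        for doc in doc_vectors
    ]

# Hypothetical concept-space coordinates (not taken from the real data).
centroids = [[1.0, 0.1], [0.1, 1.0]]
docs = [[0.9, 0.2], [0.2, 0.8], [0.5, 0.45]]

labels = assign_to_centroids(docs, centroids)
```

The third document sits close to the boundary between the two centroids, which is the kind of point that ends up misclassified when the groups overlap.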

Another nice effect of SVD is the ability to extract the concept vectors. These serve to characterize the clusters. One can use these concept vectors to induce labels or to profile clusters. Some of the concept vectors for the above example are:

- us mission abort firm pegasus data pacastro system communic m contract ventur servic probe commerci market space satellit launch
- homer win astro saturday eighth friday sunday hit doublehead klein cub second third home game run score inning doubl

These are just two of the concept vectors. There are k concept vectors where k is the specified reduced rank supplied to the LSA algorithm. The next step is to map the cluster centroids to the concept vectors.
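One plausible way to read off such term lists is to take the highest-weighted terms in each column of the term-concept matrix. The post does not confirm that this is exactly how the lists above were produced, so treat the following as a sketch under that assumption. The function `top_terms`, the small vocabulary, and the weights are invented for illustration.

```python
def top_terms(term_concept, vocabulary, concept, n=5):
    """Return the n terms with the largest absolute weight on one concept.

    term_concept: list of rows, one row per term, one column per concept.
    Absolute values are used because the sign of a singular vector is arbitrary.
    """
    weighted = [(abs(row[concept]), term)
                for row, term in zip(term_concept, vocabulary)]
    weighted.sort(reverse=True)
    return [term for _, term in weighted[:n]]

# Hypothetical stemmed vocabulary and made-up term-concept weights.
vocabulary = ["launch", "satellit", "space", "inning", "game", "homer"]
term_concept = [
    [0.62, 0.03],
    [0.55, -0.02],
    [0.48, 0.05],
    [0.04, 0.60],
    [0.10, 0.52],
    [-0.01, 0.58],
]

space_like = top_terms(term_concept, vocabulary, concept=0, n=3)
baseball_like = top_terms(term_concept, vocabulary, concept=1, n=3)
```

With real data the rows would come from the truncated term-concept matrix and the vocabulary from the index's term dictionary.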

Currently, the LSA algorithm uses Parallel Colt’s SVD, so the matrix algebra is done in memory. This means that it will only work for a small number of documents (300-500). The next step is to investigate moving to Apache Mahout’s distributed matrix library.


## 9 Responses

## Jake Mannix

Anthony,

This is super-cool. I know nothing about Clojure (or other Lisp variants in general), but I wrote Mahout’s DistributedLanczosSolver and DistributedRowMatrix (whose timesSquared(Vector) and times(Vector) methods are the meat of the action), so I’m very interested to see how you use it, and if/where you find any issues.

We’d love to have some of this stuff *in* Mahout as well – we’re not averse to Clojure contributions at all, we’re just not experts in it currently (and being on the JVM means inclusion is very simple for us).

Any thought to contributing this to Mahout (I see it’s already Apache Licensed)? We could get it directly in our main releases, and it would benefit from any incremental improvements and other related Mahout capabilities (like doing k-means or canopy clustering on the SVD-projected output). Plus, Clojure’s REPL could be one of many CLIs Mahout could use if we had it hooked in properly…

-jake mannix ( @pbrane )

## Ted Dunning

I second Jake’s comments.

Please pop in on the Mahout mailing list if you want to talk about how to scale the SVD operation. Even just using in-memory operations, you should be able to get up to hundreds of thousands of documents.

## Heb

Hello,

Can you describe how you induced the concept labels? You just take the top n terms in the term-concept matrix (U if M = U S V), right?

Thanks!

## Matt Chaput

Given that if the dots weren’t color coded, no-one would be able to discern two groupings, I wonder if you can really say the algorithm has "effectively separated these two groups in the concept space". Considering how different the subject matter of the two newsgroups is, I wouldn’t expect the clusters to be right next to each other.

## CCRI

Matt – The plot is a projection from a high dimensional space down into a two dimensional space. The actual clustering is conducted using bottom up hierarchical clustering in this high dimensional space and the documents separate out quite well. The projection plot shown is just for visualization purposes, but in terms of euclidean distance the points do separate along the concept axis of highest variance. Once you pick a number of clusters, then you can use either k-means or hierarchical methods to separate the vectors into that number of clusters. Picking the right number of clusters is a hard problem. But, you are right, there always will be some amount of error in unsupervised learning problems.

## Lance Norskog

Hi-

This is really cool! I haven’t seen anything on the solr mailing lists about it.

Lance

## Bobby W.

Hi,

This looks great. I'm having problems building though due to what I believe are dependencies on several SNAPSHOT artifacts that are no longer available. Namely, "lein deps" fails to resolve the following dependencies:

org.apache.mahout:mahout-core:jar:0.4-SNAPSHOT

org.clojure:clojure:jar:1.2.0-master-SNAPSHOT

org.apache.mahout:mahout-math:jar:0.4-SNAPSHOT

org.apache.mahout:mahout-core:jar:0.4-SNAPSHOT

from any of the following repositories:

central (http://repo1.maven.org/maven2),

clojars (http://clojars.org/repo/),

apache (https://repository.apache.org/)

Any chance you either have these lying around and can provide them, or have updated code that changes these to non-SNAPSHOT artifacts?

## Sid

I recently updated the script for this program using the canvas element of HTML5. Now the algorithm shows the solution output by the k-means clustering algorithm with the points and clusters on a two dimensional coordinate plane.

## Gema

I didn’t understand: "One such difference is that Node.js only allows you to write code in JavaScript whereas Vert.x allows you to code in Java, Groovy, JavaScript, Ruby and eventually as the team expands: Clojure, Scala and Python." What’s so special in the last 3 products for bigger teams?