I recently pushed a very alpha Solr plugin to GitHub that does unsupervised clustering on unstructured text documents. The plugin is written in Clojure and utilizes the Incanter and associated Parallel Colt libraries. Solr/Lucene builds an inverted index of term to document mappings. This inverted index is exploited to perform Latent Semantic Analysis. In a nutshell, LSA attempts to extract concepts from a term-document matrix. A term-document matrix contains elements that indicate the frequency or some weighting of the frequency of terms in a document. The key to LSA is rank reduction which is performed by extracting the Singular Value Decomposition of the term-document matrix. The k highest singular values are selected from the SVD and the document-concept and term-concept matrices are reduced to rank k. This has the effect of reducing noise due to extraneous words which in turn leads to better clustering. In a subsequent post, I will discuss how to measure the performance of this algorithm.
I have tested the algorithm on 20 Newsgroups data set. I started with only two newsgroups to see how well the algorithm performed. The following chart shows the two sets of documents projected into two dimensions of the concept space.
The blue points represent documents from the sci.space newsgroup and the red points from the rec.sports.baseball newsgroup. One can see that the algorithm has effectively separated these two groups in the concept space. There is some overlap in the center as well as some outliers. As a result of the overlap, there was some mis-classification. However, the actual clustering implemented so far is not very sophisticated. It simply selects the most similar centroid based on cosine similarity. A more effective clustering implementation would involve agglomerative clustering or some form of k-means clustering.
Another nice effect of SVD is the ability to extract the concept vectors. These serve to characterize the clusters. One can use these concept vectors to induce labels or to profile clusters. Some of the concept vectors for the above example are:
- us mission abort firm pegasus data pacastro system communic m contract ventur servic probe commerci market space satellit launch
- homer win astro saturday eighth friday sunday hit doublehead klein cub second third home game run score inning doubl
These are just two of the concept vectors. There are k concept vectors where k is the specified reduced rank supplied to the LSA algorithm. The next step is to map the cluster centroids to the concept vectors.
Currently, the LSA algorithm uses Parallel Colt’s SVD so the matrix algebra is done in-memory. This means that it will only work for small numbers (300-500) of documents. The next step is to investigate moving to Apache Mahout’s distributed matrix library.Share this post: