
I checked several document clustering algorithms, such as LSA, pLSA, LDA, etc. It seems they all require the documents to be clustered to be represented as a document-word matrix, where the rows stand for documents and the columns stand for the words appearing in them. And the matrix is often very sparse.
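To make that concrete, here is a tiny sketch (my own illustration, not part of the original question; the corpus is made up) of such a document-word matrix built with scikit-learn's CountVectorizer. Note how most entries are zero:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stock markets fell sharply today",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)          # sparse matrix: rows = documents, columns = words

print(vectorizer.get_feature_names_out())   # the vocabulary, i.e. the column labels
print(X.toarray())                          # dense view just for display; most entries are 0
```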

I am wondering, are there any other options for representing documents besides the document-word matrix? Because I believe the way we express a problem has a significant influence on how well we can solve it.

smwikipedia
  • The LSA family of methods relies on sets of vectors (matrices), so no, you can't change the representation without changing the method. You can, however, change the matrix implementation. See [Wikipedia](http://en.wikipedia.org/wiki/Sparse_matrix) for some common sparse matrix representations (a small sketch of one of them follows these comments). – ffriend Feb 11 '14 at 12:58
  • Thanks, but I am not limiting it to the LSA family of methods. – smwikipedia Feb 11 '14 at 13:16
  • Then you need to define your goal first. For clustering you need a notion of distance/similarity - some number that shows how close (similar) two documents are to each other in your specific context. LSA-like methods assume semantic similarity and rely on word counts, but in your setting you may be interested in style similarity, or opinion similarity, or length-of-word similarity, or whatever. So it's not really about data representation, but rather about what you want and which methods fit better. – ffriend Feb 11 '14 at 13:55
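As a hedged illustration of ffriend's first point - changing the matrix implementation rather than the representation - here is a small sketch (mine, not from the comments) of a compressed sparse row (CSR) matrix, one of the representations listed on the linked Wikipedia page:

```python
import numpy as np
from scipy.sparse import csr_matrix

# A tiny document-word count matrix with mostly zero entries.
dense = np.array([
    [2, 0, 0, 1],   # document 0
    [0, 0, 3, 0],   # document 1
])

sparse = csr_matrix(dense)   # same data, different implementation

print(sparse.data)      # [2 1 3]  non-zero values only
print(sparse.indices)   # [0 3 2]  column index of each stored value
print(sparse.indptr)    # [0 2 3]  where each row starts/ends in data/indices
```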

3 Answers


As @ffriend pointed out, you cannot really avoid the term-document matrix (TDM) paradigm. Clustering methods operate on points in a vector space, and this is exactly what the TDM encodes. However, within that conceptual framework there are many things you can do to improve the quality of the TDM:

  • feature selection and re-weighting attempt to remove or down-weight features (words) that do not contribute useful information (in the sense that your chosen algorithm does just as well or better without these features, or with them down-weighted). You might want to read more about Mutual Information (and its many variants) and TF-IDF; the first three points in this list are illustrated together in a short sketch after the list.
  • dimensionality reduction is about encoding the information as accurately as possible in the TDM using fewer columns. Singular Value Decomposition (the basis of LSA) and Non-Negative Tensor Factorisation are popular in the NLP community. A desirable side effect is that the TDM becomes considerably less sparse.
  • feature engineering attempts to build a TDM where the choice of columns is motivated by linguistic knowledge. For instance, you may want to use bigrams instead of words, or only use nouns (requires a part-of-speech tagger), or only use nouns with their associated adjectival modifier (e.g. big cat, requires a dependency parser). This is a very empirical line of work and involves a lot of experimentation, but it often yields improved results.
  • the distributional hypothesis makes it possible to get a vector representing the meaning of each word in a document. There has been work on trying to build up a representation of an entire document from the representations of the words it contains (composition). Here is a shameless link to my own post describing the idea.
  • There is a massive body of work on formal and logical semantics that I am not intimately familiar with. A document can be encoded as a set of predicates instead of a set of words, i.e. the columns of the TDM can be predicates. In that framework you can do inference and composition, but lexical semantics (the meaning of individual words) is hard to deal with.
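As a combined sketch of the first three bullets (my own illustration, not the answerer's code; the corpus and parameter choices are made up): bigram features, TF-IDF re-weighting, and SVD-based dimensionality reduction with scikit-learn:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the big cat sat on the mat",
    "a small dog chased the big cat",
    "stock markets fell sharply today",
    "investors sold shares as markets fell",
]

# Feature engineering + re-weighting: unigrams and bigrams, stop words removed,
# raw counts replaced by TF-IDF weights.
tdm = TfidfVectorizer(ngram_range=(1, 2), stop_words="english").fit_transform(docs)

# Dimensionality reduction: project the sparse TDM onto a few latent dimensions
# (TruncatedSVD on a TF-IDF matrix is essentially LSA).
reduced = TruncatedSVD(n_components=2, random_state=0).fit_transform(tdm)
print(reduced.shape)    # (4, 2): each document is now a short dense vector
```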

For a really detailed overview, I recommend Turney and Pantel's "From Frequency to Meaning: Vector Space Models of Semantics".

mbatchkarov

Your question says you want document clustering, not term clustering or dimensionality reduction. Therefore I'd suggest you steer clear of the LSA family of methods, since they're a preprocessing step.

Define a feature-based representation of your documents (which can be, or include, term counts but needn't be), and then apply a standard clustering method. I'd suggest starting with k-means as it's extremely easy and there are many, many implementations of it.
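A minimal sketch of that suggestion (my own, with TF-IDF as just one possible feature representation and a made-up corpus), using scikit-learn's KMeans:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stock markets fell sharply today",
    "investors sold shares as markets fell",
]

# Any feature-based representation would do; TF-IDF term weights are one easy choice.
features = TfidfVectorizer(stop_words="english").fit_transform(docs)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
print(labels)   # one cluster id per document, e.g. [0 0 1 1]
```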

Ben Allison

OK, this is quite a general question, and many answers are possible, none of them definitive, because it's an ongoing research area. So far, the answers I have read mainly concern so-called "vector-space models", and your question is phrased in a way that suggests such "statistical" approaches. Yet, if you want to avoid manipulating explicit term-document matrices, you might want to have a closer look at the Bayesian paradigm, which relies on the same distributional hypothesis but exploits a different theoretical framework: you no longer manipulate raw distances, but rather probability distributions and, most importantly, you can do inference based on them.

You mentioned LDA; I guess you mean Latent Dirichlet Allocation, which is the best-known such Bayesian model for document clustering. It is an alternative paradigm to vector space models, and a winning one: it has been shown to give very good results, which justifies its current success. Of course, one can argue that you still use a kind of term-document matrix through the multinomial parameters, but it's clearly not the most important aspect, and Bayesian researchers rarely (if ever) use this term.

Because of its success, there are many software packages on the net that implement LDA. Here is one, but there are many others: http://jgibblda.sourceforge.net/
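The link above is a Java implementation; purely as an illustration of the idea (my own sketch, not from the answer, with a made-up corpus), here is the same kind of model with scikit-learn's LatentDirichletAllocation, where each document's dominant topic can be read off as a cluster label:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stock markets fell sharply today",
    "investors sold shares as markets fell",
]

counts = CountVectorizer(stop_words="english").fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)   # each row is P(topic | document), rows sum to 1

print(doc_topics.argmax(axis=1))         # dominant topic per document, usable as a cluster id
```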

xtof54