
This is more of a conceptual question than an actual implementation question, and I am hoping someone could clarify. My goal is the following: given a set of documents, I want to cluster them such that documents belonging to the same cluster have the same "concept".

From what I understand, Latent Semantic Analysis lets me find a low-rank approximation of a term-document matrix, i.e. given a matrix X, it will decompose X into a product of three matrices, one of which is a diagonal matrix Σ of singular values:

X = U Σ V^T
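To make the decomposition concrete, here is a minimal NumPy sketch on a toy term-document matrix (the matrix values are made up purely for illustration):

```python
import numpy as np

# Toy term-document matrix: rows are terms, columns are documents.
# The counts are invented for illustration; in practice they might be
# raw term frequencies or tf-idf weights.
X = np.array([[2, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 2, 0, 1],
              [0, 0, 1, 2]], dtype=float)

# Full SVD: X = U @ diag(s) @ Vt, where s holds the singular values
# (the diagonal of Σ) sorted in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

print(s)  # singular values, largest first
```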

Now, I would proceed by choosing a low-rank approximation, i.e. keep only the top k values from Σ, and then calculate X'. Once I have this matrix, I have to apply some clustering algorithm, and the end result would be a set of clusters grouping documents with similar concepts. Is this the right way of applying clustering, i.e. calculating X' and then clustering on top of it, or is there some other method that is followed?
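For reference, a sketch of one common way this is done, continuing the toy X above (scikit-learn's KMeans stands in for "some clustering algorithm", and k = 2 is an arbitrary illustrative rank). Note that rather than clustering the columns of X' itself, which still live in the full term space, it is usual to cluster each document's k-dimensional coordinates Σ_k V_k^T:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[2, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 2, 0, 1],
              [0, 0, 1, 2]], dtype=float)

k = 2  # illustrative rank; in practice inspect the singular value spectrum
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Rank-k approximation X' = U_k Σ_k V_k^T (the matrix described above).
X_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# k-dimensional coordinates of each document: Σ_k V_k^T has one column
# per document; transpose so each row is a document.
doc_coords = (np.diag(s[:k]) @ Vt[:k, :]).T

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(doc_coords)
print(labels)  # cluster assignment per document
```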

Also, in a somewhat related question of mine, I was told that the meaning of a nearest neighbor is lost as the number of dimensions increases. In that case, what is the justification for clustering these high-dimensional data points from X'? I am guessing that clustering similar documents is a real-world requirement, in which case how does one go about addressing this?
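One way to see why the reduced space helps (a hedged illustration, continuing the sketch above): pairwise document similarities can be compared in the original term space against the k-dimensional LSA space, where distances are computed over far fewer, denoised dimensions.

```python
from sklearn.metrics.pairwise import cosine_similarity

# Documents as rows: columns of X in term space vs. their k-dim LSA
# coordinates from the previous sketch.
sim_term_space = cosine_similarity(X.T)        # term-space similarities
sim_lsa_space = cosine_similarity(doc_coords)  # reduced-space similarities

print(sim_term_space.round(2))
print(sim_lsa_space.round(2))
```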

– Legend

1 Answer


For the first part of your question: no, you do not need to perform any additional 'clustering'. Such a grouping is already available from your singular value decomposition. If this is still unclear, please study the Latent Semantic Analysis link from your question in more detail.

For the second part: please first work through the first part of your question, and then restate this part based on that.

– eat
  • Thank you. So you mean that I truncate Vt to k rows and then compare columns, or maybe run k-means on the columns to get the final clusters? Just to make it clear, I do not have any query documents; I am trying to cluster the original documents. I did read the article, except that I am getting confused near the end. – Legend Jul 07 '11 at 22:01
  • I restate: AFAIU, no post-clustering is needed. Classifications are based either on X^T X or X X^T. More or less, just substitute X = U_k S_k V_k^T, where U_k, S_k, V_k represent the partition of the k largest singular values of U, S, V = svd(X). (For finding a suitable `k` you may like to google 'scree plot'.) See the sketch after these comments. Thanks – eat Jul 07 '11 at 22:24
  • Thank you for the suggestions. I am not sure if we are on the same page. From my understanding, SVD and PCA are dimensionality reduction techniques and k-means is a clustering technique. If I apply k-means directly to my high-dimensional data, the resulting clustering might be erroneous. For this reason, a common pre-processing step is to use a dimensionality reduction technique to reduce the number of dimensions, and then apply a clustering algorithm to cluster the data. Please see the next comment. – Legend Jul 08 '11 at 19:31
  • Let us assume that I accept what you are saying, that no clustering is needed. In that case, what should be done if I want to bin the documents into different topics? Say my initial goal was to classify these documents into different topics: SVD just tells me which features of the documents are important and useful in differentiating them, but an additional step is still required to figure out how to cluster these documents, isn't it? Please correct me if I am wrong at some point. There has been work that did this: http://www.ijest-ng.com/ijest-ng-vol2-no2-pp59-66.pdf. Thank you – Legend Jul 08 '11 at 19:33
  • By the way, I want to clarify that I am not arguing against what you are saying but merely trying to clarify my understanding in case I am doing something conceptually wrong, and I really appreciate your time in helping me out. Thank you for your patience. – Legend Jul 08 '11 at 19:36
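To illustrate the substitution in eat's comment above (a hypothetical sketch, not from the original thread): plugging X = U_k S_k V_k^T into X^T X gives V_k S_k^2 V_k^T, so the document-document similarity matrix falls directly out of the truncated SVD, with no separate pass over the full X.

```python
import numpy as np

X = np.array([[2, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 2, 0, 1],
              [0, 0, 1, 2]], dtype=float)

k = 2  # illustrative rank, as in the earlier sketches
U, s, Vt = np.linalg.svd(X, full_matrices=False)
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]

# X^T X ≈ V_k S_k^2 V_k^T : document-document similarities
doc_doc = Vtk.T @ np.diag(sk**2) @ Vtk

# X X^T ≈ U_k S_k^2 U_k^T : term-term similarities
term_term = Uk @ np.diag(sk**2) @ Uk.T

print(doc_doc.round(2))
```

Documents with high entries in the same rows/columns of `doc_doc` share the same latent concepts, which is the sense in which the grouping is "already available" from the decomposition; binning documents into discrete topics still amounts to reading cluster structure off these reduced coordinates, e.g. with k-means as sketched earlier.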