This is more of a conceptual question than an implementation one, and I am hoping someone can clarify. My goal is the following: given a set of documents, I want to cluster them so that documents in the same cluster share the same "concept".
From what I understand, Latent Semantic Analysis lets me find a low-rank approximation of a term-document matrix: given a matrix X, it decomposes X (via singular value decomposition) into a product of three matrices, one of which is a diagonal matrix Σ:
X = U Σ V^T
Now, I would proceed by choosing a low-rank approximation, i.e. keep only the k largest values in Σ (setting the rest to zero), and then recompute the product to get X'. Once I have this matrix, I would apply some clustering algorithm to it, and the end result would be a set of clusters grouping documents with similar concepts. Is this the right way of applying clustering? That is, should I calculate X' and then cluster on top of it, or is some other method normally followed?
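To make the procedure concrete, here is a minimal sketch of what I have in mind. It assumes NumPy, a made-up 4x4 term-document matrix, k = 2, and a hand-written k-means with fixed starting centers; all of these are illustrative choices of mine, not a reference implementation, and any clustering algorithm could stand in for the k-means step.

```python
import numpy as np

# Toy term-document matrix X (rows = terms, columns = documents).
# Hypothetical counts: docs 0-1 are about cars, docs 2-3 about fruit.
X = np.array([
    [2, 1, 0, 0],   # "car"
    [1, 2, 0, 0],   # "engine"
    [0, 0, 2, 1],   # "apple"
    [0, 0, 1, 2],   # "banana"
], dtype=float)

# SVD: X = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2  # number of singular values kept (assumed for this toy data)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Rank-k approximation X' (same shape as X).
X_k = U_k @ np.diag(s_k) @ Vt_k

# k-dimensional document coordinates: one column per document.
docs_k = np.diag(s_k) @ Vt_k


def pairwise_dist(M):
    """Euclidean distances between the columns of M."""
    diff = M[:, :, None] - M[:, None, :]
    return np.sqrt((diff ** 2).sum(axis=0))


# U_k has orthonormal columns, so distances between the columns of X'
# equal distances between the k-dimensional document coordinates.
same = np.allclose(pairwise_dist(X_k), pairwise_dist(docs_k))


def kmeans(points, centers, n_iter=10):
    """Plain Lloyd's algorithm; points and centers are stored as rows."""
    for _ in range(n_iter):
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        centers = np.array(
            [points[labels == j].mean(axis=0) for j in range(len(centers))]
        )
    return labels


points = docs_k.T  # one row per document
labels = kmeans(points, centers=points[[0, 2]].copy())
print(same, labels)
```

In this sketch the pairwise distances between the columns of X' match those between the columns of Σ_k V_k^T, so for a Euclidean-distance clustering it would seem enough to cluster the k-dimensional document vectors rather than the full reconstructed matrix.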
Also, in a somewhat related question of mine, I was told that the notion of a neighbor loses its meaning as the number of dimensions increases. In that case, what is the justification for clustering these high-dimensional data points from X'? I assume that clustering similar documents is a common real-world requirement, so how does one go about addressing this in practice?