
I'm currently developing a program to compare two pieces of text based on their semantics (meaning). I understand there are libraries such as LingPipe which provide useful methods to compare string distances; however, I've heard that LSA is the best method to measure text similarity.

I just have one point of confusion about using LSA to measure text similarity. I understand that the process, with LSA, is:

1. Two passages are represented as two matrices, X and Y.

2. Using SVD, each matrix is reduced to three different matrices.

3. The cosine distance is then measured between the two matrices.

4. The cosine distance determines how similar they are.

I just want to know...

A. In SVD, the matrix is reduced to three smaller matrices. Which of these smaller matrices is used in the cosine distance measurement?

B. Cosine distance is usually applied to vectors. So in the case of applying it to matrices, I assume the matrices are iterated through, the cosine distance is measured between every pair of vectors, and the average of all these distances is then taken as the final cosine distance between the two matrices?

I understand this is a very niche topic, but I'm hoping for some light on these two questions. Thanks

kype

1 Answer


I think you started off on the wrong foot.

The collection of passages is represented as a type x document matrix. That is, rows represent the 'words' of the collection; columns represent the passages of the collection.

(Here you might want to apply the TF-IDF weighting scheme to the matrix.)
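For concreteness, here is a minimal sketch of building such a `type x document` matrix in plain Java. The toy passages, the class name, and the use of raw counts instead of TF-IDF are all illustrative assumptions, not something the answer prescribes:

```java
import java.util.*;

public class TermDocMatrix {
    public static void main(String[] args) {
        // Toy collection of passages (documents) -- purely illustrative
        String[] passages = {
            "the cat sat on the mat",
            "the dog sat on the log"
        };

        // Collect the vocabulary (the 'types' of the collection)
        SortedSet<String> vocab = new TreeSet<>();
        for (String p : passages)
            vocab.addAll(Arrays.asList(p.split("\\s+")));
        List<String> types = new ArrayList<>(vocab);

        // Build the type x document count matrix:
        // rows = types, columns = passages
        double[][] m = new double[types.size()][passages.length];
        for (int j = 0; j < passages.length; j++)
            for (String w : passages[j].split("\\s+"))
                m[types.indexOf(w)][j] += 1.0;
    }
}
```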

Using SVD you can decompose such a matrix (M) into three matrices (U, S, and V) so that

M = U * S * V^T

S is a diagonal matrix of the singular values of M, sorted in decreasing order. You can perform dimension reduction by keeping the first k singular values and setting the others to 0.
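Continuing the sketch above, one way to do this in Java is with Apache Commons Math's `SingularValueDecomposition` (my library choice for illustration, not one mentioned in the thread; the rank `k = 2` is an arbitrary example value):

```java
import org.apache.commons.math3.linear.*;

// Assumes double[][] m is the type x document matrix built above
RealMatrix M = new Array2DRowRealMatrix(m);
SingularValueDecomposition svd = new SingularValueDecomposition(M);

// getSingularValues() returns the values in decreasing order;
// keep the first k and zero out the rest
int k = 2;                                  // illustrative choice of rank
double[] sigma = svd.getSingularValues();
double[] kept = new double[sigma.length];
for (int i = 0; i < k && i < sigma.length; i++)
    kept[i] = sigma[i];
RealMatrix Sk = MatrixUtils.createRealDiagonalMatrix(kept);
```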

Now you can regenerate the type x document matrix using the previous equation and start computing cosine similarity between row vectors (i.e. type similarity) or column vectors (i.e. passage similarity).
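A hedged continuation of the same sketch, showing the reconstruction and a passage-level cosine comparison (it reuses the `svd` and `Sk` objects from the previous snippet; column indices 0 and 1 are just the two toy passages):

```java
// Regenerate the (now dense) type x document matrix at rank k
RealMatrix Mk = svd.getU().multiply(Sk).multiply(svd.getVT());

// Passage similarity: cosine between two column vectors
RealVector doc0 = Mk.getColumnVector(0);
RealVector doc1 = Mk.getColumnVector(1);
double sim = doc0.cosine(doc1);   // RealVector.cosine from Commons Math
```

Comparing row vectors of `Mk` instead would give the type-similarity variant mentioned above.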

Pierre
  • In that case, why wouldn't one do cosine similarity on the original `type x document` matrix? If the corpus size is small, SVD would just reduce the accuracy of the measurement, wouldn't it? – kype Oct 13 '14 at 12:52
  • No, because when you regenerate the `type x document` matrix you redistribute the mass of information, so that documents that have no word in common but are nevertheless similar yield a significant cosine similarity... – Pierre Oct 13 '14 at 12:54
  • With the traditional vector space model the `type x document` matrix is sparse; after applying dimension reduction, the matrix is dense. – Pierre Oct 13 '14 at 12:57
  • Do you know of any Java library that implements this (i.e. generating the SVD matrices from documents)? – kype Oct 13 '14 at 13:22
  • I don't know of any Java library, but there must be one... I have been using SVDLIBC as a standalone program, or Matlab's svd function. – Pierre Oct 13 '14 at 13:26