
I have the following situation that I want to address in Python (preferably with numpy and scipy):

  1. I have a collection of documents that I want to convert to a sparse term-document matrix.
  2. I want to extract the sparse vector representation of each document (i.e. a row in the matrix) and find the top 10 most similar documents by cosine similarity within a certain subset of documents (the documents are labelled with categories, and I want to find similar documents within the same category).

How do I achieve this in Python? I know I can use scipy.sparse.coo_matrix to represent documents as sparse vectors and take the dot product to find the cosine similarity, but how do I convert the entire corpus to a large but sparse term-document matrix (so that I can also extract its rows as scipy.sparse.coo_matrix row vectors)?

Thanks.

abhinavkulkarni

2 Answers


May I recommend you take a look at scikit-learn? It is a very well-regarded library in the Python community with a simple and consistent API, and it also implements a cosine similarity metric. This example, taken from here, shows how you could do it in three lines of code:

>>> from sklearn.feature_extraction.text import TfidfVectorizer

>>> vect = TfidfVectorizer(min_df=1)
>>> tfidf = vect.fit_transform(["I'd like an apple",
...                             "An apple a day keeps the doctor away",
...                             "Never compare an apple to an orange",
...                             "I prefer scikit-learn to Orange"])
>>> (tfidf * tfidf.T).A
array([[ 1.        ,  0.25082859,  0.39482963,  0.        ],
       [ 0.25082859,  1.        ,  0.22057609,  0.        ],
       [ 0.39482963,  0.22057609,  1.        ,  0.26264139],
       [ 0.        ,  0.        ,  0.26264139,  1.        ]])
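The question also asks for the top 10 similar documents within a single category. Below is a minimal sketch of that step, assuming the category labels are available as a list parallel to the documents. The function name `top_k_in_category` and the labels are illustrative and not part of scikit-learn. It relies on `TfidfVectorizer` L2-normalising each row by default, so a plain dot product between rows equals their cosine similarity.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def top_k_in_category(docs, labels, query_idx, k=10):
    """Return (index, similarity) pairs for the k documents most similar
    to docs[query_idx], restricted to documents with the same label."""
    # Sparse CSR matrix, one L2-normalised row per document.
    tfidf = TfidfVectorizer().fit_transform(docs)

    labels = np.asarray(labels)
    # Row indices of the other documents in the query's category.
    candidates = np.flatnonzero(labels == labels[query_idx])
    candidates = candidates[candidates != query_idx]

    # Slice (rather than index) the query row so it stays a 1 x n_terms matrix.
    query = tfidf[query_idx:query_idx + 1]
    # Rows are unit length, so the dot product is the cosine similarity.
    sims = (tfidf[candidates] @ query.T).toarray().ravel()

    order = np.argsort(-sims)[:k]
    return [(int(candidates[i]), float(sims[i])) for i in order]


docs = ["I'd like an apple",
        "An apple a day keeps the doctor away",
        "Never compare an apple to an orange",
        "I prefer scikit-learn to Orange"]
labels = ["fruit", "fruit", "fruit", "software"]  # hypothetical categories

result = top_k_in_category(docs, labels, query_idx=0, k=2)
print(result)
```

Only the candidate rows are multiplied, so the full document-by-document similarity matrix is never built.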
elyase
  • Thanks for the answer. Can you please explain attributes `T` and `A` in `(tfidf * tfidf.T).A`? – abhinavkulkarni Aug 08 '13 at 23:01
  • @abhinavkulkarni, Sure, `.T` gets you the transpose matrix and `.A` converts from sparse to normal dense representation. – elyase Aug 09 '13 at 00:16
  • @abhinavkulkarni my `tfidf.shape` returns (21488, 12602), and the computation ultimately raises a MemoryError. How can I handle such a large collection of documents? – ashim888 Oct 28 '14 at 07:08
  • In this case you would have to use a [Hashing Vectorizer](http://scikit-learn.org/stable/modules/feature_extraction.html#feature-hashing) or more RAM. If it needs more RAM than a single computer has, then I recommend Apache Spark for a distributed (several PCs) calculation. – elyase Oct 28 '14 at 11:08
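One likely source of the MemoryError above is the `.A` call, which materialises the full dense similarity matrix: 21488 x 21488 float64 values come to roughly 3.7 GB. The sketch below combines the `HashingVectorizer` suggested in the comment with a chunked product that keeps only the top-k neighbours per document. This is an assumption about what would help, not something tested on that dataset. The function name `top_k_chunked` and its `chunk_size` parameter are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer


def top_k_chunked(docs, k=10, chunk_size=1000, n_features=2**18):
    """Return an (n_docs, k) array holding, for each document, the indices
    of its k most cosine-similar documents (excluding itself)."""
    # Stateless vectorizer: no vocabulary is stored, rows are L2-normalised.
    vect = HashingVectorizer(n_features=n_features, norm="l2")
    X = vect.transform(docs)  # sparse CSR matrix

    n = X.shape[0]
    k = min(k, n - 1)
    top = np.empty((n, k), dtype=np.int64)

    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        # Only a (chunk x n_docs) block is ever held densely in memory.
        block = (X[start:stop] @ X.T).toarray()
        # Mask each document's similarity with itself.
        block[np.arange(stop - start), np.arange(start, stop)] = -np.inf
        top[start:stop] = np.argsort(-block, axis=1)[:, :k]
    return top


docs = ["I'd like an apple",
        "An apple a day keeps the doctor away",
        "Never compare an apple to an orange",
        "I prefer scikit-learn to Orange"]
neighbours = top_k_chunked(docs, k=2, chunk_size=3)
print(neighbours)
```

Peak memory is governed by `chunk_size` times the number of documents rather than by the square of the number of documents. Note that `HashingVectorizer` applies no IDF weighting, so the scores differ from the TF-IDF ones.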

You can refer to this question:

Python: tf-idf-cosine: to find document similarity

I answered that question and showed how to compute the cosine similarity with the scikit-learn package.
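For reference, here is a minimal sketch of what that can look like with `sklearn.metrics.pairwise.cosine_similarity`. The sample documents are reused from the other answer here, and the variable names are illustrative. It is not necessarily the exact code from the linked answer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["I'd like an apple",
        "An apple a day keeps the doctor away",
        "Never compare an apple to an orange",
        "I prefer scikit-learn to Orange"]

# Sparse TF-IDF matrix, one row per document.
tfidf = TfidfVectorizer().fit_transform(docs)

# Similarity of the first document to every document (itself included).
# The result is a dense 1 x n_docs array, flattened here to 1-D.
sims = cosine_similarity(tfidf[0:1], tfidf).ravel()

# Document indices ordered from most to least similar.
ranked = sims.argsort()[::-1]
print(ranked, sims[ranked])
```

Passing a single row as the first argument avoids computing the full pairwise matrix when only one document's neighbours are needed.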

Gunjan