Term document matrix and cosine similarity in Python

Question

I have the following situation that I want to address in Python (preferably using numpy and scipy):

I have a collection of documents that I want to convert to a sparse term-document matrix.
I want to extract the sparse vector representation of each document (i.e. a row in the matrix) and find the top 10 most similar documents using cosine similarity within a certain subset of documents (the documents are labelled with categories, and I want to find similar documents within the same category).

How do I achieve this in Python? I know I can use scipy.sparse.coo_matrix to represent documents as sparse vectors and take the dot product to find cosine similarity, but how do I convert the entire corpus to a large but sparse term-document matrix (so that I can also extract its rows as scipy.sparse.coo_matrix row vectors)?

Thanks.

score 5 · Accepted Answer · edited May 23 '17 at 12:02

May I recommend you take a look at scikit-learn? It is a very well-regarded library in the Python community, with a simple and consistent API. It also implements a cosine similarity metric. This is an example taken from here of how you could do it in three lines of code:

>>> from sklearn.feature_extraction.text import TfidfVectorizer

>>> vect = TfidfVectorizer(min_df=1)
>>> tfidf = vect.fit_transform(["I'd like an apple",
...                             "An apple a day keeps the doctor away",
...                             "Never compare an apple to an orange",
...                             "I prefer scikit-learn to Orange"])
>>> (tfidf * tfidf.T).A
array([[ 1.        ,  0.25082859,  0.39482963,  0.        ],
       [ 0.25082859,  1.        ,  0.22057609,  0.        ],
       [ 0.39482963,  0.22057609,  1.        ,  0.26264139],
       [ 0.        ,  0.        ,  0.26264139,  1.        ]])
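
To restrict the search to documents in the same category, as the question asks, something along these lines might work. This is only a minimal sketch: the `labels` array and the helper name `top_k_in_category` are made up for illustration, and it assumes the rows are L2-normalised (which `TfidfVectorizer` does by default), so that a plain dot product between two rows is their cosine similarity.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus and category labels, for illustration only.
docs = [
    "I'd like an apple",
    "An apple a day keeps the doctor away",
    "Never compare an apple to an orange",
    "I prefer scikit-learn to Orange",
]
labels = np.array(["fruit", "fruit", "fruit", "software"])


def top_k_in_category(tfidf, labels, query_idx, k=10):
    """Return (doc_index, similarity) pairs for the k documents most similar
    to document `query_idx`, restricted to documents sharing its label."""
    # Indices of the *other* documents in the same category.
    candidates = np.flatnonzero(labels == labels[query_idx])
    candidates = candidates[candidates != query_idx]
    if candidates.size == 0:
        return []
    # Rows are L2-normalised, so the sparse dot product of two rows is
    # their cosine similarity. Only a (n_candidates x 1) result is densified.
    query_row = tfidf[[query_idx]]
    sims = (tfidf[candidates] @ query_row.T).toarray().ravel()
    order = np.argsort(-sims)[:k]
    return [(int(candidates[i]), float(sims[i])) for i in order]


vect = TfidfVectorizer()
# Sparse document-term matrix in CSR format: one row per document.
tfidf = vect.fit_transform(docs).tocsr()
result = top_k_in_category(tfidf, labels, query_idx=0, k=10)
```

Each row can still be pulled out as a sparse vector with `tfidf[[i]]`, and `.tocoo()` converts it to `coo_matrix` form if that representation is needed.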


answered Aug 07 '13 at 21:38

elyase


Thanks for the answer. Can you please explain attributes `T` and `A` in `(tfidf * tfidf.T).A`? – abhinavkulkarni Aug 08 '13 at 23:01
@abhinavkulkarni, Sure, `.T` gets you the transpose matrix and `.A` converts from sparse to normal dense representation. – elyase Aug 09 '13 at 00:16
@abhinavkulkarni my tfidf.shape returns (21488, 12602), and the computation ultimately raises a MemoryError. Can you tell me how to handle such a large collection of documents? – ashim888 Oct 28 '14 at 07:08
In this case you would have to use a [Hashing Vectorizer](http://scikit-learn.org/stable/modules/feature_extraction.html#feature-hashing) or more RAM. If it needs too much RAM for a single computer, then I recommend using Apache Spark for a distributed (several PCs) calculation. – elyase Oct 28 '14 at 11:08
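
One more way to limit memory use, sketched here under some assumptions, is to never build the full dense similarity matrix: with 21488 documents, `(tfidf * tfidf.T).A` has to hold a dense 21488 x 21488 array, whereas processing a block of rows at a time only needs a (chunk x n_docs) array in memory. The helper name `top_k_chunked` and the `chunk_size` value are hypothetical, and `HashingVectorizer` produces normalised term counts rather than tf-idf weights, so the similarities will differ from those in the answer.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer


def top_k_chunked(matrix, k=10, chunk_size=1000):
    """For each row of an L2-normalised sparse matrix, return the indices of
    its k most similar other rows, processing `chunk_size` rows at a time."""
    n_docs = matrix.shape[0]
    k = min(k, n_docs - 1)
    neighbours = np.empty((n_docs, k), dtype=np.int64)
    transposed = matrix.T
    for start in range(0, n_docs, chunk_size):
        stop = min(start + chunk_size, n_docs)
        # Dense block of shape (chunk, n_docs); the full n x n matrix
        # is never materialised.
        block = (matrix[start:stop] @ transposed).toarray()
        # Exclude each document's similarity with itself.
        block[np.arange(stop - start), np.arange(start, stop)] = -np.inf
        neighbours[start:stop] = np.argsort(-block, axis=1)[:, :k]
    return neighbours


docs = [
    "I'd like an apple",
    "An apple a day keeps the doctor away",
    "Never compare an apple to an orange",
    "I prefer scikit-learn to Orange",
]
# Stateless vectorizer: no vocabulary is kept in memory. norm="l2" makes the
# dot product of two rows equal to their cosine similarity.
vect = HashingVectorizer(n_features=2**18, norm="l2", alternate_sign=False)
hashed = vect.transform(docs)
neighbours = top_k_chunked(hashed, k=2, chunk_size=2)
```

For a category-restricted search, the same loop could be run on `matrix[rows_of_one_category]` for each category, which shrinks the blocks further.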

score 0 · Answer 2 · edited May 23 '17 at 11:46

You can refer to this question:

Python: tf-idf-cosine: to find document similarity

I answered that question, and my answer there shows how to compute the cosine similarity with the scikit-learn package.


answered Sep 20 '13 at 11:00

Gunjan

