
I am new to this, so it would be helpful if someone could point me in the right direction or to a tutorial. Given a sentence and a list of other sentences (English):

s = "This concept of distance is not restricted to two dimensions."
list_s = ["It is not difficult to imagine the figure above translated into three dimensions.", "We can persuade ourselves that the measure of distance extends to an arbitrary number of dimensions;"]

I want to compute pairwise cosine similarity between each sentence in the list and sentence s, then find the max value.

What I've got so far:

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(norm='l2', min_df=0, use_idf=True, smooth_idf=False, sublinear_tf=True, tokenizer=tokenize)
bow_matrix = tfidf.fit_transform([s, ' '.join(list_s)])

1. What's next?

2. Should we transform the whole corpus or just the 2 sentences when computing pairwise cosine similarity?

3. How to apply removing stopwords and stemming for this?

Thanks!

Chelsea_cole

1 Answer


First, you might want to transform your documents as follows:

X = tfidf.fit_transform([s] + list_s) # now X will have 3 rows
  1. What's next? You have to compute the cosine similarity between each pair of rows of the tf-idf matrix (see this post on how to do that). For intuition, you can calculate the cosine distance between s and a sentence in list_s:

    from scipy.spatial.distance import cosine
    # scipy's cosine() expects 1-D vectors and returns the cosine *distance* (1 - similarity)
    cosine(X[0].toarray().ravel(), X[1].toarray().ravel())  # distance between s and the 1st sentence
    
  2. I would suggest transforming the whole corpus into the tf-idf matrix, since fitting the model also builds the vocabulary, i.e. your vectors will all be expressed over the same dictionary. You shouldn't transform only 2 sentences at a time.

  3. You can remove stop words by passing stop_words='english' when you create the tf-idf model (i.e. tfidf = TfidfVectorizer(..., stop_words='english')).
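Putting items 1 and 2 together: sklearn's cosine_similarity works directly on the sparse tf-idf matrix, so you can get the similarity of s against every sentence in list_s, and the max, in one call. A minimal sketch using the sentences from the question (I use the default tokenizer here instead of your custom tokenize, and add stop-word removal as in item 3):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

s = "This concept of distance is not restricted to two dimensions."
list_s = [
    "It is not difficult to imagine the figure above translated into three dimensions.",
    "We can persuade ourselves that the measure of distance extends to an arbitrary number of dimensions;",
]

tfidf = TfidfVectorizer(stop_words='english')
X = tfidf.fit_transform([s] + list_s)  # row 0 is s, rows 1.. are list_s

# cosine similarity of s (row 0) against every other row, as a flat array
sims = cosine_similarity(X[0], X[1:]).ravel()
best = sims.argmax()
print(best, sims[best])  # index of the most similar sentence and its score
```

Note that cosine_similarity returns a similarity (1.0 for identical documents), unlike scipy's cosine(), which returns a distance.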

For stemming, you might consider using nltk to create a stemmer. Here is a simple way to stem your texts (note that you might also want to remove punctuation before stemming):

from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

def stem(text):
    text_stem = [stemmer.stem(token) for token in text.split(' ')]
    text_stem_join = ' '.join(text_stem)
    return text_stem_join

list_s_stem = list(map(stem, list_s)) # map stem function to list of documents

Now you can use this list_s_stem in TfidfVectorizer instead of list_s (stem s the same way, so that all documents share the same token forms).
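As an end-to-end sketch of the stemming pipeline: the toy_stem helper below is a hypothetical stand-in for nltk's PorterStemmer, used only so the example runs without nltk installed; in practice, substitute stemmer.stem from the snippet above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def toy_stem(token):
    # Hypothetical stand-in for nltk's PorterStemmer: strips a few
    # common suffixes. Replace with stemmer.stem() in real use.
    for suffix in ('ions', 'ing', 'ed', 's'):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)]
    return token

def stem(text):
    # lowercase, strip punctuation, then stem each token
    tokens = [tok.strip('.,;:!?') for tok in text.lower().split()]
    return ' '.join(toy_stem(tok) for tok in tokens)

s = "This concept of distance is not restricted to two dimensions."
list_s = [
    "It is not difficult to imagine the figure above translated into three dimensions.",
    "We can persuade ourselves that the measure of distance extends to an arbitrary number of dimensions;",
]

# stem s as well as list_s, so every document uses the same token forms
docs = [stem(s)] + [stem(d) for d in list_s]
tfidf = TfidfVectorizer(stop_words='english')
X = tfidf.fit_transform(docs)

sims = cosine_similarity(X[0], X[1:]).ravel()
print(sims.argmax(), sims.max())  # most similar sentence and its score
```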

titipata
  • Thanks, this is exactly what I want. I used your code and have another question: 1. Is cosine(...) or 1 - cosine(...) correct? 2. Even if I replace the first sentence in list_s with s, cosine(X[0], X[1]) should be 1 because we're comparing a sentence with itself. But it isn't. Can you help me figure this out? Thanks! – Chelsea_cole Jun 16 '16 at 06:02
  • Note that scipy's cosine() returns the cosine *distance*, i.e. 1 - similarity, so use 1 - cosine(...) if you want a similarity score: identical documents give distance 0 and similarity 1. For the second question, cosine(X[0], X[0]) will compare `s` with `s`, i.e. `X[0], X[1], X[2]` are the tf-idf transformations of `s, list_s[0], list_s[1]` respectively. – titipata Jun 16 '16 at 06:52