
I have 4 tables with schema (app, text_id, title, text). Now I'd like to compute the cosine similarity between all possible text pairs (title and text concatenated) and eventually store the results in a CSV file with fields (app1, app2, text_id1, text1, text_id2, text2, cosine_similarity).

Since there are a lot of possible combinations, it should run quite efficiently. What is the most common approach here? I'd appreciate any pointers.

Edit: Although the provided reference might touch on my problem, I still can't figure out how to approach this. Could someone provide more details on the strategy to accomplish this task? Besides the calculated cosine similarity, I also need the corresponding text pairs as output.

eoe
  • Possible duplicate of [What's the fastest way in Python to calculate cosine similarity given sparse matrix data?](http://stackoverflow.com/questions/17627219/whats-the-fastest-way-in-python-to-calculate-cosine-similarity-given-sparse-mat) – Shreyash S Sarnayak Jan 06 '17 at 11:22

1 Answer


The following is a minimal example that calculates the pairwise cosine similarities between a set of documents (assuming you have already retrieved the title and text from your database).

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Assume that's the data we have (4 short documents)
data = [
    'I like beer and pizza',
    'I love pizza and pasta',
    'I prefer wine over beer',
    'Thou shalt not pass'
]

# Vectorise the data
vec = TfidfVectorizer()
X = vec.fit_transform(data) # `X` is now the TF-IDF representation of the data; row i of `X` corresponds to entry i of `data`

# Calculate the pairwise cosine similarities (depending on how much data you have, this could take a while)
S = cosine_similarity(X)

'''
S looks as follows:
array([[ 1.        ,  0.4078538 ,  0.19297924,  0.        ],
       [ 0.4078538 ,  1.        ,  0.        ,  0.        ],
       [ 0.19297924,  0.        ,  1.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  1.        ]])

The first row of `S` contains the cosine similarities of the first document to every other document in `X`.
For example, the cosine similarity of the first sentence to the third sentence is ~0.193.
Obviously the similarity of every sentence/document to itself is 1 (hence the diagonal of the similarity matrix is all ones).
Since all indices are consistent, it is straightforward to map the similarities back to the corresponding sentences.
'''
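To get from `S` to the CSV layout asked for in the question, here is one possible sketch. The parallel lists `apps` and `text_ids` are hypothetical placeholders, assumed to be aligned with the rows of `data` and `S` (in practice they would come from your tables).

import csv

# Hypothetical metadata aligned with `data`/`S` (replace with values from your tables)
apps = ['app_a', 'app_a', 'app_b', 'app_b']
text_ids = [1, 2, 3, 4]

with open('similarities.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['app1', 'app2', 'text_id1', 'text1', 'text_id2', 'text2', 'cosine_similarity'])
    n = len(data)
    for i in range(n):
        # Only the upper triangle: each pair is written once and self-pairs are skipped
        for j in range(i + 1, n):
            writer.writerow([apps[i], apps[j], text_ids[i], data[i], text_ids[j], data[j], S[i, j]])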
tttthomasssss
  • This works great, thanks. Based on that I have two more questions: First, I need to iterate through the array and check where the cosine similarity is >= 0.8, and then I need to somehow get the document pairs (not only the positions in the array, but also the names of the documents). How would you do this? Second, I have a lot of data, so computation cost is an issue. Is it possible to calculate only half of the matrix, since it's symmetric anyway? – eoe Jan 09 '17 at 11:01
  • @EmanuelGeorge you can get the indices from `S` either by doing `np.where(S >= 0.8)` or by thresholding the whole array with `S[S < 0.8] = 0`, which sets all elements with a similarity below 0.8 to 0. A single entry in `S` represents the similarity between a pair of documents (e.g. the entry at index [0, 1] is the similarity between document 0 and document 1). If you have a list of document names, then mapping indices in `S` to document names is straightforward. `sklearn` is very efficient; I don't think it will calculate anything twice. – tttthomasssss Jan 09 '17 at 22:37
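
As a rough sketch of the thresholding idea from the comment above: the 0.8 cutoff and the `names` list below are placeholders, with `names` assumed to be aligned with the rows of `S`.

import numpy as np

names = ['doc0', 'doc1', 'doc2', 'doc3']  # hypothetical document names

# Keep only the upper triangle (k=1 also drops the diagonal), so each pair
# is considered once and self-similarities are ignored.
upper = np.triu(S, k=1)
rows, cols = np.where(upper >= 0.8)

for i, j in zip(rows, cols):
    print(names[i], names[j], S[i, j])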