I'm working on an NLP project where I have to compare the similarity between many sentences, e.g. from this dataframe:
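For context, the dataframe has one row per sentence: the question text plus its embedding vector. A minimal sketch of the structure (column names taken from my code below; the vectors here are just dummy values standing in for the real sentence embeddings):

import numpy as np
import pandas as pd

df_sample = pd.DataFrame({
    "questions": ["How do I reset my password?",
                  "How can I change my password?",
                  "Where is the billing page?"],
    # dummy 3-d vectors in place of the real embedding vectors
    "use_vector": [np.array([0.10, 0.30, 0.60]),
                   np.array([0.10, 0.35, 0.55]),
                   np.array([0.90, 0.05, 0.05])],
})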
The first thing I tried was to join the dataframe with itself to get the below format and compare row by row:
The problem with this is that I run out of memory quickly for medium/big datasets, e.g. joining 10k rows gives 100M rows, which I cannot fit in RAM.
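The join was essentially a cross join of the dataframe with itself, roughly like this (a sketch; how="cross" requires pandas >= 1.2):

# every question paired with every other question: n rows -> n*n rows,
# which is why 10k rows explodes into 100 million pairs
pairs = df_sample.merge(df_sample, how="cross", suffixes=("_a", "_b"))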
My current approach is to iterate over the dataframe as follows:
import copy
import pandas as pd

final = pd.DataFrame()
### for each row
for i in range(len(df_sample)):
    ### select the corresponding vector to compare with
    v = df_sample[df_sample.index.isin([i])]["use_vector"].values
    ### compare all cases against the selected vector
    ### (cosine_similarity_numba is my numba-compiled cosine similarity helper)
    sims = df_sample.apply(lambda x: cosine_similarity_numba(x.use_vector, v[0]), axis=1)
    ### keep the cases with a similarity over a given threshold, in this case 0.6
    temp = df_sample[sims > 0.6]
    ### filter out the base case
    temp = temp[~temp.index.isin([i])]
    temp["original_question"] = copy.copy(df_sample[df_sample.index.isin([i])]["questions"].values[0])
    ### append the result
    final = pd.concat([final, temp])
But this approach is not fast either. How can I improve the performance of this process?