Dataframe faster cosine similarity

Question

I have a dataframe of individual tweets (id, text, author_id, nn_list), where nn_list is a list of other tweet indices that were previously identified as potential nearest neighbours. I now need to calculate the cosine similarity between each tweet and every entry in its nn_list, looking up the corresponding rows in the tfidf matrix to compare the vectors. My current approach is rather slow. The code looks something like this:

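from sklearn.metrics import pairwise_distances

for index, row in data_df.iterrows():
    # Reset the best match for each tweet.
    nn_distance = float("inf")
    current_nn_candidate = None
    for candidate in row["nn_list"]:
        # Cosine distance between one candidate and the current tweet, rounded to 2 decimals.
        candidate_cos = round(pairwise_distances(tfidf_matrix[candidate], tfidf_matrix[index], metric='cosine')[0, 0], 2)

        if candidate_cos < nn_distance:
            current_nn_candidate = candidate
            nn_distance = candidate_cos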

Is there a significantly faster way to calculate this?

Hi, welcome to Stack Overflow. Please provide a [minimal, reproducible example](https://stackoverflow.com/help/minimal-reproducible-example); as it stands, it is hard for any of us to help you. - dankal444, Nov 30 '21 at 20:37

score 0 · Accepted Answer · answered Dec 01 '21 at 08:26

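The following code should work, assuming the range of IDs is not too large. It builds a sparse 0/1 matrix with one row per tweet and one column per ID in nn_list, then computes all pairwise cosine similarities between the rows: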

import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

df = pd.DataFrame({"nn_list": [[1, 2], [1,2,3], [1,2,3,7], [11, 12, 13], [2,1]]})

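# Build the (data, (row_ind, col_ind)) triplets expected by csr_matrix, see
# https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html
df["data"] = df["nn_list"].apply(lambda x: np.repeat(1, len(x)))  # one 1 per listed ID
df["row"] = df.index
# Repeat each row number once per ID in that row's nn_list.
df["row_ind"] = df[['row', 'nn_list']].apply(lambda x: np.repeat(x["row"], len(x["nn_list"])), axis=1)
df["col_ind"] = df['nn_list'].apply(lambda x: np.array(x))  # the IDs become column indices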

m = csr_matrix(
    (np.concatenate(df['data']), 
    (np.concatenate(df['row_ind']), np.concatenate(df['col_ind']))))

cosine_similarity(m)

This returns:

array([[1.        , 0.81649658, 0.70710678, 0.        , 1.        ],
       [0.81649658, 1.        , 0.8660254 , 0.        , 0.81649658],
       [0.70710678, 0.8660254 , 1.        , 0.        , 0.70710678],
       [0.        , 0.        , 0.        , 1.        , 0.        ],
       [1.        , 0.81649658, 0.70710678, 0.        , 1.        ]])
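
For 0/1 indicator rows like these, cosine similarity reduces to the size of the overlap between two lists divided by the square root of the product of their lengths. A minimal check of that reading, using plain Python sets (the helper name `set_cosine` is made up for illustration, and it assumes no ID appears twice within one list):

```python
import math

def set_cosine(a, b):
    """Cosine similarity of two 0/1 indicator vectors, given as ID lists."""
    a, b = set(a), set(b)
    return len(a & b) / math.sqrt(len(a) * len(b))

nn_lists = [[1, 2], [1, 2, 3], [1, 2, 3, 7], [11, 12, 13], [2, 1]]

print(round(set_cosine(nn_lists[0], nn_lists[1]), 8))  # 0.81649658
print(round(set_cosine(nn_lists[1], nn_lists[2]), 8))  # 0.8660254
print(set_cosine(nn_lists[0], nn_lists[3]))            # 0.0
```

The printed values match entries [0, 1], [1, 2] and [0, 3] of the matrix above.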

If you have a larger range of IDs, I recommend using Spark or looking into cosine similarity on large sparse matrices with numpy.
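
Another option that avoids the full pairwise matrix is to compute only the (tweet, candidate) pairs that appear in nn_list, directly on the tfidf rows. The following is a rough sketch, not a tested drop-in: the toy `data_df` and `tfidf_matrix` stand in for the question's objects, and it assumes that the entries of nn_list are row positions in the tfidf matrix.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

# Toy data standing in for the question's data_df and tfidf_matrix (assumption).
data_df = pd.DataFrame({
    "text": ["cats chase mice", "cats chase dogs", "dogs chase cars", "mice eat cheese"],
    "nn_list": [[1, 2, 3], [0, 2], [0, 1], [0]],
})
tfidf_matrix = TfidfVectorizer().fit_transform(data_df["text"])

# L2-normalise the rows so that a plain dot product equals cosine similarity.
unit = normalize(tfidf_matrix)

# Flatten every (tweet, candidate) pair into two aligned position arrays.
lengths = data_df["nn_list"].map(len).to_numpy()
rows = np.repeat(np.arange(len(data_df)), lengths)
cols = np.concatenate(data_df["nn_list"].tolist())

# Row-wise dot products for all pairs at once, without a Python loop.
sims = np.asarray(unit[rows].multiply(unit[cols]).sum(axis=1)).ravel()
dists = 1.0 - sims  # cosine distance, as in the question's loop

# Keep the closest candidate per tweet.
pairs = pd.DataFrame({"row": rows, "candidate": cols, "distance": dists})
best = pairs.loc[pairs.groupby("row")["distance"].idxmin()]
print(best)
```

This does work proportional to the total number of candidates rather than to the square of the number of tweets. Tweets with an empty nn_list simply do not appear in `best`.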
