Calculate Cosine Similarity of all pairs in a column on a large data frame

Question

As shown below, my dataframe contains the following column:

I am intending to calculate a user-user cosine similarity matrix for all users.

Total users: 75541, hence total user pairs: 2853183570
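As a quick sanity check on that figure, the number of unordered user pairs is n(n-1)/2. The variable names below are only illustrative:

```python
from math import comb

n_users = 75541

# Number of unordered pairs of distinct users: n * (n - 1) / 2.
n_pairs = comb(n_users, 2)

print(n_pairs)
```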

I can do it with the .apply() method, but it would take a lot of time. Is there a faster way to do it?

score 2 · Answer 1 · answered Nov 23 '17 at 10:19

Look at this answer that I just found.

It uses scipy.sparse.csr_matrix to store the sparse matrix in compressed form.

Then it uses sklearn.metrics.pairwise.cosine_similarity to compute the cosine similarity.
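A minimal sketch of that approach, assuming the data can be arranged as one row per user and one numeric column per feature. The toy frame, its labels, and the variable names are hypothetical and are not taken from the question:

```python
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy data: one row per user, one column per feature.
df = pd.DataFrame(
    [[1.0, 0.0, 2.0],
     [0.0, 3.0, 0.0],
     [2.0, 0.0, 4.0]],
    index=["u1", "u2", "u3"],
    columns=["f1", "f2", "f3"],
)

# Compressed sparse row format stores only the non-zero entries.
sparse_features = csr_matrix(df.to_numpy())

# dense_output=False keeps the result sparse when the input is sparse.
similarity = cosine_similarity(sparse_features, dense_output=False)

# Converting to a labelled DataFrame is only reasonable for small results.
similarity_df = pd.DataFrame(
    similarity.toarray(), index=df.index, columns=df.index
)
print(similarity_df)
```

With 75541 users, a fully dense float64 result would hold 75541 × 75541 values, so keeping the output sparse (or computing it in blocks of rows) is probably necessary in practice.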

Or you can compute it using the function below.

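import numpy as np
import pandas as pd

def cosine_similarity(matrix):
    # matrix: DataFrame with one row per user and one numeric column per feature
    norm = pd.DataFrame(np.sqrt(np.square(matrix).sum(axis=1)))  # L2 norm of each row
    denominator = norm.dot(norm.T)    # outer product of the row norms
    numerator = matrix.dot(matrix.T)  # pairwise dot products between rows
    similarity_matrix = numerator.divide(denominator, axis=0)
    return similarity_matrix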

This function uses only matrix operations, with no apply.
