As shown below, my dataframe contains the following columns:

[image: dataframe columns]

I intend to calculate a user-user cosine similarity matrix for all users.

Total users: 75,541, hence total user pairs: 75,541 × 75,540 / 2 = 2,853,183,570.

I can do it with .apply(), but that would take a long time. Is there a faster way to do it?

1 Answer

Look at this answer that I just found.

It uses scipy.sparse.csr_matrix to store the sparse matrix in compressed form.

Then it uses sklearn.metrics.pairwise.cosine_similarity to compute the cosine similarity.
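
A minimal sketch of that approach might look like the following. Since the original dataframe is not shown, the long-format layout and the column names user_id, item_id and rating are assumptions made purely for illustration.

```python
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical long-format data; the column names are assumptions.
df = pd.DataFrame({
    "user_id": ["a", "a", "b", "b", "c"],
    "item_id": ["x", "y", "x", "y", "z"],
    "rating":  [1.0, 2.0, 2.0, 4.0, 5.0],
})

# Map users and items to integer row/column positions.
user_codes, users = pd.factorize(df["user_id"])
item_codes, items = pd.factorize(df["item_id"])

# Build a sparse user-by-item matrix (duplicate pairs are summed).
user_item = csr_matrix(
    (df["rating"].to_numpy(), (user_codes, item_codes)),
    shape=(len(users), len(items)),
)

# dense_output=False keeps the result as a sparse matrix.
similarity = cosine_similarity(user_item, dense_output=False)

# Converting to a dense array is only reasonable for small inputs.
sim_dense = similarity.toarray()
```

With 75,541 users, a dense float64 result would have about 5.7 billion entries (roughly 45 GB), so keeping the output sparse, or computing it in blocks of rows, is probably necessary.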

Or you can compute it with the function below.

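import numpy as np
import pandas as pd

def cosine_similarity(matrix):
    # Euclidean norm of each row (user) as a one-column DataFrame
    norm = pd.DataFrame(np.sqrt(np.square(matrix).sum(axis=1)))
    # Outer product of the norms: entry (i, j) is |u_i| * |u_j|
    denominator = norm.dot(norm.T)
    # Pairwise dot products between rows
    numerator = matrix.dot(matrix.T)
    similarity_matrix = numerator.divide(denominator, axis=0)
    return similarity_matrix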

This function uses only matrix operations, with no apply.
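
As a possible variant (not part of the linked answer), the rows can be scaled to unit length first, so that a single matrix product gives the similarities and no separate denominator matrix is needed. The function name and the small ratings table below are hypothetical, and the handling of all-zero rows is an added assumption.

```python
import numpy as np
import pandas as pd

def cosine_similarity_normalized(matrix: pd.DataFrame) -> pd.DataFrame:
    # Euclidean norm of each row (one value per user).
    norms = np.sqrt(np.square(matrix).sum(axis=1))
    # Avoid division by zero: all-zero rows are left as all zeros.
    norms = norms.replace(0, 1)
    # Scale each row to unit length, then take all pairwise dot products.
    unit = matrix.div(norms, axis=0)
    return unit.dot(unit.T)

# Hypothetical user-by-item table: one row per user, one column per item.
ratings = pd.DataFrame(
    [[1.0, 2.0, 0.0], [2.0, 4.0, 0.0], [0.0, 0.0, 5.0]],
    index=["a", "b", "c"],
    columns=["x", "y", "z"],
)
sim = cosine_similarity_normalized(ratings)
```

Here users a and b point in the same direction, so their similarity is 1 up to floating-point error, while user c shares no items with them and gets a similarity of 0.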

Dawei