
I want to calculate cosine similarity between articles, and I am running into the problem that my implementation would take too long for the size of the data I am going to run it on.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

I = np.array([[3, 45, 7, 2], [2, 54, 13, 15], [2, 54, 1, 13]])

II = np.array([[2, 54, 13, 15]])  # 2D: cosine_similarity expects 2D arrays

print(cosine_similarity(II, I))

With the example above, calculating the similarity between II and I already took 1.0 s, and the dimensions of my real data are around (100K, 2K).

Are there other packages I could use to run this on a huge matrix?

YAL
    Several examples are here: http://stackoverflow.com/questions/18424228/cosine-similarity-between-2-number-lists – midori Jan 20 '16 at 03:15
    @minitoto The top answer there is exactly the implementation I have, but I don't think it solves the problem of the big data size. – YAL Jan 20 '16 at 23:13

2 Answers


Using sklearn.preprocessing.normalize, this runs faster for me:

from sklearn.preprocessing import normalize

result = np.dot(normalize(np.atleast_2d(II), axis=1), normalize(I, axis=1).T)

(The dot product between unit-normalized vectors is equivalent to cosine similarity.)
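For the (100K, 2K) case in the question, a minimal sketch of this idea (the data here is a random placeholder and the names are hypothetical): normalize the full matrix once up front, so each subsequent query costs only one matrix-vector product.

import numpy as np
from sklearn.preprocessing import normalize

articles = np.random.rand(100000, 2000)  # placeholder at the question's scale

# Normalize the rows once; reuse for every query
articles_unit = normalize(articles, axis=1)

def cosine_to_all(query):
    # Cosine similarity of one query vector against every article row
    q = normalize(np.atleast_2d(query), axis=1).ravel()
    return articles_unit.dot(q)

sims = cosine_to_all(articles[0])  # shape (100000,)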

JARS

You can use pairwise_kernels with metric='cosine' and an n_jobs value greater than 1 (e.g. n_jobs=-1 to use all cores). That will divide the data into chunks and compute them in parallel.
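A minimal sketch of that approach (the data here is a random placeholder at the question's scale; n_jobs=-1 is one common choice):

import numpy as np
from sklearn.metrics.pairwise import pairwise_kernels

X = np.random.rand(100000, 2000)  # placeholder articles
Y = np.random.rand(1, 2000)       # one query article

# metric='cosine' computes cosine similarity; n_jobs=-1 splits the rows
# of X across all available CPU cores
sims = pairwise_kernels(X, Y, metric='cosine', n_jobs=-1)  # shape (100000, 1)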

Run2