It's a classic question, but I believe many people are still searching for answers. This question is different from this one, since mine is about an operation between two sparse vectors (not a matrix).
I wrote a blog post about how the cosine distance in SciPy Spatial Distance (SSD) gets slower as the dimensionality of the data grows (because it works on dense vectors). The post is in Indonesian, but the code, my experiment settings and the results (at the bottom of the post) should be easy to understand regardless of the language.
Currently this solution is more than 70 times faster than SSD on high-dimensional data, and it is more memory efficient:
import numpy as np

def fCosine(u, v):  # u, v: CSR vectors; returns the cosine dissimilarity
    uData = u.data
    vData = v.data
    denominator = np.sqrt(np.sum(uData**2)) * np.sqrt(np.sum(vData**2))
    if denominator > 0:
        uCol = u.indices  # np array of column indices
        vCol = v.indices
        intersection = set(np.intersect1d(uCol, vCol))
        uI = np.array([u1 for i, u1 in enumerate(uData) if uCol[i] in intersection])
        vI = np.array([v2 for j, v2 in enumerate(vData) if vCol[j] in intersection])
        return 1 - np.dot(uI, vI) / denominator
    else:
        return float("inf")
Is it possible to improve that function even further (more Pythonic, or via JIT/Cython)?
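One direction I have considered, as a rough sketch only: the two list comprehensions could probably be replaced by NumPy fancy indexing, since `np.intersect1d` can return the positions of the shared columns when called with `return_indices=True` (NumPy 1.15 or later). The function name `cosine_dissimilarity_sparse` is hypothetical, and the sketch assumes `u` and `v` are 1 x n `scipy.sparse.csr_matrix` rows with no duplicate column indices. I have not benchmarked it.

```python
import numpy as np
from scipy.sparse import csr_matrix

def cosine_dissimilarity_sparse(u, v):
    """Cosine dissimilarity of two 1 x n CSR row vectors (hypothetical helper)."""
    u_data, v_data = u.data, v.data
    # Norms only need the stored (nonzero) values.
    denominator = np.sqrt(np.dot(u_data, u_data)) * np.sqrt(np.dot(v_data, v_data))
    if denominator == 0:
        return float("inf")  # same convention as fCosine for an all-zero vector
    # Positions, within each vector's stored entries, of the shared column indices.
    # Assumption: column indices are unique within each vector.
    _, u_pos, v_pos = np.intersect1d(u.indices, v.indices, return_indices=True)
    return 1.0 - np.dot(u_data[u_pos], v_data[v_pos]) / denominator

# Small usage example with made-up data
u = csr_matrix(np.array([[1.0, 0.0, 2.0, 0.0, 0.0]]))
v = csr_matrix(np.array([[0.0, 0.0, 3.0, 4.0, 0.0]]))
print(cosine_dissimilarity_sparse(u, v))
```

The idea is to keep all per-element work inside NumPy instead of a Python-level loop; whether that actually beats the set-based version probably depends on how many nonzeros the vectors have.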