Faster Python Cosine dissimilarity between Scipy CSR "vectors"

Question

It's a classic question, but I believe many people still searching for answers. This question is a different than this one, since my question is operation between two sparse vectors (not a matrix).

I wrote a blog post about how Cosine Scipy Spatial Distance (SSD) is getting slower when the dimension of the data is getting higher (because it works on dense vectors). The post is in Indonesian language, but the code, my experiment settings & results should be easily understandable regardless of the language (at the bottom of the post).

Currently this solution is more than 70 times faster for high dimension data (compared to SSD) & more memory efficient:

    import numpy as np

    def fCosine(u,v): # u,v CSR vectors, Cosine Dissimilarity
        uData = u.data; vData = v.data
        denominator = np.sqrt(np.sum(uData**2)) * np.sqrt(np.sum(vData**2))
        if denominator>0:
            uCol = u.indices; vCol = v.indices # np array
            intersection = set(np.intersect1d(uCol,vCol))
            uI = np.array([u1 for i,u1 in enumerate(uData) if uCol[i] in intersection])
            vI = np.array([v2 for j,v2 in enumerate(vData) if vCol[j] in intersection])             
            return 1-np.dot(uI,vI)/denominator
        else:
            return float("inf")

Is it possible to even further improve that function (Pythonic or via JIT/Cython???).

unutbu · Accepted Answer · 2016-04-03T23:04:50.620

Here is an alternative, alt_fCosine, which (on my machine) is about 3x faster for CSR vectors of size 10**5 and 10**4 non-zero elements:

import scipy.sparse as sparse
import numpy as np
import math

def fCosine(u,v): # u,v CSR vectors, Cosine Dissimilarity
    uData = u.data; vData = v.data
    denominator = np.sqrt(np.sum(uData**2)) * np.sqrt(np.sum(vData**2))
    if denominator>0:
        uCol = u.indices; vCol = v.indices # np array
        intersection = set(np.intersect1d(uCol,vCol))
        uI = np.array([u1 for i,u1 in enumerate(uData) if uCol[i] in intersection])
        vI = np.array([v2 for j,v2 in enumerate(vData) if vCol[j] in intersection])             
        return 1-np.dot(uI,vI)/denominator
    else:
        return float("inf")

def alt_fCosine(u,v): 
    uData, vData = u.data, v.data
    denominator = math.sqrt(np.sum(uData**2) * np.sum(vData**2))
    if denominator>0:
        uCol, vCol = u.indices, v.indices 
        uI = uData[np.in1d(uCol, vCol)]
        vI = vData[np.in1d(vCol, uCol)]
        return 1-np.dot(uI,vI)/denominator
    else:
        return float("inf")

# Check that they return the same result
N = 10**5
u = np.round(10*sparse.random(1, N, density=0.1, format='csr'))
v = np.round(10*sparse.random(1, N, density=0.1, format='csr'))
assert np.allclose(fCosine(u, v), alt_fCosine(u, v))

alt_fCosine replaces two list comprehensions, a call to np.intersection1d and the formation of a Python set with two calls to np.in1d and advanced indexing.

For N = 10**5:

In [322]: %timeit fCosine(u, v)
100 loops, best of 3: 5.73 ms per loop

In [323]: %timeit alt_fCosine(u, v)
1000 loops, best of 3: 1.62 ms per loop

In [324]: 5.73/1.62
Out[324]: 3.537037037037037

Awesome, thanks so much ... I wonder why math.sqrt is faster than numpy.sqrt? Is in general math is faster for simple domain (scalars/list)? — taufikedys, Apr 05 '16 at 06:24
Yes, `math.sqrt` is faster for scalars. I suspect this is also true for all the functions in the `math` module, since unlike the corresponding NumPy functions, they do not have to test alternate code paths (if array do this, if iterable do that, etc.). — unutbu, Apr 05 '16 at 11:20

Faster Python Cosine dissimilarity between Scipy CSR "vectors"

1 Answers1