I noticed that both scipy and sklearn have cosine similarity/cosine distance functions. I wanted to test the speed of each on pairs of vectors:

setup1 = "import numpy as np; arrs1 = [np.random.rand(400) for _ in range(60)];arrs2 = [np.random.rand(400) for _ in range(60)]"
setup2 = "import numpy as np; arrs1 = [np.random.rand(400) for _ in range(60)];arrs2 = [np.random.rand(400) for _ in range(60)]"

import1 = "from sklearn.metrics.pairwise import cosine_similarity"
stmt1 = "[float(cosine_similarity(arr1.reshape(1,-1), arr2.reshape(1,-1))) for arr1, arr2 in zip(arrs1, arrs2)]"

import2 = "from scipy.spatial.distance import cosine"
stmt2 = "[float(1 - cosine(arr1, arr2)) for arr1, arr2 in zip(arrs1, arrs2)]"

import timeit
print("sklearn: ", timeit.timeit(stmt1, setup=import1 + ";" + setup1, number=1000))
print("scipy:   ", timeit.timeit(stmt2, setup=import2 + ";" + setup2, number=1000))
sklearn:  11.072769448000145
scipy:    1.9755544730005568

sklearn runs about 5.6 times slower than scipy in this run (and it stays slower even if you remove the array reshape from the sklearn example and generate data that's already in the right shape). Why is one significantly slower than the other?

Jay Mody
  • I am not familiar with the inner workings of `sklearn` or `scipy`; however, besides the fact that you are reshaping the arrays in one experiment and not in the other, I don't think it's a fair comparison: `cosine_similarity` computes the pairwise cosine similarity of all the samples in the two input arrays (even though you are invoking it on arrays of one sample each), whereas the `cosine` function in `scipy` works only on 1D-arrays and therefore might have a much more efficient implementation. – today Apr 29 '20 at 00:43
  • @today Even if you get rid of the array reshaping (create the arrays using `np.random.rand(1, 400)` instead of `np.random.rand(400)` to prevent the reshape), sklearn is still slower. I suspect the fact that sklearn is designed for 2d-arrays might have something to do with it, but still, the performance difference is quite a lot. – Jay Mody Apr 29 '20 at 01:12

1 Answer

As mentioned in the comments, I don't think the comparison is fair, mainly because sklearn.metrics.pairwise.cosine_similarity is designed to compute the pairwise similarity between all samples in two 2-D input arrays. scipy.spatial.distance.cosine, on the other hand, is designed to compute the cosine distance between two 1-D arrays.
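
To make the difference in shape concrete, here is a minimal NumPy-only sketch of the two kinds of computation. The names `cosine_1d` and `cosine_pairwise` are hypothetical helpers written for illustration; they are not the actual sklearn or scipy implementations, and they assume no input vector is all zeros.

```python
import numpy as np

def cosine_1d(u, v):
    """Cosine similarity of two 1-D vectors; returns a single float."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def cosine_pairwise(X, Y):
    """Cosine similarity of every row of X with every row of Y.

    X has shape (n, d), Y has shape (m, d); the result has shape (n, m).
    """
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)  # normalize each row
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return Xn @ Yn.T  # one matrix product covers all pairs

rng = np.random.default_rng(0)
a, b = rng.random(400), rng.random(400)

scalar = cosine_1d(a, b)                                      # plain float
matrix = cosine_pairwise(a.reshape(1, -1), b.reshape(1, -1))  # 1x1 matrix
```

Calling the pairwise version on a single pair still goes through the general 2-D path and returns a 1x1 matrix, so it does more bookkeeping than the 1-D version for the same number.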

A fairer comparison might be scipy.spatial.distance.cdist vs. sklearn.metrics.pairwise.cosine_similarity, since both compute pairwise values between the samples in the given arrays. To my surprise, though, this shows the sklearn implementation to be much faster than the scipy one (I don't have an explanation for that currently!). Here is the experiment:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from scipy.spatial.distance import cdist

x = np.random.rand(1000,1000)
y = np.random.rand(1000,1000)

def sklearn_cosine():
    return cosine_similarity(x, y)

def scipy_cosine():
    return 1. - cdist(x, y, 'cosine')

# Make sure their result is the same.
assert np.allclose(sklearn_cosine(), scipy_cosine())

And here is the timing result:

%timeit sklearn_cosine()
10 loops, best of 3: 74 ms per loop

%timeit scipy_cosine()
1 loop, best of 3: 752 ms per loop
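
For the original use case in the question (the similarity of `arrs1[i]` with `arrs2[i]` only, not all pairs), neither experiment is an exact match. A minimal sketch of one way to do that in a single vectorized step with plain NumPy follows. The helper name `cosine_rowwise` is hypothetical, it is not part of either library, and it assumes no row is all zeros; I have not timed it against the two functions above.

```python
import numpy as np

def cosine_rowwise(A, B):
    """Cosine similarity of A[i] with B[i] for each i; returns shape (n,)."""
    dots = np.einsum("ij,ij->i", A, B)  # row-by-row dot products
    norms = np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1)
    return dots / norms

rng = np.random.default_rng(0)
A = rng.random((60, 400))  # 60 vectors of length 400, stacked as rows
B = rng.random((60, 400))
sims = cosine_rowwise(A, B)

# Cross-check against an explicit per-pair loop.
expected = np.array(
    [a @ b / (np.linalg.norm(a) * np.linalg.norm(b)) for a, b in zip(A, B)]
)
assert np.allclose(sims, expected)
```

Stacking the vectors into two 2-D arrays up front avoids one Python-level function call per pair.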
today
  • I'm doing some work with cosine similarity at the moment. Scipy appears to run the job in a couple of Python loops, whereas Sklearn appears to use vectorized functions on the entire matrix. If you're doing a really small job, it will actually be quicker to use Scipy, but if both X and Y are large, you'll want Sklearn. – jameslol Oct 03 '22 at 04:13