I noticed that both scipy and sklearn have cosine similarity/cosine distance functions. I wanted to test the speed of each on pairs of vectors:
setup1 = "import numpy as np; arrs1 = [np.random.rand(400) for _ in range(60)];arrs2 = [np.random.rand(400) for _ in range(60)]"
setup2 = "import numpy as np; arrs1 = [np.random.rand(400) for _ in range(60)];arrs2 = [np.random.rand(400) for _ in range(60)]"
import1 = "from sklearn.metrics.pairwise import cosine_similarity"
stmt1 = "[float(cosine_similarity(arr1.reshape(1,-1), arr2.reshape(1,-1))[0, 0]) for arr1, arr2 in zip(arrs1, arrs2)]"
import2 = "from scipy.spatial.distance import cosine"
stmt2 = "[float(1 - cosine(arr1, arr2)) for arr1, arr2 in zip(arrs1, arrs2)]"
import timeit
print("sklearn: ", timeit.timeit(stmt1, setup=import1 + ";" + setup1, number=1000))
print("scipy: ", timeit.timeit(stmt2, setup=import2 + ";" + setup2, number=1000))
sklearn: 11.072769448000145
scipy: 1.9755544730005568
sklearn runs roughly 5.6 times slower than scipy in this run (even if you remove the array reshape for the sklearn example and generate data that's already in the right shape). Why is one significantly slower than the other?
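The pre-shaped variant mentioned in the parenthetical could look roughly like the sketch below. The names `rows1`, `rows2`, and `sklearn_pairs`, the seeded generator, and the smaller repetition count are assumptions made for illustration and are not part of the original benchmark.

```python
import timeit

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Generate each vector directly as a (1, 400) row, so no reshape is
# needed inside the timed statement. (Hypothetical names and seed.)
rng = np.random.default_rng(0)
rows1 = [rng.random((1, 400)) for _ in range(60)]
rows2 = [rng.random((1, 400)) for _ in range(60)]


def sklearn_pairs():
    # For two (1, 400) inputs, cosine_similarity returns a (1, 1) array;
    # [0, 0] pulls out the single similarity value.
    return [cosine_similarity(a, b)[0, 0] for a, b in zip(rows1, rows2)]


sims = sklearn_pairs()

# Fewer repetitions than the original number=1000, to keep the run short.
elapsed = timeit.timeit(sklearn_pairs, number=100)
print("sklearn, pre-shaped:", elapsed)
```

Because the reshape is gone from the timed statement, whatever gap remains against the scipy version has to come from the `cosine_similarity` call itself.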