What is the best way to implement an element-wise cosine similarity in Python?

Question

The code below is very inefficient given large matrices. Is there a better way to implement this ?

I have already searched the web for this here.

import numpy as np

def cosine_similarity(x, y):
    return np.dot(x, y) / (np.sqrt(np.dot(x, x)) * np.sqrt(np.dot(y, y)))

def compare(a, b):

    c = np.zeros((a.shape[0], b.shape[0]))

    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i, j] = cosine_similarity(ai, bj)

    return c

a = np.random.rand(100,2000)
b = np.random.rand(800,2000)

compare(a,b) # shape -> (100, 800)

@Divakar interesting, I would not have expected `dot` to work on matrices with `len(a.shape) != 1` — Dan, Jul 08 '19 at 16:46
Loic, your algorithm is simply matrix multiplication. Use numpy's builtin capability and should be reasonably fast. But bear in mind your runtime will be almost cubic so if those matrices are getting _really_ big, it will be slow. In that case you'll need to establish if you really need to calculate every element of `c`. — Dan, Jul 08 '19 at 16:50
Why is this function called `compare()`? Isn't this just matrix multiplication? — norok2, Jul 08 '19 at 16:50
Thanks you for your answers. Yes I agree this is a simple matrix multiplication. I have chosen to use the function np.dot as an illustrative purpose. However, the true method I would like to use is ```python def cosine_similarity(x, y): return np.dot(x, y) / (np.sqrt(np.dot(x, x)) * np.sqrt(np.dot(y, y))) ``` instead of np.dot (this is the reason my method is called ```compare``` ) — Loïc Sacré, Jul 08 '19 at 17:36

Inko Bovenzi · Answer 1 · 2019-07-08T17:59:13.580

1

As in the comments, if you want to take the product of two matrices, then numpy already has an efficient implementation of this, but it might be too slow for you (O(n^3)).

import numpy as np

a=np.array([3,2,1])
b=np.array([1,2,3])
c=a.dot(b)
print(c) #output = 10

I saw in the comments that you were interested in the cosine distance between vectors. For the cosine similarity, consider using Scipy:

from scipy.spatial.distance import cosine

a=[1,0,1]
b=[0,1,0]
print(cosine(a,b)) #output = 1.0

This might be faster for your needs. Here is the documentation.

edited Jul 08 '19 at 17:59

answered Jul 08 '19 at 17:40

Inko Bovenzi

71
6

Thank you, I have tried to using Scipy but it did not improve the efficiency unfortunately. Nevertheless, I have written a much faster implementation, I am going to post it as an answer – Loïc Sacré Jul 08 '19 at 18:16

score 0 · Answer 2 · answered Jul 08 '19 at 18:18

0

[Personal edit]

In order to compute the cosine similarity efficiently, here is a solution I have written:

def compare(a, b):
    x = np.atleast_2d(np.sqrt(np.sum(a*a, axis=1))).T
    y = np.atleast_2d(np.sqrt(np.sum(b*b, axis=1))).T
    return a.dot(b.T) / x.dot(y.T)

answered Jul 08 '19 at 18:18

Loïc Sacré

53
1
7

Did you try [`cdist`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdist.html)? – Dan Jul 08 '19 at 21:25
Thank you for introducing this solution, I have tried with the following code ```cdist(a, b, cosine_similarity)```. The results are the same but it is still slower. – Loïc Sacré Jul 08 '19 at 22:27
Try `cdist(a, b, 'cosine')` as per the docs. This also has the advantage of already being thoroughly unit and user tested. – Dan Jul 08 '19 at 22:31
1

As their definition is a bit different, I have to do ```- (cdist(a, b, 'cosine') - 1)```. If the inputs are respectively ```a = np.random.rand(100,20000)``` and ```b = np.random.rand(800,20000)``` I obtain 2.26sec (cdist) vs 0.14sec – Loïc Sacré Jul 08 '19 at 22:37

What is the best way to implement an element-wise cosine similarity in Python?

2 Answers2