2

The code below is very inefficient given large matrices. Is there a better way to implement this ?

I have already searched the web for this here.

import numpy as np

def cosine_similarity(x, y):
    return np.dot(x, y) / (np.sqrt(np.dot(x, x)) * np.sqrt(np.dot(y, y)))

def compare(a, b):

    c = np.zeros((a.shape[0], b.shape[0]))

    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i, j] = cosine_similarity(ai, bj)

    return c

a = np.random.rand(100,2000)
b = np.random.rand(800,2000)

compare(a,b) # shape -> (100, 800)
Loïc Sacré
  • 53
  • 1
  • 7
  • How about `a.dot(b.T)`? – Divakar Jul 08 '19 at 16:43
  • @Divakar not `a@b.T`? – Dan Jul 08 '19 at 16:43
  • 1
    @Dan Should be the same. @ is specific to Python3.x. – Divakar Jul 08 '19 at 16:44
  • @Divakar interesting, I would not have expected `dot` to work on matrices with `len(a.shape) != 1` – Dan Jul 08 '19 at 16:46
  • Loic, your algorithm is simply matrix multiplication. Use numpy's builtin capability and should be reasonably fast. But bear in mind your runtime will be almost cubic so if those matrices are getting _really_ big, it will be slow. In that case you'll need to establish if you really need to calculate every element of `c`. – Dan Jul 08 '19 at 16:50
  • 1
    Why is this function called `compare()`? Isn't this just matrix multiplication? – norok2 Jul 08 '19 at 16:50
  • Thanks you for your answers. Yes I agree this is a simple matrix multiplication. I have chosen to use the function np.dot as an illustrative purpose. However, the true method I would like to use is ```python def cosine_similarity(x, y): return np.dot(x, y) / (np.sqrt(np.dot(x, x)) * np.sqrt(np.dot(y, y))) ``` instead of np.dot (this is the reason my method is called ```compare``` ) – Loïc Sacré Jul 08 '19 at 17:36

2 Answers2

1

As in the comments, if you want to take the product of two matrices, then numpy already has an efficient implementation of this, but it might be too slow for you (O(n^3)).

import numpy as np

a=np.array([3,2,1])
b=np.array([1,2,3])
c=a.dot(b)
print(c) #output = 10

I saw in the comments that you were interested in the cosine distance between vectors. For the cosine similarity, consider using Scipy:

from scipy.spatial.distance import cosine

a=[1,0,1]
b=[0,1,0]
print(cosine(a,b)) #output = 1.0

This might be faster for your needs. Here is the documentation.

  • Thank you, I have tried to using Scipy but it did not improve the efficiency unfortunately. Nevertheless, I have written a much faster implementation, I am going to post it as an answer – Loïc Sacré Jul 08 '19 at 18:16
0

[Personal edit]

In order to compute the cosine similarity efficiently, here is a solution I have written:

def compare(a, b):
    x = np.atleast_2d(np.sqrt(np.sum(a*a, axis=1))).T
    y = np.atleast_2d(np.sqrt(np.sum(b*b, axis=1))).T
    return a.dot(b.T) / x.dot(y.T)
Loïc Sacré
  • 53
  • 1
  • 7
  • Did you try [`cdist`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdist.html)? – Dan Jul 08 '19 at 21:25
  • Thank you for introducing this solution, I have tried with the following code ```cdist(a, b, cosine_similarity)```. The results are the same but it is still slower. – Loïc Sacré Jul 08 '19 at 22:27
  • Try `cdist(a, b, 'cosine')` as per the docs. This also has the advantage of already being thoroughly unit and user tested. – Dan Jul 08 '19 at 22:31
  • 1
    As their definition is a bit different, I have to do ```- (cdist(a, b, 'cosine') - 1)```. If the inputs are respectively ```a = np.random.rand(100,20000)``` and ```b = np.random.rand(800,20000)``` I obtain 2.26sec (cdist) vs 0.14sec – Loïc Sacré Jul 08 '19 at 22:37