4

I have two 2D np.arrays let's call them A and B, both having the shape. For every vector in 2D array A I need to find the vector in matrix B, that have the minimum cosine distance. To do this I just have a double for loop inside of which I try to find the minimum value. So basically I do the following:

from scipy.spatial.distance import cosine
l, res = A.shape[0], []
for i in xrange(l):
    minimum = min((cosine(A[i], B[j]), j) for j in xrange(l))
    res.append(minimum[1])

In the code above one of the loop is hidden behind a comprehension. Everything works fine, but the double for loop makes it too slow (I tried to rewrite it with a double comprehension, which made things a little bit faster, but still slow).

I believe that there is a numpy function that can achieve the following faster (using some linear-algebra).

So is there a way to achieve what I want faster?

Salvador Dali
  • 214,103
  • 147
  • 703
  • 753

2 Answers2

3

From the cosine docs we have the following info -

scipy.spatial.distance.cosine(u, v) : Computes the Cosine distance between 1-D arrays.

The Cosine distance between u and v, is defined as

enter image description here

where u⋅v is the dot product of u and v.

Using the above formula, we would have one vectorized solution using `NumPy's broadcasting capability, like so -

# Get the dot products, L2 norms and thus cosine distances
dots = np.dot(A,B.T)
l2norms = np.sqrt(((A**2).sum(1)[:,None])*((B**2).sum(1)))
cosine_dists = 1 - (dots/l2norms)

# Get min values (if needed) and corresponding indices along the rows for res.
# Take care of zero L2 norm values, by using nanmin and nanargmin  
minval = np.nanmin(cosine_dists,axis=1)
cosine_dists[np.isnan(cosine_dists).all(1),0] = 0
res = np.nanargmin(cosine_dists,axis=1)

Runtime tests -

In [81]: def org_app(A,B):
    ...:    l, res, minval = A.shape[0], [], []
    ...:    for i in xrange(l):
    ...:        minimum = min((cosine(A[i], B[j]), j) for j in xrange(l))
    ...:        res.append(minimum[1])
    ...:        minval.append(minimum[0])
    ...:    return res, minval
    ...: 
    ...: def vectorized(A,B):
    ...:     dots = np.dot(A,B.T)
    ...:     l2norms = np.sqrt(((A**2).sum(1)[:,None])*((B**2).sum(1)))
    ...:     cosine_dists = 1 - (dots/l2norms)
    ...:     minval = np.nanmin(cosine_dists,axis=1)
    ...:     cosine_dists[np.isnan(cosine_dists).all(1),0] = 0
    ...:     res = np.nanargmin(cosine_dists,axis=1)
    ...:     return res, minval
    ...: 

In [82]: A = np.random.rand(400,500)
    ...: B = np.random.rand(400,500)
    ...: 

In [83]: %timeit org_app(A,B)
1 loops, best of 3: 10.8 s per loop

In [84]: %timeit vectorized(A,B)
10 loops, best of 3: 145 ms per loop

Verify results -

In [86]: x1, y1 = org_app(A, B)
    ...: x2, y2 = vectorized(A, B)
    ...: 

In [87]: np.allclose(np.asarray(x1),x2)
Out[87]: True

In [88]: np.allclose(np.asarray(y1)[~np.isnan(np.asarray(y1))],y2[~np.isnan(y2)])
Out[88]: True
Divakar
  • 218,885
  • 19
  • 262
  • 358
  • Here is one example: all I do is just `A = np.load('A.npy')`, `B = np.load('B.npy')`, `x1, y1 = org_app(A, B)`, `x2, y2 = vectorized(A, B)` and then check x1 and list(x2). I get `x1=[459, 571, 526, 477, 309, 498, 504, 4, 529, ..` and `x2=[29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29,...`. [A.npy](http://www.filedropper.com/a_7), [B.npy](http://www.filedropper.com/b_1) – Salvador Dali Sep 21 '15 at 09:04
  • 1
    There's also [scipy.spatial.distance.cdist](http://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.spatial.distance.cdist.html) for calculating distances between all pairs across individual inputs. But it looks like it's a little slower than the solution here. – user2034412 Sep 21 '15 at 09:11
  • @SalvadorDali I don't think `scipy.spatial.distance.cdist` would give you the same results as cosine distances. To verify, look at the mininum distances and not just the min arguments from these two approaches. – Divakar Sep 21 '15 at 14:37
  • I posted a version with `scipy.spatial.distance.cdist` which does return the same results for random input but is slower. I can't access A.npy or B.npy so I can't check the results there. If you use what I posted you might have to fiddle with it to get it to deal with NaN the way you wanted. – user2034412 Sep 21 '15 at 16:54
1

Using scipy.spatial.distance.cdist:

from scipy.spatial.distance import cdist

def cdist_func(A, B):
    dists = cdist(A, B, 'cosine')
    return np.argmin(dists, axis=1), np.min(dists, axis=1)

It gets the same results as Divakar's answer:

x2, y2 = vectorized(A, B)
x3, y3 = cdist_func(A, B)

np.allclose(x2, x3) # True
np.allclose(y2, y3) # True

But it's not as fast:

%timeit vectorized(A, B) # 11.9 ms per loop
%timeit cdist_func(A, B) # 85.9 ms per loop
user2034412
  • 4,102
  • 2
  • 23
  • 22