Cosine similarity between matching rows in numpy ndarrays

Question

I have two ndarrays of shape (n_samples, n_dimensions), and I want the cosine similarity for each pair of corresponding rows, so the output would have shape (n_samples,).

Using sklearn's implementation I get an (n_samples, n_samples) result, which performs a lot of irrelevant calculations and is unacceptable in my case.

Using 1 minus scipy's cosine distance is not an option either, because it expects vectors, not matrices.

What would be the most efficient way to execute what I'm looking for?

I'm looking for the similarity between each `x[n, :]` and `y[n, :]` - this should give me `n` results — bluesummers, Mar 11 '18 at 08:58
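
For reference, the wasted work described above can be seen in a NumPy-only sketch: building the full (n_samples, n_samples) similarity matrix and keeping only its diagonal does give the wanted values, but it computes n_samples squared entries to use n_samples of them. The name `pairwise_then_diagonal` is made up for illustration and is not part of sklearn or scipy; the sketch also assumes no row is all zeros.

```python
import numpy as np

def pairwise_then_diagonal(x, y):
    # Normalize every row to unit length (assumes no all-zero rows).
    x_unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    y_unit = y / np.linalg.norm(y, axis=1, keepdims=True)
    # Full (n_samples, n_samples) matrix of all-pairs similarities.
    full = x_unit @ y_unit.T
    # Only the diagonal holds the matching-row similarities.
    return full, np.diag(full)

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 3))
y = rng.standard_normal((5, 3))
full, matching = pairwise_then_diagonal(x, y)
print(full.shape, matching.shape)  # (5, 5) (5,)
```

Everything off the diagonal of `full` is discarded, which is the cost the question wants to avoid.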

cs95 · Accepted Answer · 2018-03-11T09:22:15.843


Assuming the two arrays x and y have the same shape:

1. Compute the row-wise dot products using np.einsum
2. Compute the product of the L2 (Euclidean) norms of each row of x and y
3. Divide the result from (1) by the result from (2)

import numpy as np

def matrix_cosine(x, y):
    # Row-wise dot products divided by the product of the row norms.
    return np.einsum('ij,ij->i', x, y) / (
        np.linalg.norm(x, axis=1) * np.linalg.norm(y, axis=1)
    )
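
To make the einsum subscripts less opaque, here is a small worked sketch on hand-checkable inputs. The arrays are made up for illustration. It checks that `'ij,ij->i'` gives the same result as multiplying element-wise and summing along each row, then prints the similarities.

```python
import numpy as np

def matrix_cosine(x, y):
    return np.einsum('ij,ij->i', x, y) / (
        np.linalg.norm(x, axis=1) * np.linalg.norm(y, axis=1)
    )

# Illustrative inputs: orthogonal rows, parallel rows, and a 3-4-5 pair.
x = np.array([[1.0, 0.0], [1.0, 1.0], [3.0, 4.0]])
y = np.array([[0.0, 1.0], [2.0, 2.0], [4.0, 3.0]])

# 'ij,ij->i' multiplies element-wise and sums over j: one dot product per row.
row_dots = np.einsum('ij,ij->i', x, y)
assert np.allclose(row_dots, (x * y).sum(axis=1))

print(row_dots)             # dot products: 0, 4, 24
print(matrix_cosine(x, y))  # approximately 0, 1, 0.96
```

The last row works out as 24 / (5 * 5) = 0.96, since both vectors have norm 5.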

And a little code to test it:

x = np.random.randn(100000, 100)

%timeit matrix_cosine(x, x)
82.8 ms ± 2.94 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

assert np.allclose(matrix_cosine(x, x), np.ones(x.shape[0]))
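
One caveat the test above does not exercise: if a row of x or y is all zeros, its norm is 0 and the division produces nan. A possible guard, sketched here under the assumption that returning 0 for such rows is acceptable, clamps the denominator. The name `safe_matrix_cosine` and the `eps` parameter are hypothetical, not part of NumPy.

```python
import numpy as np

def safe_matrix_cosine(x, y, eps=1e-12):
    # Hypothetical variant: clamp the denominator so all-zero rows yield 0
    # instead of nan. Returning 0 for such rows is a convention chosen here.
    dots = np.einsum('ij,ij->i', x, y)
    norms = np.linalg.norm(x, axis=1) * np.linalg.norm(y, axis=1)
    return dots / np.maximum(norms, eps)

x = np.array([[0.0, 0.0], [1.0, 2.0]])
y = np.array([[1.0, 1.0], [2.0, 4.0]])
print(safe_matrix_cosine(x, y))  # first entry is 0, second is approximately 1
```

Note that rows whose norm product is nonzero but below `eps` would be scaled down by the clamp, so `eps` should stay well under the magnitudes in your data.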

edited Mar 11 '18 at 09:22

answered Mar 11 '18 at 09:07


This is perfect, I'm still surprised this is not trivial in any of the libraries – bluesummers Mar 11 '18 at 09:23
@bluesummers This code is licensed under the "do whatever the hell you want with it" license, feel free to issue a PR to the devs on git with it. Cheers :) – cs95 Mar 11 '18 at 09:23
