
I have two ndarrays of shape (n_samples, n_dimensions), and I want the cosine similarity of each pair of corresponding rows, so the output would have shape (n_samples,).

Using sklearn's implementation I get an (n_samples, n_samples) result, which performs a lot of irrelevant calculations; that is unacceptable in my case.

Using 1 - scipy's cosine distance is not possible, because it expects vectors and not matrices.

What would be the most efficient way to execute what I'm looking for?

cs95
bluesummers

1 Answer


Assuming the two arrays x and y have the same shape,

  1. Compute the row-wise dot product of x and y using np.einsum
  2. Compute the product of the L2 (Euclidean) norms of the corresponding rows of x and y
  3. Divide the result from (1) by (2)

import numpy as np

def matrix_cosine(x, y):
    # Row-wise dot products divided by the product of the row-wise L2 norms.
    return np.einsum('ij,ij->i', x, y) / (
        np.linalg.norm(x, axis=1) * np.linalg.norm(y, axis=1)
    )

And a little code to test it:

x = np.random.randn(100000, 100)

%timeit matrix_cosine(x, x)
82.8 ms ± 2.94 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

assert np.allclose(matrix_cosine(x, x), np.ones(x.shape[0]))
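One caveat that may be worth checking: if any row of x or y is all zeros, the division above produces NaN for that row (with a runtime warning). Below is a minimal sketch of a guarded variant, assuming you want 0.0 for such rows. The name matrix_cosine_safe and the eps threshold are illustrative choices and not part of the answer above.

```python
import numpy as np

def matrix_cosine_safe(x, y, eps=1e-12):
    """Row-wise cosine similarity of x and y, both (n_samples, n_dimensions).

    Hypothetical variant: rows whose norm product is <= eps get 0.0
    instead of NaN. Whether 0.0 is the right fallback is an assumption.
    """
    # Row-wise dot products, shape (n_samples,)
    dots = np.einsum('ij,ij->i', x, y)
    # Product of the row-wise L2 norms, shape (n_samples,)
    norms = np.linalg.norm(x, axis=1) * np.linalg.norm(y, axis=1)
    # Divide only where the denominator is non-negligible; other entries stay 0.0
    out = np.zeros(dots.shape, dtype=float)
    np.divide(dots, norms, out=out, where=norms > eps)
    return out

# Small example: orthogonal rows, a zero row, and parallel rows
a = np.array([[1.0, 0.0], [0.0, 0.0], [1.0, 1.0]])
b = np.array([[0.0, 1.0], [1.0, 2.0], [2.0, 2.0]])
print(matrix_cosine_safe(a, b))  # approximately [0, 0, 1]
```

The where= mask skips the division for the flagged rows, so no warning is raised and the result keeps the (n_samples,) shape.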
cs95
  • This is perfect, I'm still surprised this is not trivial in any of the libraries – bluesummers Mar 11 '18 at 09:23
  • @bluesummers This code is licensed under the "do whatever the hell you want with it" license, feel free to issue a PR to the devs on git with it. Cheers :) – cs95 Mar 11 '18 at 09:23