4

I am trying to compute a vectorized implementation of Euclidean distance(between each element in X and Y using inner product). The data as follows:

X = np.random.uniform(low=0, high=1, size=(10000, 5))
Y = np.random.uniform(low=0, high=1, size=(10000, 5))

What I did was:

euclidean_distances_vectorized = np.array(np.sqrt(np.sum(X**2, axis=1) - 2 * np.dot(X, Y.T) + np.sum(Y**2, axis=1)))

Although this gives 'some output' the answer is wrong as each row still contains 5 elements.

Does anyone know what I am doing wrong?

tavalendo
  • 857
  • 2
  • 11
  • 30

1 Answers1

5

If I understood correctly this should do

np.linalg.norm(X - Y, axis=1)

Or with einsum (square root of the dot product of each difference pair along the first axis)

np.sqrt(np.einsum('ij,ij->i...', X - Y, X - Y))

If you want all pairwise distances

from scipy.spatial.distance import cdist

cdist(X, Y)
filippo
  • 5,197
  • 2
  • 21
  • 44
  • 3
    Or `np.sqrt(np.einsum('ij,ij->i...', *2*(X - Y,)))` so you don't calculate `X-Y` twice. – Paul Panzer Jun 03 '18 at 19:02
  • 2
    Also, `np.sqrt(ne.evaluate("(X - Y) ** 2").sum(1))` seems even faster with `numexpr`. – hilberts_drinking_problem Jun 03 '18 at 19:08
  • @PaulPanzer nice trick, didn't know about it. A bit obscure syntax though, shouldn't the interpreter do some kind of caching there? – filippo Jun 03 '18 at 19:17
  • Most of the time it's simpler to just create an intermediate `D = X-Y; np.sqrt ...` but the trick is useful for `lambda`s and one liners. Re the caching I'm not sure `X-Y` is officially guaranteed to always return the same answer, or have no side-effects, so I don't know whether the interpreter can rely on that. – Paul Panzer Jun 03 '18 at 19:54
  • 1
    Wow. You guys are amazing. Thank you. – tavalendo Jun 03 '18 at 22:20