Vectorized implementation for Euclidean distance

Question

I am trying to compute a vectorized implementation of Euclidean distance(between each element in X and Y using inner product). The data as follows:

X = np.random.uniform(low=0, high=1, size=(10000, 5))
Y = np.random.uniform(low=0, high=1, size=(10000, 5))

What I did was:

euclidean_distances_vectorized = np.array(np.sqrt(np.sum(X**2, axis=1) - 2 * np.dot(X, Y.T) + np.sum(Y**2, axis=1)))

Although this gives 'some output' the answer is wrong as each row still contains 5 elements.

Does anyone know what I am doing wrong?

@feijao `scipy` has a function that computes pairwise euclidean distances between two arrays in a vectorized way. — hilberts_drinking_problem, Jun 03 '18 at 16:34

filippo · Accepted Answer · 2018-06-03T17:10:53.953

5

If I understood correctly this should do

np.linalg.norm(X - Y, axis=1)

Or with einsum (square root of the dot product of each difference pair along the first axis)

np.sqrt(np.einsum('ij,ij->i...', X - Y, X - Y))

If you want all pairwise distances

from scipy.spatial.distance import cdist

cdist(X, Y)

edited Jun 03 '18 at 17:10

answered Jun 03 '18 at 16:44

filippo

3

Or `np.sqrt(np.einsum('ij,ij->i...', *2*(X - Y,)))` so you don't calculate `X-Y` twice. – Paul Panzer Jun 03 '18 at 19:02
2

Also, `np.sqrt(ne.evaluate("(X - Y) ** 2").sum(1))` seems even faster with `numexpr`. – hilberts_drinking_problem Jun 03 '18 at 19:08
@PaulPanzer nice trick, didn't know about it. A bit obscure syntax though, shouldn't the interpreter do some kind of caching there? – filippo Jun 03 '18 at 19:17
Most of the time it's simpler to just create an intermediate `D = X-Y; np.sqrt ...` but the trick is useful for `lambda`s and one liners. Re the caching I'm not sure `X-Y` is officially guaranteed to always return the same answer, or have no side-effects, so I don't know whether the interpreter can rely on that. – Paul Panzer Jun 03 '18 at 19:54
1

Wow. You guys are amazing. Thank you. – tavalendo Jun 03 '18 at 22:20

1 Answers1