
I am looking for an efficient way (no for loops) to compute the Euclidean distance between a set of samples and a set of cluster centroids.

Example:

import numpy as np
X = np.array([[1,2,3],[1, 1, 1],[0, 2, 0]])
y = np.array([[1,2,3], [0, 1, 0]])

Expected output:

array([[ 0., 11.],
       [ 5.,  2.],
       [10.,  1.]])

This is the squared Euclidean distance between each sample in X and each centroid in y.
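For reference, the expected output can be reproduced with a plain double loop (the very thing to be avoided):

```python
import numpy as np

X = np.array([[1, 2, 3], [1, 1, 1], [0, 2, 0]])
y = np.array([[1, 2, 3], [0, 1, 0]])

# Naive reference: one squared distance per (sample, centroid) pair
dist = np.empty((X.shape[0], y.shape[0]))
for i in range(X.shape[0]):
    for j in range(y.shape[0]):
        dist[i, j] = np.sum((X[i] - y[j]) ** 2)

print(dist)
# [[ 0. 11.]
#  [ 5.  2.]
#  [10.  1.]]
```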

I came up with two solutions:

Solution 1:

def dist_2(X,y):
    X_square_sum = np.sum(np.square(X), axis = 1)
    y_square_sum = np.sum(np.square(y), axis = 1)
    dot_xy = np.dot(X, y.T)
    X_square_sum_tile = np.tile(X_square_sum.reshape(-1, 1), (1, y.shape[0]))
    y_square_sum_tile = np.tile(y_square_sum.reshape(1, -1), (X.shape[0], 1))
    dist = X_square_sum_tile + y_square_sum_tile - (2 * dot_xy)
    return dist

dist = dist_2(X, y)
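As a side note, the np.tile calls are not strictly needed: NumPy broadcasting performs the same expansion implicitly and without materializing the tiled copies. A sketch of that variant (dist_2_broadcast is my name, not from the original):

```python
import numpy as np

def dist_2_broadcast(X, y):
    # ||x - c||^2 = ||x||^2 + ||c||^2 - 2 x.c, expanded via broadcasting
    X_square_sum = np.sum(np.square(X), axis=1)  # shape (n,)
    y_square_sum = np.sum(np.square(y), axis=1)  # shape (m,)
    # (n, 1) + (1, m) broadcasts to (n, m) without np.tile
    return X_square_sum[:, None] + y_square_sum[None, :] - 2 * np.dot(X, y.T)

X = np.array([[1, 2, 3], [1, 1, 1], [0, 2, 0]])
y = np.array([[1, 2, 3], [0, 1, 0]])
print(dist_2_broadcast(X, y))
```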

Solution 2:

import scipy.spatial.distance
dist = scipy.spatial.distance.cdist(X, y)**2
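Note that cdist can return squared distances directly via the 'sqeuclidean' metric, which skips both the square root inside cdist and the **2 afterwards:

```python
import numpy as np
from scipy.spatial.distance import cdist

X = np.array([[1, 2, 3], [1, 1, 1], [0, 2, 0]])
y = np.array([[1, 2, 3], [0, 1, 0]])

# 'sqeuclidean' computes squared Euclidean distances in one pass
dist = cdist(X, y, 'sqeuclidean')
print(dist)
```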

Performance (wall-clock time) of the two solutions:

import time
X = np.random.random((100000, 50))
y = np.random.random((100, 50))

start = time.time()
dist = scipy.spatial.distance.cdist(X,y)**2
end = time.time()
print (end - start)

Average elapsed wall-clock time = 0.7 sec

start = time.time()
dist = dist_2(X,y)
end = time.time()
print (end - start)

Average elapsed wall-clock time = 0.3 sec
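A side note on measurement: timeit gives more stable numbers than single time.time() deltas, since it repeats the call and lets you take the best of several runs. A sketch (dist_2 copied from above):

```python
import timeit
import numpy as np

def dist_2(X, y):  # as defined in the question
    X_square_sum = np.sum(np.square(X), axis=1)
    y_square_sum = np.sum(np.square(y), axis=1)
    dot_xy = np.dot(X, y.T)
    X_square_sum_tile = np.tile(X_square_sum.reshape(-1, 1), (1, y.shape[0]))
    y_square_sum_tile = np.tile(y_square_sum.reshape(1, -1), (X.shape[0], 1))
    return X_square_sum_tile + y_square_sum_tile - 2 * dot_xy

X = np.random.random((10000, 50))
y = np.random.random((100, 50))

# best of 3 runs of 10 calls each, reported in seconds per call
t = min(timeit.repeat(lambda: dist_2(X, y), number=10, repeat=3)) / 10
print(f"{t * 1e3:.2f} ms per call")
```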

Test on a large number of centroids

X = np.random.random((100000, 50))
y = np.random.random((1000, 50))

Average elapsed wall-clock time of "solution 1" = 50 sec (+ Memory issue)

Average elapsed wall-clock time of "solution 2" = 6 sec !!!
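One way to work around the memory issue (a sketch of my own, not part of either solution): process X in chunks so the intermediate (chunk, m) arrays stay small, while the final (n, m) result is still produced:

```python
import numpy as np

def dist_2_chunked(X, y, chunk=10000):
    # Same formula as dist_2, applied block-wise over rows of X;
    # temporaries are at most (chunk, m) instead of (n, m)
    out = np.empty((X.shape[0], y.shape[0]))
    y_square_sum = np.sum(np.square(y), axis=1)
    for start in range(0, X.shape[0], chunk):
        Xc = X[start:start + chunk]
        X_square_sum = np.sum(np.square(Xc), axis=1)
        out[start:start + chunk] = (X_square_sum[:, None]
                                    + y_square_sum
                                    - 2 * np.dot(Xc, y.T))
    return out
```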

Conclusion

It seems that "solution 1" is more efficient than "solution 2" with respect to the average elapsed wall-clock time (on small data sets), but inefficient with respect to memory.

Any suggestions?

1 Answer


This question is frequently asked in combination with nearest-neighbour search. If that is your use case, have a look at a k-d tree approach. This will be far more efficient than computing full Euclidean distance matrices, both in memory consumption and performance.
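A minimal sketch of the k-d tree route with scipy.spatial.cKDTree, assuming you only need each sample's nearest centroid rather than the full distance matrix (note that k-d trees pay off mainly in low dimensions):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X = rng.random((100000, 3))   # samples
y = rng.random((100, 3))      # centroids

tree = cKDTree(y)             # index the centroids once
d, idx = tree.query(X, k=1)   # Euclidean distance and index of nearest centroid
# square d if you need squared distances
```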

If this is not the case, here are some possible approaches. The first two come from the linked answers by Paul Panzer and Divakar. The third one uses Numba for JIT compilation. The main difference from the first two versions is the avoidance of temporary arrays.

Three approaches to calculating Euclidean distances

import numpy as np
import numba as nb

# @Paul Panzer
#https://stackoverflow.com/a/42994680/4045774
def outer_sum_dot_app(A,B):
    return np.add.outer((A*A).sum(axis=-1), (B*B).sum(axis=-1)) - 2*np.dot(A,B.T)

# @Divakar
#https://stackoverflow.com/a/42994680/4045774
def outer_einsum_dot_app(A,B):
    return np.einsum('ij,ij->i',A,A)[:,None] + np.einsum('ij,ij->i',B,B) - 2*np.dot(A,B.T)

@nb.njit()
def calc_dist(A, B, sqrt=False):
    # -2*A.B^T term; this array is reused in place for the final distances
    dist = np.dot(A, B.T)

    # Row-wise squared norms of A, accumulated without temporaries
    TMP_A = np.empty(A.shape[0], dtype=A.dtype)
    for i in range(A.shape[0]):
        acc = 0.
        for j in range(A.shape[1]):
            acc += A[i, j] ** 2
        TMP_A[i] = acc

    # Row-wise squared norms of B
    TMP_B = np.empty(B.shape[0], dtype=A.dtype)
    for i in range(B.shape[0]):
        acc = 0.
        for j in range(B.shape[1]):
            acc += B[i, j] ** 2
        TMP_B[i] = acc

    # Combine in place: ||a||^2 + ||b||^2 - 2*a.b (optionally take the root)
    if sqrt:
        for i in range(A.shape[0]):
            for j in range(B.shape[0]):
                dist[i, j] = np.sqrt(-2. * dist[i, j] + TMP_A[i] + TMP_B[j])
    else:
        for i in range(A.shape[0]):
            for j in range(B.shape[0]):
                dist[i, j] = -2. * dist[i, j] + TMP_A[i] + TMP_B[j]
    return dist
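As a quick sanity check (my addition, not part of the benchmark), the two NumPy variants agree with a brute-force reference up to floating-point cancellation error:

```python
import numpy as np

def outer_sum_dot_app(A, B):
    return np.add.outer((A * A).sum(axis=-1), (B * B).sum(axis=-1)) - 2 * np.dot(A, B.T)

def outer_einsum_dot_app(A, B):
    return (np.einsum('ij,ij->i', A, A)[:, None]
            + np.einsum('ij,ij->i', B, B) - 2 * np.dot(A, B.T))

A = np.random.randn(50, 7)
B = np.random.randn(20, 7)
# Brute-force reference via broadcasting: shape (50, 20)
ref = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)

print(np.allclose(outer_sum_dot_app(A, B), ref, atol=1e-8))
print(np.allclose(outer_einsum_dot_app(A, B), ref, atol=1e-8))
```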

Timings

A = np.random.randn(10000,3)
B = np.random.randn(10000,3)

#calc_dist:                      360ms first call excluded due to compilation overhead
#outer_einsum_dot_app (Divakar): 1150ms
#outer_sum_dot_app (Paul Panzer):1590ms
#dist_2:                         1840ms

A = np.random.randn(1000,100)
B = np.random.randn(1000,100)

#calc_dist:                      4.3  ms first call excluded due to compilation overhead
#outer_einsum_dot_app (Divakar): 13.12ms
#outer_sum_dot_app (Paul Panzer):13.2 ms
#dist_2:                         21.3ms