Alternatively, Numba in nopython parallel mode can be one of the fastest approaches. I compared Numba against cdist
on arrays of various sizes; both take roughly the same time (e.g. about 8 seconds each on my machine), though Numba may beat cdist
on smaller arrays:
import numpy as np
import numba as nb


@nb.njit("float64[:, ::1](float64[:, ::1], float64[:, ::1])", parallel=True)
def distances_numba(a, b):
    # Pairwise Euclidean distances between the rows of a and the rows of b
    arr = np.zeros((a.shape[0], b.shape[0]), dtype=np.float64)
    for i in nb.prange(a.shape[0]):
        for j in range(b.shape[0]):
            acc = 0.0
            for k in range(a.shape[1]):
                acc += (a[i, k] - b[j, k]) ** 2
            arr[i, j] = acc ** 0.5
    return arr
I did not compare memory consumption, but I expect Numba to be among the best in that regard as well.
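For a quick correctness check, a plain-NumPy broadcasting version of the same pairwise distance can serve as a reference (a minimal sketch; the array shapes here are made up for illustration):

```python
import numpy as np

# Pure-NumPy reference for the same row-wise Euclidean distances;
# handy for verifying distances_numba on small inputs.
def distances_numpy(a, b):
    diff = a[:, None, :] - b[None, :, :]      # shape (n_a, n_b, dim)
    return np.sqrt((diff ** 2).sum(axis=-1))  # shape (n_a, n_b)

rng = np.random.default_rng(0)
a = rng.random((5, 3))
b = rng.random((4, 3))
print(distances_numpy(a, b).shape)  # (5, 4)
```

Comparing the two implementations with np.allclose on small random inputs is an easy sanity test before benchmarking on large arrays.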
Update
We can parallelize max9111's answer, which was 10-20 times faster than cdist
or my first solution. Parallelization makes max9111's solution around 1.5 to 2 times faster. These benchmarks are based on some of my own tests and need further evaluation.
@nb.njit("float64[::1](float64[:, ::1])", parallel=True)
def first(A):
    TMP_A = np.zeros(A.shape[0], dtype=np.float64)
    for i in nb.prange(A.shape[0]):
        for j in range(A.shape[1]):
            TMP_A[i] += A[i, j] ** 2
    return TMP_A


@nb.njit("float64[::1](float64[:, ::1])", parallel=True)
def second(B):
    TMP_B = np.zeros(B.shape[0], dtype=np.float64)
    for i in nb.prange(B.shape[0]):
        for j in range(B.shape[1]):
            TMP_B[i] += B[i, j] ** 2
    return TMP_B


@nb.njit("float64[:, ::1](float64[:, ::1], float64[:, ::1])", parallel=True)
def calc_dist_p(A, B):
    dist = np.dot(A, B.T)
    TMP_A = first(A)
    TMP_B = second(B)
    for i in nb.prange(A.shape[0]):
        for j in range(B.shape[0]):
            dist[i, j] = (-2. * dist[i, j] + TMP_A[i] + TMP_B[j]) ** 0.5
    return dist
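The dot-product trick above relies on the identity ||a - b||² = ||a||² + ||b||² - 2·a·b. A small pure-NumPy sketch (with arbitrary example arrays, no Numba needed) shows that this formulation matches the direct difference-based computation:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((6, 3))
B = rng.random((4, 3))

# Dot-product formulation: -2*A.B^T plus the row-wise squared norms
dot = A @ B.T
sq_A = (A ** 2).sum(axis=1)
sq_B = (B ** 2).sum(axis=1)
dist_dot = np.sqrt(-2.0 * dot + sq_A[:, None] + sq_B[None, :])

# Direct formulation: subtract, square, sum, sqrt
diff = A[:, None, :] - B[None, :, :]
dist_direct = np.sqrt((diff ** 2).sum(axis=-1))

print(np.allclose(dist_dot, dist_direct))  # True
```

Note that for nearly coincident points the dot-product form can go slightly negative from rounding before the square root, which is worth clamping in production code.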