How to optimize calculating distances between all vectors of two matrices?

Question

I would like to calculate the distance between each row vector of a square matrix X and each row vector of a square matrix Y.

import numpy as np
from tqdm import tqdm

def euclidean_dist(x, y) -> float:
    return np.linalg.norm(x - y)

def dist(X, Y):

    def calc(y):
        def calc2(x):
            return euclidean_dist(x, y)
        return calc2

    distances = [np.apply_along_axis(calc(y), 1, X) for y in tqdm(Y)]
    return np.asarray(distances)

This works fine for small matrices, but for large ones it is terribly slow. For instance, for 14000 x 14000 matrices tqdm estimates about 2 hours.

size = 14000
X = np.random.rand(size,size)
Y = np.random.rand(size,size)
D = dist(X, Y)

How can I make it faster?

Dani Mesejo · Accepted Answer · 2020-04-17T10:27:25.350

You can use cdist from scipy.spatial.distance:

import numpy as np
from scipy.spatial.distance import cdist

size = 14000
X = np.random.rand(size, size)
Y = np.random.rand(size, size)

result = cdist(X, Y)

From the documentation:

Compute distance between each pair of the two collections of inputs.

It supports many distance metrics, but the default is Euclidean.
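As a minimal sketch of what that means in practice, the snippet below checks cdist against an explicit double loop on small random inputs and then selects a different metric by name. The array sizes and the seed are arbitrary choices for illustration. Note that cdist(X, Y)[i, j] is the distance from row i of X to row j of Y, which is the transpose of what the dist function in the question returns, since that one iterates over Y in the outer loop.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
X = rng.random((5, 3))          # 5 row vectors of length 3
Y = rng.random((4, 3))          # 4 row vectors of length 3

# Default metric is Euclidean; the result has shape (len(X), len(Y)).
D = cdist(X, Y)

# Reference implementation: explicit double loop over rows.
D_loop = np.array([[np.linalg.norm(x - y) for y in Y] for x in X])

# Another built-in metric, selected by name (Manhattan / L1 distance).
D_manhattan = cdist(X, Y, metric="cityblock")

print(D.shape)
print(np.allclose(D, D_loop))
```

The rows of X and Y must have the same length, but the two inputs do not need the same number of rows.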

A small snippet:

import numpy as np
from scipy.spatial.distance import cdist


coords = [(35.0456, -85.2672),
          (35.1174, -89.9711),
          (35.9728, -83.9422),
          (36.1667, -86.7833)]
result = cdist(coords, coords, 'euclidean')
print(result)

Output

[[0.         4.70444794 1.6171966  1.88558331]
 [4.70444794 0.         6.0892811  3.35605413]
 [1.6171966  6.0892811  0.         2.84770898]
 [1.88558331 3.35605413 2.84770898 0.        ]]
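If SciPy is not an option, the same Euclidean matrix can be sketched in plain NumPy using the identity ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x·y, which reduces most of the work to a single matrix product. This is a different technique from calling cdist, offered only as an assumption-laden alternative: the helper name pairwise_euclidean is hypothetical, and the expansion can lose precision when two rows are nearly identical, so small negative values are clamped to zero before taking the square root.

```python
import numpy as np

def pairwise_euclidean(X, Y):
    """Hypothetical helper: Euclidean distance between every row of X and every row of Y."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # Squared norms of each row, shaped for broadcasting to (len(X), len(Y)).
    x_sq = np.einsum("ij,ij->i", X, X)[:, None]
    y_sq = np.einsum("ij,ij->i", Y, Y)[None, :]
    # ||x||^2 + ||y||^2 - 2 x.y for all pairs at once.
    sq = x_sq + y_sq - 2.0 * (X @ Y.T)
    # Rounding can push tiny values below zero; clamp before the square root.
    np.maximum(sq, 0.0, out=sq)
    return np.sqrt(sq)

coords = [(35.0456, -85.2672),
          (35.1174, -89.9711),
          (35.9728, -83.9422),
          (36.1667, -86.7833)]
result = pairwise_euclidean(coords, coords)
print(np.round(result, 4))
```

On the coordinates above this should agree with the cdist output up to floating-point rounding; the diagonal in particular may come out as a very small positive number rather than exactly zero. For the 14000 x 14000 case in the question, keep in mind that the result alone is a 14000 x 14000 float64 array, roughly 1.6 GB, whichever method is used.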
