
The following Euclidean distance algorithm creates an MxM matrix of distances between the rows of an MxN input matrix (representing points in some N-dimensional space). The runtime of this algorithm scales as O(M^2). Can this be improved upon if one is only interested in the rows (i.e. points) that are closest to each other? (My downstream task consists of performing K-NN, amongst other things.)

import numpy as np


# 100 points in a 20-dimensional space
vectors = np.random.randn(100, 20)
m = vectors.shape[0]

# Pairwise Euclidean distances between all rows
distances = np.zeros([m, m])
for i in range(m):
    vec = vectors[i]
    distances[i] = [np.linalg.norm(vec - vectors[j]) for j in range(m)]
Atticus
  • This is probably the third time I've seen this question, but I think you're looking for a kd-tree, which finds nearest neighbors in O(N log N) on average (or possibly better). Here's the scipy implementation https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.KDTree.html, here's one of many Stack Overflow answers about kd-trees https://stackoverflow.com/questions/45127141/find-the-nearest-point-in-distance-for-all-the-points-in-the-dataset-python, and here's Wikipedia https://en.wikipedia.org/wiki/K-d_tree – Aeronautix Aug 02 '22 at 14:18
  • Thanks for the interesting reading. However, in my case I do need the distance matrix. I was wondering whether the fact that far-away points are of no relevance to me could be used to speed up the operation somehow. – Atticus Aug 02 '22 at 16:56
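
For reference, a minimal sketch of the KDTree approach suggested in the comment above (the choice of k=2 is illustrative: the first hit for each query point is the point itself at distance 0, so the second column gives the closest other point):

from scipy.spatial import KDTree
import numpy as np

vectors = np.random.randn(100, 20)

# Build the tree once, then query the 2 nearest neighbors of every point;
# the first hit is always the query point itself at distance 0
tree = KDTree(vectors)
dist, idx = tree.query(vectors, k=2)

nearest_dist = dist[:, 1]  # distance from each point to its closest other point
nearest_idx = idx[:, 1]    # index of that closest point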

1 Answer


I would suggest leveraging SciPy's condensed distance matrix instead of the for-loop of pairwise comparisons. In particular,

from scipy.spatial.distance import pdist, squareform
distances = squareform(pdist(vectors))

provides a ~85x speedup! The documentation can be found here.

Fundamentally, the complexity remains quadratic (you still need to compare every row of vectors with every other row). However, the implementation exploits symmetry and the fact that each element's distance to itself is 0, computing only the upper-triangular sub-matrix and then mirroring it along the diagonal to obtain the full square distance matrix.
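
As a minimal sketch of that relationship (shapes assume the 100x20 vectors array from the question):

from scipy.spatial.distance import pdist, squareform
import numpy as np

vectors = np.random.randn(100, 20)
m = vectors.shape[0]

# pdist returns the condensed form: only the m*(m-1)/2 upper-triangular entries
condensed = pdist(vectors)
assert condensed.shape == (m * (m - 1) // 2,)

# squareform mirrors them into the full m x m symmetric matrix
distances = squareform(condensed)
assert distances.shape == (m, m)
assert np.allclose(distances, distances.T)   # symmetric
assert np.allclose(np.diag(distances), 0.0)  # zero diagonal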

Your code ran in 71 ms while the SciPy version ran in 0.83 ms, hence the aforementioned ~85x speedup.

Regardless, if you want to run k-NN you might consider scikit-learn, where you can simply provide the vectors as X, as shown here.
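
A minimal sketch of that route, assuming scikit-learn's NearestNeighbors (the choice of n_neighbors=5 is illustrative):

from sklearn.neighbors import NearestNeighbors
import numpy as np

vectors = np.random.randn(100, 20)

# Fit on the points themselves and query their 5 nearest neighbors
nn = NearestNeighbors(n_neighbors=5, metric="euclidean")
nn.fit(vectors)

# distances and indices each have shape (100, 5);
# each point appears as its own nearest neighbor at distance 0
distances, indices = nn.kneighbors(vectors)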

7shoe