45

I have two arrays of x-y coordinates, and I would like to find the minimum Euclidean distance between each point in one array with all the points in the other array. The arrays are not necessarily the same size. For example:

xy1=numpy.array(
[[  243,  3173],
[  525,  2997]])

xy2=numpy.array(
[[ 682, 2644],
[ 277, 2651],
[ 396, 2640]])

My current method loops through each coordinate xy in xy1 and calculates the distances between that coordinate and the other coordinates.

mindist=numpy.zeros(len(xy1))
minid=numpy.zeros(len(xy1))

for i,xy in enumerate(xy1):
    dists=numpy.sqrt(numpy.sum((xy-xy2)**2,axis=1))
    mindist[i],minid[i]=dists.min(),dists.argmin()

Is there a way to eliminate the for loop and somehow do element-by-element calculations between the two arrays? I envision generating a distance matrix for which I could find the minimum element in each row or column.

Another way to look at the problem. Say I concatenate xy1 (length m) and xy2 (length p) into xy (length n), and I store the lengths of the original arrays. Theoretically, I should then be able to generate a n x n distance matrix from those coordinates from which I can grab an m x p submatrix. Is there a way to efficiently generate this submatrix?

divenex
  • 15,176
  • 9
  • 55
  • 55
fideli
  • 3,213
  • 5
  • 23
  • 24
  • 2
    If you need to speed up your code, you should remove the unnecessary numpy.sqrt (and only take the square root of the minimum squared distance when you have found it). – Eric O. Lebigot Dec 09 '09 at 10:03

9 Answers9

46

(Months later) scipy.spatial.distance.cdist( X, Y ) gives all pairs of distances, for X and Y 2 dim, 3 dim ...
It also does 22 different norms, detailed here .

# cdist example: (nx,dim) (ny,dim) -> (nx,ny)

from __future__ import division
import sys
import numpy as np
from scipy.spatial.distance import cdist

#...............................................................................
dim = 10
nx = 1000
ny = 100
metric = "euclidean"
seed = 1

    # change these params in sh or ipython: run this.py dim=3 ...
for arg in sys.argv[1:]:
    exec( arg )
np.random.seed(seed)
np.set_printoptions( 2, threshold=100, edgeitems=10, suppress=True )

title = "%s  dim %d  nx %d  ny %d  metric %s" % (
        __file__, dim, nx, ny, metric )
print "\n", title

#...............................................................................
X = np.random.uniform( 0, 1, size=(nx,dim) )
Y = np.random.uniform( 0, 1, size=(ny,dim) )
dist = cdist( X, Y, metric=metric )  # -> (nx, ny) distances
#...............................................................................

print "scipy.spatial.distance.cdist: X %s Y %s -> %s" % (
        X.shape, Y.shape, dist.shape )
print "dist average %.3g +- %.2g" % (dist.mean(), dist.std())
print "check: dist[0,3] %.3g == cdist( [X[0]], [Y[3]] ) %.3g" % (
        dist[0,3], cdist( [X[0]], [Y[3]] ))


# (trivia: how do pairwise distances between uniform-random points in the unit cube
# depend on the metric ? With the right scaling, not much at all:
# L1 / dim      ~ .33 +- .2/sqrt dim
# L2 / sqrt dim ~ .4 +- .2/sqrt dim
# Lmax / 2      ~ .4 +- .2/sqrt dim
denis
  • 21,378
  • 10
  • 65
  • 88
  • @denis cdist calculate distances between ALL pairs. How can I distance only between corresponding elements, for example, `[ dist(X[0],Y[0]), dist(X[1],Y[1]), ... dist(X[N],Y[N]) ]`, assuming `X` and `Y` are of same length `N`? – LWZ Jul 30 '13 at 00:24
  • 1
    @LWZ, just what you have -- `np.array([ dist( x, y ) for x, y in zip( X, Y )])` – denis Jul 30 '13 at 10:13
  • This works! And pretty fast. Just a note that the elements to calculate must be of length 2, otherwise python will raise an error. For a list of points in a contour detected by opencv2, I needed to use reshape function of numpy to reshape it first... – Jim Raynor Apr 30 '14 at 21:53
  • @denis: well e.g. the result of cv2.findContours is something like this: [ [[x1 y1]] [[x2 y2]].....[[xn yn]] ]. I don't know why they put the coordinates in 2 square bracket, but if you apply cdist directly to the list, each element will only have length 1 (a list, containing 1 list inside), I have to reshape it so the contour become the list of length-2 element (which means flat out the double bracket) – Jim Raynor May 01 '14 at 13:16
  • @Jim Raynor, right -- cdist expects arrays with ndim == 2, should check – denis May 01 '14 at 17:20
  • can you post an example of how to use it? there are none in the docs – endolith Jun 19 '14 at 03:50
31

To compute the m by p matrix of distances, this should work:

>>> def distances(xy1, xy2):
...   d0 = numpy.subtract.outer(xy1[:,0], xy2[:,0])
...   d1 = numpy.subtract.outer(xy1[:,1], xy2[:,1])
...   return numpy.hypot(d0, d1)

the .outer calls make two such matrices (of scalar differences along the two axes), the .hypot calls turns those into a same-shape matrix (of scalar euclidean distances).

Alex Martelli
  • 854,459
  • 170
  • 1,222
  • 1,395
  • I would go for cdist in this case, but +1'd and I've learnt from this solution cool stuff – eldad-a Feb 23 '14 at 10:16
  • @Alex: is there an efficient way to generalize this to more than two columns? – Mac Feb 17 '15 at 03:09
  • @Mac, see the accepted answer and `scipy.spatial.distance.cdist`. – Alex Martelli Feb 17 '15 at 14:50
  • @Alex: thanks, I did see that. I was rather hoping there'd be a way to generalize your code in particular though, so that I wouldn't have to add a dependency on SciPy just for that one function. Thanks anyway. – Mac Feb 17 '15 at 23:12
  • @Mac, not sure if I understood your question, but why not just doing np.hypot(np.hypot(d0, d1), d2), etc. (here d2 hardcoded, just for the sake of an example) for more dimensions? – Maciek Dec 07 '18 at 08:11
  • I tried various solutions, scipy.spatial.distance becomes fastest for long vectors of 3D points. – Maciek Dec 07 '18 at 19:27
17

The accepted answer does not fully address the question, which requests to find the minimum distance between the two sets of points, not the distance between every point in the two sets.

Although a straightforward solution to the original question indeed consists of computing the distance between every pair and subsequently finding the minimum one, this is not necessary if one is only interested in the minimum distances. A much faster solution exists for the latter problem.

All the proposed solutions have a running time that scales as m*p = len(xy1)*len(xy2). This is OK for small datasets, but an optimal solution can be written that scales as m*log(p), producing huge savings for large xy2 datasets.

This optimal execution time scaling can be achieved using scipy.spatial.KDTree as follows

import numpy as np
from scipy import spatial

xy1 = np.array(
    [[243,  3173],
     [525,  2997]])

xy2 = np.array(
    [[682, 2644],
     [277, 2651],
     [396, 2640]])

# This solution is optimal when xy2 is very large
tree = spatial.KDTree(xy2)
mindist, minid = tree.query(xy1)
print(mindist)

# This solution by @denis is OK for small xy2
mindist = np.min(spatial.distance.cdist(xy1, xy2), axis=1)
print(mindist)

where mindist is the minimum distance between each point in xy1 and the set of points in xy2

divenex
  • 15,176
  • 9
  • 55
  • 55
  • You are right, this should be the accepted answer. But I suppose it's more opaque than what is the accepted answer. Thanks for this one though! – Srikiran Apr 14 '23 at 05:06
5

For what you're trying to do:

dists = numpy.sqrt((xy1[:, 0, numpy.newaxis] - xy2[:, 0])**2 + (xy1[:, 1, numpy.newaxis - xy2[:, 1])**2)
mindist = numpy.min(dists, axis=1)
minid = numpy.argmin(dists, axis=1)

Edit: Instead of calling sqrt, doing squares, etc., you can use numpy.hypot:

dists = numpy.hypot(xy1[:, 0, numpy.newaxis]-xy2[:, 0], xy1[:, 1, numpy.newaxis]-xy2[:, 1])
Alok Singhal
  • 93,253
  • 21
  • 125
  • 158
  • Oh my, that's amazing. I did not realize that element-by-element could work that way too. So `xy1[:,0,numpy.newaxis]` effectively replaces my for loop by being a column vector, from which all the *x*-values of `xy2` are subtracted. Very cool, thank you. – fideli Dec 09 '09 at 04:55
  • Yes. For a more general and elegant method, see Alex's answer. – Alok Singhal Dec 09 '09 at 04:58
  • @fideli: help(numpy.subtract.outer) tells you that the numpy.newaxis trick of Alok is what is also at work in Alex's answer. – Eric O. Lebigot Dec 09 '09 at 15:12
5
import numpy as np
P = np.add.outer(np.sum(xy1**2, axis=1), np.sum(xy2**2, axis=1))
N = np.dot(xy1, xy2.T)
dists = np.sqrt(P - 2*N)
Maanasa Priya
  • 51
  • 2
  • 5
  • 1
    This works fantastically! This is the only solution that generalizes to an arbitrary number of dimensions. – gtmtg Nov 07 '17 at 19:10
0

I think the following function also works.

import numpy as np
from typing import Optional
def pairwise_dist(X: np.ndarray, Y: Optional[np.ndarray] = None) -> np.ndarray:
    Y = X if Y is None else Y
    xx = (X ** 2).sum(axis = 1)[:, None]
    yy = (Y ** 2).sum(axis = 1)[:, None]
    return xx + yy.T - 2 * (X @ Y.T)

Explanation
Suppose each row of X and Y are coordinates of the two sets of points.
Let their sizes be m X p and p X n respectively.
The result will produce a numpy array of size m X n with the (i, j)-th entry being the distance between the i-th row and the j-th row of X and Y respectively.

Max Wong
  • 694
  • 10
  • 18
0

I highly recommend using some inbuilt method for calculating squares, and roots for they are customized for optimized way to calculate and very safe against overflows.

@alex answer below is the most safest in terms of overflow and should also be very fast. Also for single points you can use math.hypot which now supports more than 2 dimensions.

>>> def distances(xy1, xy2):
...   d0 = numpy.subtract.outer(xy1[:,0], xy2[:,0])
...   d1 = numpy.subtract.outer(xy1[:,1], xy2[:,1])
...   return numpy.hypot(d0, d1)

Safety concerns

i, j, k = 1e+200, 1e+200, 1e+200
math.hypot(i, j, k)
# np.hypot for 2d points
# 1.7320508075688773e+200
np.sqrt(np.sum((np.array([i, j, k])) ** 2))
# RuntimeWarning: overflow encountered in square

overflow/underflow/speeds

eroot163pi
  • 1,791
  • 1
  • 11
  • 23
0

I think that the most straightforward and efficient solution is to do it like this:

distances = np.linalg.norm(xy1, xy2) # calculate the euclidean distances between the test point and the training features.
min_dist = numpy.min(dists, axis=1) # get the minimum distance 
min_id = np.argmi(distances) # get the index of the class with the minimum distance, i.e., the minimum difference.
Mostafa Wael
  • 2,750
  • 1
  • 21
  • 23
0

Although many answers here are great, there is another way which has not been mentioned here, using numpy's vectorization / broadcasting properties to compute the distance between each points of two different arrays of different length (and, if wanted, the closest matches). I publish it here because it can be very handy to master broadcasting, and it also solves this problem elengantly while remaining very efficient.

Assuming you have two arrays like so:

# two arrays of different length, but with the same dimension
a = np.random.randn(6,2)
b = np.random.randn(4,2)

You can't do the operation a-b: numpy complains with operands could not be broadcast together with shapes (6,2) (4,2). The trick to allow broadcasting is to manually add a dimension for numpy to broadcast along to. By leaving the dimension 2 in both reshaped arrays, numpy knows that it must perform the operation over this dimension.

deltas = a.reshape(6, 1, 2) - b.reshape(1, 4, 2)
# contains the distance between each points 
distance_matrix = (deltas ** 2).sum(axis=2)

The distance_matrix has a shape (6,4): for each point in a, the distances to all points in b are computed. Then, if you want the "minimum Euclidean distance between each point in one array with all the points in the other array", you would do :

distance_matrix.argmin(axis=1)

This returns the index of the point in b that is closest to each point of a.

Arthur Bricq
  • 316
  • 1
  • 8