
I have two lists of coordinates:

l1 = [[x,y,z],[x,y,z],[x,y,z],[x,y,z],[x,y,z]]
l2 = [[x,y,z],[x,y,z],[x,y,z]]

I want to find the shortest pairwise distance between l1 and l2. Distance between two coordinates is simply:

numpy.linalg.norm(l1_element - l2_element)

So how do I use numpy to efficiently apply this operation to each pair of elements?

TDN169

4 Answers


Here is a quick performance analysis of the four methods presented so far:

import numpy
import scipy
from itertools import product
from scipy.spatial.distance import cdist
from scipy.spatial import cKDTree as KDTree

n = 100
l1 = numpy.random.randint(0, 100, size=(n,3))
l2 = numpy.random.randint(0, 100, size=(n,3))

# by @Phillip
def a(l1,l2):
    return min(numpy.linalg.norm(l1_element - l2_element) for l1_element,l2_element in product(l1,l2))

# by @Kasra
def b(l1,l2):
    return numpy.min(numpy.apply_along_axis(
        numpy.linalg.norm,
        2,
        l1[:, None, :] - l2[None, :, :]
    ))

# mine
def c(l1,l2):
    return numpy.min(scipy.spatial.distance.cdist(l1,l2))

# just checking that numpy.min is indeed faster.
def c2(l1,l2):
    return min(scipy.spatial.distance.cdist(l1,l2).reshape(-1))

# by @BrianLarsen
def d(l1,l2):
    # make KDTrees for both sets of points
    t1 = KDTree(l1)
    t2 = KDTree(l2)
    # we need an upper bound on the distance; use real knowledge if you have it, otherwise guess
    maxD = numpy.linalg.norm(l1[0] - l2[0]) # this pair could be the closest, but anything further apart certainly is not
    # get a sparse matrix of all the distances

    ans = t1.sparse_distance_matrix(t2, maxD)

    # get the minimum distance and points involved
    minD = min(ans.values())
    return minD

for x in (a,b,c,c2,d):
    print("Timing variant", x.__name__, ':', flush=True)
    print(x(l1,l2), flush=True)
    %timeit x(l1,l2)
    print(flush=True)

For n=100

Timing variant a :
2.2360679775
10 loops, best of 3: 90.3 ms per loop

Timing variant b :
2.2360679775
10 loops, best of 3: 151 ms per loop

Timing variant c :
2.2360679775
10000 loops, best of 3: 136 µs per loop

Timing variant c2 :
2.2360679775
1000 loops, best of 3: 844 µs per loop

Timing variant d :
2.2360679775
100 loops, best of 3: 3.62 ms per loop

For n=1000

Timing variant a :
0.0
1 loops, best of 3: 9.16 s per loop

Timing variant b :
0.0
1 loops, best of 3: 14.9 s per loop

Timing variant c :
0.0
100 loops, best of 3: 11 ms per loop

Timing variant c2 :
0.0
10 loops, best of 3: 80.3 ms per loop

Timing variant d :
0.0
1 loops, best of 3: 933 ms per loop
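
(The `%timeit` lines only work in IPython/Jupyter. If you want to run the benchmark as a plain script, a rough equivalent using the standard `timeit` module could look like the sketch below; the repeat and number values are arbitrary.)

import timeit

for f in (a, b, c, c2, d):
    print("Timing variant", f.__name__, ':')
    print(f(l1, l2))
    # best of 3 runs, 10 calls each, roughly mirroring what %timeit reports
    best = min(timeit.repeat(lambda: f(l1, l2), repeat=3, number=10)) / 10
    print("{:.3g} s per loop".format(best))
    print()
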
PeterE
  • @nerdlyfe I do not understand. What do you mean? Of course this is not done on a list of strings. We are talking about the distances of points in a multi/three-dimensional space. – PeterE Jul 26 '21 at 06:39
  • Apologies, you are correct. How would you do it for this one? https://stackoverflow.com/questions/68479448/pairwise-distances-between-list-of-strings-rows-within-a-dataframe-column – nerdlyfe Jul 26 '21 at 07:09

Using newaxis and broadcasting, l1[:, None, :] - l2[None, :, :] is an array of the pairwise difference vectors. You can reduce this array to an array of norms using apply_along_axis and then take the min:

numpy.min(numpy.apply_along_axis(
    numpy.linalg.norm,
    2,
    l1[:, None, :] - l2[None, :, :]
))

Of course, this only works if l1 and l2 are numpy arrays, so if the lists in your question are real data rather than pseudo-code, you'll have to add l1 = numpy.array(l1); l2 = numpy.array(l2) first.
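
An alternative is to pass `axis=2` directly to `numpy.linalg.norm`, which avoids the Python-level loop inside `apply_along_axis`; a minimal sketch, assuming `l1` and `l2` are already numpy arrays:

import numpy

# pairwise difference vectors via broadcasting: shape (len(l1), len(l2), 3)
diff = l1[:, None, :] - l2[None, :, :]
# norms over the last axis give the full distance matrix; take its minimum
print(numpy.min(numpy.linalg.norm(diff, axis=2)))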

Phillip

You can use itertools.product to get all combinations, then use min:

import numpy
from itertools import product

# convert the nested lists to arrays so the element-wise subtraction works
l1 = numpy.array([[x,y,z],[x,y,z],[x,y,z],[x,y,z],[x,y,z]])
l2 = numpy.array([[x,y,z],[x,y,z],[x,y,z]])

min(numpy.linalg.norm(l1_element - l2_element) for l1_element, l2_element in product(l1, l2))
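
For illustration, here is the same approach run on small made-up arrays (the coordinates are just assumed sample values). Note that this is the pure-Python loop benchmarked as variant a above, so it becomes slow for large lists:

import numpy
from itertools import product

l1 = numpy.array([[0, 0, 0], [4, 5, 6], [7, 6, 7]])
l2 = numpy.array([[1, 0, 0], [10, 15, 16]])

# evaluate the norm for every (l1_element, l2_element) pair and keep the smallest
print(min(numpy.linalg.norm(a - b) for a, b in product(l1, l2)))  # -> 1.0
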
Mazdak

If you have many, many points, this is a great use case for a KDTree. It is totally overkill for this example, but it is a good learning experience: it is really fast for a certain class of problems, and it can also give a bit more information, such as the number of points within a certain distance.

import numpy as np
from scipy.spatial import cKDTree as KDTree

#sample data
l1 = [[0,0,0], [4,5,6], [7,6,7], [4,5,6]]
l2 = [[100,3,4], [1,0,0], [10,15,16], [17,16,17], [14,15,16], [-34, 5, 6]]
# make them arrays
l1 = np.asarray(l1)
l2 = np.asarray(l2)
# make KDTrees for both sets of points
t1 = KDTree(l1)
t2 = KDTree(l2)
# we need an upper bound on the distance; use real knowledge if you have it, otherwise guess
maxD = np.linalg.norm(l1[-1] - l2[-1]) # this pair could be the closest, but anything further apart certainly is not
# get a sparse matrix of all the distances
ans = t1.sparse_distance_matrix(t2, maxD)
# get the minimum distance and points involved
minA = min([(i, k) for k, i in ans.items()])
print("Minimum distance is {0} between l1={1} and l2={2}".format(minA[0], l1[minA[1][0]], l2[minA[1][1]]))

What this does is build a KDTree for each set of points, find all the distances between points that are within the guessed distance, and give back the minimum distance and the points involved. This post has a writeup of how a KDTree works.
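
If all you need is the single closest pair, `cKDTree.query` is a simpler route than building the full sparse distance matrix; a minimal sketch, reusing the `t2`, `l1` and `l2` defined above:

# query the l2 tree with every point of l1; k=1 returns each point's nearest neighbour
dists, idx = t2.query(l1, k=1)
i = dists.argmin()  # position in l1 of the overall closest pair
print("Minimum distance is {0} between l1={1} and l2={2}".format(dists[i], l1[i], l2[idx[i]]))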

Brian Larsen
  • Two points: 1) your second-to-last line is quite, well, bad. Building a list over all (considered) distances is quite inefficient at this point. 2) Even without that it appears to be several orders of magnitude slower than `scipy.spatial.distance.cdist`. I don't know why; in theory your method sounds quite interesting and efficient to me. PS: I used up to 10000 points per list. – PeterE Apr 13 '15 at 18:55
  • 1) Agreed, I had all sorts of issues with an argmin of the result and which key it went with; some better thinking would help a lot here. 2) I found that too: it was faster than variant b in the timing post, but not as fast as some. Perhaps 3D is not enough dimensions to make this method better. (Or maybe it never is...) – Brian Larsen Apr 13 '15 at 20:00
  • Also I bet that building the KDTree is the slow part. Values can be appended cheaply, so that might be an advantage for that use case. – Brian Larsen Apr 13 '15 at 20:12
  • I profiled the function: it appears that the min call (in my version of your algorithm) is the most expensive part; 50% of the total is spent there. 25% is used to update a dict (from the scipy docs on cKDTree I infer that the dict is only used to return the sparse_distance_matrix). So approx. 75% of the total run time could probably be avoided by a better implementation of cKDTree. But still: with n=10 000, `numpy.min(scipy.spatial.distance.cdist(l1,l2))` takes about 1 second, while the cKDTree one takes 260 seconds ... – PeterE Apr 13 '15 at 22:15