
I have two numpy arrays x and y containing float values. For each value in x, I want to find the closest element in y, without reusing elements from y. The output should be a 1-1 mapping of indices of elements from x to indices of elements from y. Here's a bad way to do it that relies on sorting. It removes each element that was paired from the list. Without sorting this would be bad because the pairing would depend on the order of the original input arrays.

import operator
import numpy as np

def min_i(values):
    min_index, min_value = min(enumerate(values),
                               key=operator.itemgetter(1))
    return min_index, min_value

# unsorted elements
unsorted_x = np.random.randn(10) * 10
unsorted_y = np.random.randn(10) * 10

# sort lists
x = np.sort(unsorted_x)
y = np.sort(unsorted_y)

pairs = []
indx_to_search = list(range(len(y)))

for x_indx, x_item in enumerate(x):
    if len(indx_to_search) == 0:
        print("ran out of items to match...")
        break
    # until a match is found, look for the closest item
    possible_values = y[indx_to_search]
    nearest_indx, nearest_item = min_i(possible_values)
    orig_indx = indx_to_search[nearest_indx]
    # remove it
    indx_to_search.remove(orig_indx)
    pairs.append((x_indx, orig_indx))

print("paired items: ")
for k, v in pairs:
    print(x[k], " paired with ", y[v])

I'd prefer to do it without sorting the elements first, but if they are sorted, then I want to get back the indices into the original, unsorted lists unsorted_x and unsorted_y. What's the best way to do this in numpy/scipy/Python, or using pandas? Thanks.
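(For illustration, recovering the original indices after sorting can be done with `np.argsort`; a minimal sketch with made-up values:)

```python
import numpy as np

unsorted_y = np.array([3.0, -1.0, 2.0])
order = np.argsort(unsorted_y)        # indices that would sort the array
y = unsorted_y[order]                 # sorted copy

# an index i into the sorted array maps back via order[i]
i_sorted = 0                          # position of the smallest element in sorted y
i_orig = order[i_sorted]              # index of that element in unsorted_y
print(i_orig, unsorted_y[i_orig])     # -> 1 -1.0
```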

edit: to clarify, I'm not trying to find the best fit across all elements (not minimizing the sum of distances, for example) but rather the best fit for each element, and it's okay if that sometimes comes at the expense of other elements. I assume that y is generally much larger than x, contrary to the example above, so there are usually many very good fits for each value of x in y, and I just want to find that one efficiently.

Can someone show an example of scipy's kd-trees for this? The docs are quite sparse.

kdtree = scipy.spatial.cKDTree([x,y])
kdtree.query([-3]*10) # ?? unsure about what query takes as arg
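(For reference, `cKDTree` expects an `(n, m)` array of `n` points in `m` dimensions, so 1-D data needs a trailing axis, and `query` takes the points to look up. A minimal sketch of the basic call, with made-up values:)

```python
import numpy as np
from scipy.spatial import cKDTree

y = np.array([0.0, 1.0, 2.5, 4.0])
tree = cKDTree(y[:, None])            # n points in 1 dimension -> shape (n, 1)

x = np.array([1.2, 3.9])
dist, idx = tree.query(x[:, None])    # nearest neighbor in y for each point of x
print(idx)                            # -> [1 3]
```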
  • I think a sort with a binary search to find the index is probably your best bet. – mgilson Mar 12 '13 at 14:09
  • @mgilson: are there built in binary search algos in scipy/numpy? –  Mar 12 '13 at 14:10
  • Yep: [numpy.searchsorted](http://docs.scipy.org/doc/numpy/reference/generated/numpy.searchsorted.html) – mgilson Mar 12 '13 at 14:11
  • Seems to me like you'll need a combo of `np.sort`, `np.argsort` and `np.searchsorted` to accomplish it. – mgilson Mar 12 '13 at 14:13
  • what is the expected answer for `(5, 10), (6, 0)`? What is the expected answer for `(10, 5), (6,0)`? – Robᵩ Mar 12 '13 at 14:18
  • I suggest `scipy.spatial.cKDTree` (or just KDTree with older scipy). Of course argsorting does it too if you take care. – seberg Mar 12 '13 at 14:19
  • @seberg: could you show an example of `cKDTree` usage? –  Mar 12 '13 at 14:19
  • @Robᵩ: 5 should be paired with 6, 10 should be paired with 0. I see what you're getting at. I'm not trying to find the best fit across all elements (not minimizing sum of distances for example) but rather the best fit for each element, and it's okay if it's sometimes at the expense of other elements. I forgot to add that one of my lists is much larger than the other, so in general there is a best fit for each element. assume `y` is much larger than `x` and you have to find a match for each `x` in `y` –  Mar 12 '13 at 14:24
  • @Robᵩ: It shouldn't, I was wrong, but I edited my answer. Basically, you can assume that there are multiple good fits for each value of x in y, but you are of course right that there will be these tradeoffs. –  Mar 12 '13 at 14:27
  • @user248237dfsf well I only skimmed it first, but cKDTree you can use to find the k-nearest neighbours of each point in one for each point in the other conveniently. To avoid duplicate pairings, you would have to postprocess, but it might be nicer to have a nearest neighbour function as opposed to sorting. – seberg Mar 12 '13 at 14:30
  • @seberg How would you go about getting `cKDTree` to compute distances between elements in two different sets? It seems to only allow computing the distances between the elements in a single set. – Jaime Mar 12 '13 at 15:11
  • @Jaime, not sure what you mean, you can get the k-nearest neighbours for points outside the set you query with it. `tree = KDTree(x[:,None]); tree.query(y[:,None], k=1)` finds the nearest `x` for all `y` (based on quadratic norm, you can change that). – seberg Mar 12 '13 at 15:25
  • @seberg Ah, that's nice. But since you are going to have to loop over all the items of `x` to discard duplicate neighbors, I think it is better to calculate distances to an ever smaller `y` array. The `KDTree` approach could work faster if you can come up with a value of `k`, smaller than the obvious `k=len(x)`, that guarantees you will not miss a non-duplicate neighbor for any of the `x`. – Jaime Mar 12 '13 at 16:20
  • related: [Millions of 3D points: How to find the 10 of them closest to a given point?](http://stackoverflow.com/q/2486093/4279) – jfs Mar 26 '14 at 20:39

1 Answer


EDIT 2 A solution using KDTree can perform very well if you can choose a number of neighbors that guarantees that you will have a unique neighbor for every item in your array. With the following code:

import numpy as np
import scipy.spatial

def nearest_neighbors_kd_tree(x, y, k):
    x, y = map(np.asarray, (x, y))
    tree = scipy.spatial.cKDTree(y[:, None])
    # for each x, the indices of its k nearest neighbors in y, closest first
    ordered_neighbors = tree.query(x[:, None], k)[1]
    nearest_neighbor = np.empty((len(x),), dtype=np.intp)
    nearest_neighbor.fill(-1)           # -1 marks items that found no unused neighbor
    used_y = set()
    for j, neigh_j in enumerate(ordered_neighbors):
        for yi in neigh_j:
            if yi not in used_y:
                nearest_neighbor[j] = yi
                used_y.add(yi)
                break
    return nearest_neighbor

and a sample of n=1000 points, I get:

In [9]: np.any(nearest_neighbors_kd_tree(x, y, 12) == -1)
Out[9]: True

In [10]: np.any(nearest_neighbors_kd_tree(x, y, 13) == -1)
Out[10]: False

So the optimum is k=13, and then the timing is:

In [11]: %timeit nearest_neighbors_kd_tree(x, y, 13)
100 loops, best of 3: 9.26 ms per loop

But in the worst case, you could need k=1000, and then:

In [12]: %timeit nearest_neighbors_kd_tree(x, y, 1000)
1 loops, best of 3: 424 ms per loop

Which is slower than the other options:

In [13]: %timeit nearest_neighbors(x, y)
10 loops, best of 3: 60 ms per loop

In [14]: %timeit nearest_neighbors_sorted(x, y)
10 loops, best of 3: 47.4 ms per loop

EDIT Sorting the array before searching pays off for arrays of more than 1000 items:

def nearest_neighbors_sorted(x, y):
    x, y = map(np.asarray, (x, y))
    y_idx = np.argsort(y)
    y = y[y_idx]                        # sorted copy; y_idx maps back to original indices
    nearest_neighbor = np.empty((len(x),), dtype=np.intp)
    for j, xj in enumerate(x):
        idx = np.searchsorted(y, xj)
        # searchsorted gives the insertion point; step back if the
        # element on the left is closer (or if we ran off the end)
        if idx == len(y) or (idx != 0 and y[idx] - xj > xj - y[idx - 1]):
            idx -= 1
        nearest_neighbor[j] = y_idx[idx]
        y = np.delete(y, idx)           # remove the match so it can't be reused
        y_idx = np.delete(y_idx, idx)
    return nearest_neighbor
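The `searchsorted` correction step above (stepping back one slot when the left neighbor is closer) can be checked in isolation; a small sketch with made-up values:

```python
import numpy as np

y = np.array([0.0, 2.0, 10.0])    # sorted
xj = 2.9
idx = np.searchsorted(y, xj)      # insertion point: 2 (between 2.0 and 10.0)
# step back if the element on the left is closer (or if we're past the end)
if idx == len(y) or (idx != 0 and y[idx] - xj > xj - y[idx - 1]):
    idx -= 1
print(idx, y[idx])                # -> 1 2.0
```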

With 10000-element arrays:

In [2]: %timeit nearest_neighbors_sorted(x, y)
1 loops, best of 3: 557 ms per loop

In [3]: %timeit nearest_neighbors(x, y)
1 loops, best of 3: 1.53 s per loop

For smaller arrays it performs slightly worse.


You are going to have to loop over all your items to implement your greedy nearest neighbor algorithm, if only to discard duplicates. With that in mind, this is the fastest I have been able to come up with:

def nearest_neighbors(x, y):
    x, y = map(np.asarray, (x, y))
    y = y.copy()                         # y shrinks as matches are removed
    y_idx = np.arange(len(y))            # maps positions in y back to original indices
    nearest_neighbor = np.empty((len(x),), dtype=np.intp)
    for j, xj in enumerate(x):
        idx = np.argmin(np.abs(y - xj))  # closest remaining element of y
        nearest_neighbor[j] = y_idx[idx]
        y = np.delete(y, idx)
        y_idx = np.delete(y_idx, idx)
    return nearest_neighbor

And now with:

n = 1000
x = np.random.rand(n)
y = np.random.rand(2*n)

I get:

In [11]: %timeit nearest_neighbors(x, y)
10 loops, best of 3: 52.4 ms per loop
– Jaime
  • thanks. Is there a way to do it without duplicates using `cKDTree`? Even with slight performance hit? –  Mar 12 '13 at 17:39
  • another question: is there a way to ensure that `np.argmin(np.abs(y - xj))` will ignore missing values like NaN? Is there ever a case where it will pick those? –  Mar 12 '13 at 18:41
  • [np.nanargmin](http://docs.scipy.org/doc/numpy/reference/generated/numpy.nanargmin.html) is what you want. – denis Dec 29 '13 at 14:37
  • Is this method also possible with multidimensional points?, because i always get the error : Buffer has wrong number of dimensions (expected 2, got 3) in the line "tree =scipy.spatial.cKDTree(y[:, None])" – Varlor Oct 30 '18 at 09:38
  • @jaime all 3 return errors in 2022: nearest_neighbors_kd_tree: ckdtree.pyx in scipy.spatial.ckdtree.cKDTree.__init__() ValueError: data must be 2 dimensions; nearest_neighbors(x, y): ValueError: operands could not be broadcast together with shapes (127,) (2,); nearest_neighbors_sorted(x, y): ValueError: object too deep for desired array – openSourcerer Apr 20 '22 at 16:22
  • OK, removing the "[:, None]"s from the KDTree solution runs without errors. However, there's no way to avoid duplicates without running the whole thing. On the full dataset, I hit memory issues: MemoryError: Unable to allocate 155. GiB for an array with shape (144008, 144008) and data type float64 – openSourcerer Apr 21 '22 at 14:10
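(Following up on the last two comments: for multidimensional points, the same greedy kd-tree idea should work if you pass the points as an `(n, m)` array directly, without adding a trailing axis. A sketch under that assumption; the sample points and `k` are made up:)

```python
import numpy as np
from scipy.spatial import cKDTree

def nearest_neighbors_kd_tree_nd(x, y, k):
    """Greedy unique nearest neighbors for (n, m) arrays of m-dimensional points."""
    x, y = np.asarray(x), np.asarray(y)    # already 2-D, so no [:, None] needed
    tree = cKDTree(y)
    ordered_neighbors = tree.query(x, k)[1]
    if k == 1:                             # query returns shape (n,) when k == 1
        ordered_neighbors = ordered_neighbors[:, None]
    nearest_neighbor = np.full(len(x), -1, dtype=np.intp)
    used_y = set()
    for j, neigh_j in enumerate(ordered_neighbors):
        for yi in neigh_j:
            if yi not in used_y:
                nearest_neighbor[j] = yi
                used_y.add(yi)
                break
    return nearest_neighbor

x = np.array([[0.0, 0.0], [1.0, 1.0]])
y = np.array([[0.1, 0.0], [0.9, 1.1], [5.0, 5.0]])
print(nearest_neighbors_kd_tree_nd(x, y, k=2))   # -> [0 1]
```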