
If I have two arrays:

X = np.random.rand(10000,2)
Y = np.random.rand(10000,2)

How can I, for each point in X, find out which point in Y is closest to it, so that in the end I have an array showing:

x1_index   y_index_of_closest
   1               7
   2               54
   3               3
  ...             ...

I want to do this using both columns of X, comparing each point against each column and value in Y.

ishido

2 Answers


This question is pretty popular. Since similar questions keep getting closed and linked here, I think it's worth pointing out that even though the existing answers are quite fast for thousands of data points, they start to break down after that. My potato segfaults at 10k items in each array.

The potential problem with the other answers is the algorithmic complexity. They compare everything in X to everything in Y. To get around that, at least on average, we need a better strategy for ruling out some of the things in Y.

In one dimension this is easy -- just sort everything and start popping out nearest neighbors. In two dimensions there are a variety of strategies, but KD-trees are reasonably popular and are already implemented in the scipy stack. On my machine, there's a crossover between the various methods around the point where each of X and Y has 6k things in it.
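Here's a rough sketch of that one-dimensional trick -- one sort of the reference points plus np.searchsorted -- assuming x and y are 1-D arrays:

import numpy as np

x = np.random.rand(10000)              # 1-D query points
y = np.random.rand(10000)              # 1-D reference points

order = np.argsort(y)                  # sort the reference points once
y_sorted = y[order]

# the insertion position splits y_sorted into the two nearest candidates
pos = np.searchsorted(y_sorted, x)
left = np.clip(pos - 1, 0, len(y_sorted) - 1)
right = np.clip(pos, 0, len(y_sorted) - 1)

# keep whichever candidate is closer, then map back to the original indices
take_right = np.abs(y_sorted[right] - x) < np.abs(y_sorted[left] - x)
y_index_of_closest = order[np.where(take_right, right, left)]

In two dimensions, scipy's KD-tree does that pruning for us: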

from scipy.spatial import KDTree

tree = KDTree(Y)
neighbor_dists, neighbor_indices = tree.query(X)

The extremely poor performance of scipy's KDTree implementation has been a sore spot of mine for a while, especially with as many things as have been built on top of it. There are probably data sets where it performs well, but I haven't seen one yet.

If you don't mind an extra dependency, you can get a 1000x speed boost just by switching your KDTree library. The package pykdtree is pip-installable, and I pretty much guarantee the conda packages work fine too. With this approach, my used, budget chromebook can process X and Y with 10 million points each in barely 30 seconds. That beats segfaulting at 10 thousand points with a wall time ;)

from pykdtree.kdtree import KDTree

tree = KDTree(Y)
neighbor_dists, neighbor_indices = tree.query(X)
Hans Musgrave
  • `pykdtree` is awesome, you saved my day! For anyone who is interested, I have two 2D arrays of 32m (A) and 200k (B) each, I need to find the closest data point in A for all elements in B, I tried the scipy and scikit-learn KDTree approach in @Daniel's answer, they turn out to be unacceptably slow, but the calculation is done in less than one second with `pykdtree`. I am not an expert of KDTree, but this feels magic! Here is what I found, `pykdtree` > `scikit-learn KDtree` (Hangs at (32m, 200k) array sizes) > `scipy.spatial.distance.cdist` (MemoryError at (300k, 200k) array sizes). – zyxue Sep 19 '18 at 01:46

This has to be the most asked numpy question (I've answered it myself twice in the last week), but since it can be phrased a million ways:

import numpy as np
from scipy.spatial.distance import cdist

def withScipy(X, Y):  # faster
    # rows of the cdist matrix correspond to X, so take the argmin across Y (axis=1)
    return np.argmin(cdist(X, Y, 'sqeuclidean'), axis=1)

def withoutScipy(X, Y):  # slower, using broadcasting
    # the broadcast difference has shape (len(Y), len(X), 2), so the argmin over axis=0 picks a Y for each X
    return np.argmin(np.sum((X[None, :, :] - Y[:, None, :])**2, axis=-1), axis=0)

There's also a numpy-only method using einsum that's faster than my function (but not cdist) but I don't understand it well enough to explain it.

EDIT += 21 months:

The best way to do this algorithmically though is with KDTree.

from sklearn.neighbors import KDTree 
# since the sklearn implementation allows return_distance = False, saving memory

y_tree = KDTree(Y)
y_index_of_closest = y_tree.query(X, k = 1, return_distance = False)
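Note that with k = 1 sklearn returns the indices with shape (n, 1), so you may want to flatten them:

y_index_of_closest = y_index_of_closest.ravel()   # 1-D array of indices into Y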

@HansMusgrave has a pretty good speedup for KDTree above.

And for completion's sake, the np.einsum answer, which I now understand:

np.argmin(                                      #  (X - Y) ** 2 
    np.einsum('ij, ij ->i', X, X)[:, None] +    # = X ** 2        \
    np.einsum('ij, ij ->i', Y, Y)          -    # + Y ** 2        \
    2 * X.dot(Y.T),                             # - 2 * X * Y
    axis = 1)

@Divakar explains this method well on the wiki page of his package eucl_dist

Daniel F
  • yes again... an einsum approach is posted here, and more detailed versions will bubble to the top using einsum and numpy in a keyword search – NaN Dec 12 '16 at 19:58
  • @DanielF Sorry to keep beating a dead horse. I've always found scipy (hence scikit-learn as well) to have a pretty awful KDTree for most problems. On my machine, I get a 1000x speed boost just by switching the library to pykdtree. I added the info to my answer. Do you think it's worth having here for completeness since you have the most votes? – Hans Musgrave Sep 18 '18 at 05:12
  • @Hans, no your answer is pretty well self-contained. I'll throw you an upvote though to get some more visibility on it. – Daniel F Sep 18 '18 at 05:47