
I have two arrays like

a = array([[x0, y0], [x1, y1], ..., [xn, yn]])
b = array([[u0, v0], [u1, v1], ..., [un, vn]])

i.e., for each (x0, y0) in a, I need to find its closest counterpart (by Euclidean distance, say) in b, and I need to do this for all elements in a. What's the fastest way to do this in Python?
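For reference, the naive fully vectorized baseline (a minimal sketch using `scipy.spatial.distance.cdist`; it materializes the full pairwise distance matrix, so it is quadratic in time and memory, which becomes a problem at the sizes discussed in the comments below):

```python
import numpy as np
from scipy.spatial.distance import cdist

# Hypothetical toy inputs in the question's (n, 2) layout.
a = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])
b = np.array([[0.1, -0.2], [2.0, 0.0], [1.1, 1.1]])

d = cdist(a, b)             # shape (len(a), len(b)), Euclidean by default
nearest = d.argmin(axis=1)  # index into b of the closest point for each row of a
closest = b[nearest]
```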

zyxue
  • Are they arrays of tuples? – DYZ Sep 17 '18 at 04:40
  • No, just 2-D numpy arrays – zyxue Sep 17 '18 at 04:43
  • Then you should use a different kind of bracket. – DYZ Sep 17 '18 at 04:44
  • here too: https://stackoverflow.com/questions/49467836/finding-the-closest-point-of-an-array-using-numpy-linarg-norm-function – Reblochon Masque Sep 17 '18 at 04:45
  • Is there an algorithmically better way to do this, though? In one dimension, you can get log-linear time with a sort. All of the 2D answers I saw just rely on vectorization and hope that the quadratic runtime won't backfire. – Hans Musgrave Sep 17 '18 at 04:47
  • There is also a memory issue: my two arrays are about 35k and 2M long. – zyxue Sep 17 '18 at 04:57
  • @zyxue Yeah, mine runs out of RAM and segfaults when I run the code in the other answer. I came up with a log-linear solution that's linear in memory. The crossover point is around 6k items in each list on my machine, and it'll still take a while, but it's leaps and bounds better for big arrays. Give me a few minutes and I'll answer the other question that this got linked to as a duplicate. – Hans Musgrave Sep 17 '18 at 04:59
  • I would arrange a coarse 2D grid and assign the points to the grid cells. This would hopefully limit the search area, because the closest neighbor must be in the same or in one of the neighboring grid cells. I think this is closer to log-linear than to quadratic. – DYZ Sep 17 '18 at 05:04
  • @DYZ Mmhmm. Not sure if any of these can get log-linear in the worst case, but log-linear on average could be done lots of different ways. I went with a KD-tree since it's implemented in scipy already (though I think their implementation is a bit slow). – Hans Musgrave Sep 17 '18 at 05:51
  • There is an algorithmically better solution, @HansMusgrave. But rather than get this question reopened, and since I already have the highest-rated answer in the duplicate, I'll just add it there. – Daniel F Sep 17 '18 at 06:09
  • @DanielF Sounds good. I had already answered in the duplicate by the way. Do you think that's fine? – Hans Musgrave Sep 17 '18 at 06:15
  • @HansMusgrave No problem, you didn't do KDTree, which is the best answer IMO – Daniel F Sep 17 '18 at 06:17
  • @DanielF Oh, but I did do KDTree. It only took a couple of lines. I just added a 1D solution as well, which has guaranteed asymptotic behavior and is thousands of times faster than applying a KDTree to 1D data. – Hans Musgrave Sep 17 '18 at 06:24
  • @HansMusgrave Oops, missed that. I did do it a bit differently than you, using `sklearn`'s implementation so I don't have to save the distances. – Daniel F Sep 17 '18 at 06:54
  • [Canonical Duplicate](https://stackoverflow.com/questions/52366421/how-to-do-n-d-distance-and-nearest-neighbor-calculations-on-numpy-arrays) – Daniel F Sep 18 '18 at 10:37
  • I confirm @HansMusgrave's method with `pykdtree` is superior: `pykdtree` > scikit-learn `KDTree` (hangs at (32m, 200k) array sizes) > `scipy.spatial.distance.cdist` (MemoryError at (300k, 200k) array sizes). – zyxue Sep 19 '18 at 01:50
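
A minimal sketch of the KD-tree approach Hans Musgrave and Daniel F settle on above, using `scipy.spatial.cKDTree` (build the tree on the large array once, then query it with the small one; the sizes mirror the 35k/2M shapes from the comments):

```python
import numpy as np
from scipy.spatial import cKDTree

a = np.random.rand(35_000, 2)     # query points
b = np.random.rand(2_000_000, 2)  # reference points

tree = cKDTree(b)          # build once, roughly O(m log m)
dist, idx = tree.query(a)  # nearest neighbor in b for each row of a
closest = b[idx]
```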
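The same query with the two alternatives named in the comments: scikit-learn's `NearestNeighbors` (`return_distance=False` matches Daniel F's remark about not saving the distances) and `pykdtree`, which zyxue reports as fastest. Treat both as sketches of those libraries' query APIs:

```python
# scikit-learn variant
from sklearn.neighbors import NearestNeighbors

nn = NearestNeighbors(n_neighbors=1).fit(b)
idx = nn.kneighbors(a, return_distance=False).ravel()

# pykdtree variant
from pykdtree.kdtree import KDTree

tree = KDTree(b)
dist, idx = tree.query(a, k=1)  # returns distances and indices into b
```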
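And a sketch of the 1-D trick Hans Musgrave alludes to (sort one array, then binary-search the other with `np.searchsorted`; log-linear overall, with no tree at all). Here `a1` and `b1` are hypothetical 1-D float arrays, with `b1` holding at least two points:

```python
import numpy as np

order = np.argsort(b1)
bs = b1[order]                      # b1 in sorted order

pos = np.clip(np.searchsorted(bs, a1), 1, len(bs) - 1)
left, right = bs[pos - 1], bs[pos]  # sorted neighbors on either side
idx_sorted = np.where(a1 - left <= right - a1, pos - 1, pos)
nearest = order[idx_sorted]         # indices into the original b1
```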

0 Answers