I have an array of latitudes and longitudes. The task is to find the 5 nearest coordinates for each coordinate, without looping over every other coordinate each time.
- 2-dimensional array? What? Maybe you mean a 1-dim array of points in R^2? – SashaMN Jan 18 '16 at 21:13
- Insert all elements into a [binary heap](https://en.wikipedia.org/wiki/Binary_heap) using the Euclidean distance as the comparator. Grab the top n elements. – OneCricketeer Jan 18 '16 at 21:14
- Have a look at https://en.m.wikipedia.org/wiki/Binary_search_tree; BSTs are very efficient. – osanger Jan 18 '16 at 21:25
- Step 1: calculate all the distances; step 2: sort based on distance; step 3: take the top 5 results. To optimize your approach you could combine steps 1 and 2 and use trees or heaps to store your data. – Marko Jan 18 '16 at 21:31
- @Marko no need to sort to find the top-5 values. And if I understand correctly, this is a slow solution, because you can't update your heap faster than in O(n) time. – SashaMN Jan 18 '16 at 21:32
- @SashaMN updated the comment; it's a starting point. – Marko Jan 18 '16 at 21:33
- @Marko I think your solution is simply wrong. – SashaMN Jan 18 '16 at 21:35
- @SashaMN agreed, the best solution is O(n); please ignore my previous post, my solution is only O(log n). – Marko Jan 18 '16 at 21:49
- There is an enormous literature on this. Google "nearest neighbor search." The problem of finding the 5 nearest neighbors can be solved by finding the nearest, temporarily considering it deleted, and finding the nearest again, etc., 5 times. NB the binary heap and BST suggestions aren't the right way forward; @SalvadorDali has the right idea. – Gene Jan 18 '16 at 22:24
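Taken together, the comments above describe the brute-force baseline: for each query point, select the 5 smallest distances without a full sort. A minimal sketch of that baseline (the function name and sample data are illustrative, not from the question):

```python
import heapq
import math

def five_nearest_brute(points, query):
    """Return the 5 points nearest to `query` (Euclidean), excluding `query` itself."""
    # heapq.nsmallest keeps a bounded heap of size 5, so no full sort is needed.
    return heapq.nsmallest(
        5,
        (p for p in points if p != query),
        key=lambda p: math.hypot(p[0] - query[0], p[1] - query[1]),
    )

pts = [(52.5, 13.4), (48.9, 2.4), (51.5, -0.1), (40.7, -74.0),
       (34.1, -118.2), (41.9, 12.5), (55.8, 37.6)]
print(five_nearest_brute(pts, pts[0]))
```

This is O(n) per query, hence O(n^2) over all points, which is exactly the cost the question is trying to avoid.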
1 Answer
There are a couple of solutions, depending on your data (about which you have said nothing) and how sure you want to be.
- If your data is uniformly distributed, you can create a grid on top of your data and assign points to cells. After that, for every element you find out which cell it belongs to and compare distances within that cell (and the closest neighboring cells). With good grid selection, and assuming there are k elements per cell on average, this can give you a potential O(n * k^2) running time. Look at this answer for more explanation, and see the grid sketch after this list.
- Knowing nothing about the data, you can construct a 2-d tree (k-d tree) in O(n log n) time and then, for every point in your database, ask what the closest point to it is (each query takes O(log n), and you ask it for all n points). So the total complexity is O(n log n); see the k-d tree sketch after this list.
- Another approach is a probabilistic one called locality-sensitive hashing. The wiki page is convoluted, and even knowing what it is, I found that page hard to read. Take a look at this description to understand it better.
- @Gene suggested another approach using a quadtree (I have not heard about this kind of tree, so I will just leave it at that).
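For the grid idea in the first bullet, here is a rough sketch, assuming roughly uniformly distributed points over a small planar extent; the cell size and helper names are my own assumptions, not a fixed recipe:

```python
from collections import defaultdict

def build_grid(points, cell):
    """Bucket each (x, y) point into an integer grid cell of side `cell`."""
    grid = defaultdict(list)
    for p in points:
        grid[(int(p[0] // cell), int(p[1] // cell))].append(p)
    return grid

def nearby(grid, p, cell):
    """Candidates from the point's own cell and the 8 surrounding cells."""
    cx, cy = int(p[0] // cell), int(p[1] // cell)
    out = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            out.extend(grid.get((cx + dx, cy + dy), []))
    # The 5 nearest are then picked from `out`, e.g. with a bounded heap.
    return out
```

If a point's own cell plus its 8 neighbors yield fewer than 6 candidates (the point itself plus 5 others), the ring of cells has to be widened until enough candidates are found.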
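For the 2-d tree bullet, a sketch using SciPy's k-d tree (`scipy.spatial.cKDTree`). Note this treats latitude/longitude as planar coordinates, which is only reasonable over small regions; for global data you would project first or use a distance that respects the sphere:

```python
import numpy as np
from scipy.spatial import cKDTree

coords = np.array([[52.5, 13.4], [48.9, 2.4], [51.5, -0.1], [40.7, -74.0]])
tree = cKDTree(coords)                  # built in O(n log n)
dist, idx = tree.query(coords[0], k=2)  # k=2: the nearest hit is the query itself
print(coords[idx[1]], dist[1])          # closest *other* point and its distance
```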
So, as you can see, it is possible to get better than O(n^2) complexity for this task.
All of these methods describe how to find the single closest point to a given query point. Obviously, once you have found the closest point, you can remove it and search again, and so on, until you have found the 5 closest points for your point. With a k-d tree you can also get all 5 in one query, as sketched below.
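A sketch of that one-query variant, under the same planar-distance assumption as above (the random data is just a stand-in for your array): asking for the 6 nearest neighbors of every point and dropping the first hit, which is the point itself, yields the 5 nearest others for all points at once.

```python
import numpy as np
from scipy.spatial import cKDTree

coords = np.random.rand(1000, 2) * 100  # stand-in for your lat/lon array
tree = cKDTree(coords)
dist, idx = tree.query(coords, k=6)     # 6 nearest: the first is the point itself
five_nearest = idx[:, 1:]               # shape (n, 5): 5 nearest others per point
```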


Salvador Dali