49

What is the fastest way to find the closest point to a given point in a data array?

For example, suppose I have an array A of 3D points (with coordinates x, y and z, as usual) and a point (x_p, y_p, z_p). How do I find the closest point in A to (x_p, y_p, z_p)?

As far as I know, the slowest way to do it is a linear search. Are there any better solutions?

Adding an auxiliary data structure is acceptable.
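
For concreteness, the linear-search baseline would be something like the sketch below (Point3D here is just an illustrative class, not an existing API):

import java.util.List;

class Point3D {
    double x, y, z;
    Point3D(double x, double y, double z) { this.x = x; this.y = y; this.z = z; }
}

class LinearSearch {
    // O(n) scan; squared distances are compared so no square roots are needed
    static Point3D nearest(List<Point3D> a, double xp, double yp, double zp) {
        Point3D best = null;
        double bestSq = Double.POSITIVE_INFINITY;
        for (Point3D p : a) {
            double dx = p.x - xp, dy = p.y - yp, dz = p.z - zp;
            double dSq = dx * dx + dy * dy + dz * dz;
            if (dSq < bestSq) { bestSq = dSq; best = p; }
        }
        return best;
    }
}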

nbro
qutron

9 Answers

31

You may organize your points in an octree. Then you only need to search a small subset.

An octree is a fairly simple data structure you can implement yourself (which would be a valuable learning experience), or you may find some helpful libraries to get you going.
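
For illustration, here is a compressed sketch of the idea: a point-region octree with bucketed leaves and a pruned recursive nearest-neighbour query. All names (Octree, Pt, BUCKET) are made up for this sketch, and it assumes the points fit in a known bounding cube; a real library version would be more robust.

import java.util.ArrayList;
import java.util.List;

class Pt {
    double x, y, z;
    Pt(double x, double y, double z) { this.x = x; this.y = y; this.z = z; }
}

class Octree {
    static final int BUCKET = 8;       // max points per leaf before splitting
    final double cx, cy, cz, half;     // cube center and half-edge
    private List<Pt> pts = new ArrayList<>(); // leaf contents; null once split
    private Octree[] kids;             // 8 children once split

    Octree(double cx, double cy, double cz, double half) {
        this.cx = cx; this.cy = cy; this.cz = cz; this.half = half;
    }

    void insert(Pt p) {
        if (kids != null) { kids[octant(p.x, p.y, p.z)].insert(p); return; }
        pts.add(p);
        if (pts.size() > BUCKET && half > 1e-9) { // stop splitting at tiny cells
            kids = new Octree[8];
            double h = half / 2;
            for (int i = 0; i < 8; i++)
                kids[i] = new Octree(cx + (((i & 1) == 0) ? -h : h),
                                     cy + (((i & 2) == 0) ? -h : h),
                                     cz + (((i & 4) == 0) ? -h : h), h);
            for (Pt q : pts) kids[octant(q.x, q.y, q.z)].insert(q);
            pts = null;
        }
    }

    private int octant(double x, double y, double z) {
        return (x >= cx ? 1 : 0) | (y >= cy ? 2 : 0) | (z >= cz ? 4 : 0);
    }

    // squared distance from (x,y,z) to this node's cube; 0 if inside
    private double boxDistSq(double x, double y, double z) {
        double dx = Math.max(Math.abs(x - cx) - half, 0);
        double dy = Math.max(Math.abs(y - cy) - half, 0);
        double dz = Math.max(Math.abs(z - cz) - half, 0);
        return dx * dx + dy * dy + dz * dz;
    }

    static final class Best { Pt p; double dSq = Double.POSITIVE_INFINITY; }

    void nearest(double x, double y, double z, Best best) {
        if (boxDistSq(x, y, z) >= best.dSq) return; // prune: cube can't beat current best
        if (kids == null) {
            for (Pt p : pts) {
                double dx = p.x - x, dy = p.y - y, dz = p.z - z;
                double d = dx * dx + dy * dy + dz * dz;
                if (d < best.dSq) { best.dSq = d; best.p = p; }
            }
            return;
        }
        int first = octant(x, y, z); // descend toward the query first
        kids[first].nearest(x, y, z, best);
        for (int i = 0; i < 8; i++)
            if (i != first) kids[i].nearest(x, y, z, best);
    }
}

// usage: Octree t = new Octree(0, 0, 0, 1000); t.insert(new Pt(...)); then
// Octree.Best b = new Octree.Best(); t.nearest(qx, qy, qz, b); b.p is the answer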

nbro
dkamins
  • 9
    The algorithms suggested here are effective only if we need to repeatedly search for a nearest neighbor for a lot of points. If we just need the information for one point, a linear search is more efficient. – efficiencyIsBliss Dec 07 '10 at 20:42
  • 2
    Elaborating on my comment, building the tree itself (kd-tree or octree) will be worse than linear. I'm not sure about octrees, but kd-trees take O(N log N) to build. Therefore, for a single query, a linear search is better. – efficiencyIsBliss Dec 07 '10 at 20:44
  • 1
    @efficiencyIsBliss But do you know the cost of calculating hundreds of thousands of multiplications just for the kNN of a single point? This can be reduced to just a few distance calculations with an octree, and the cost of building the octree is much less than the overhead of the numerous multiplications in the distance calculations of a linear search, even for just one point, because nowadays a point cloud easily has more than 100k points. – Gab是好人 May 08 '16 at 10:59
21

If you're doing a one-off nearest-neighbour query, then a linear search is really the best you can get. This of course assumes the data isn't pre-structured.

However, if you're going to be doing lots of queries, there are a few space-partitioning data structures. These take some preprocessing to build, but can then answer nearest-neighbour queries very fast.

Since you're dealing with 3D space, I recommend looking at either octrees or kd-trees. Kd-trees are more generic (they work for arbitrary dimensions) and can be made more efficient than octrees if you implement a suitable balancing algorithm (e.g. median works well), but octrees are easier to implement.

ANN is a great library using these data structures, but it also allows for approximate nearest-neighbour queries, which are significantly faster but have a small error as they're just approximations. If you can't tolerate any error, set the error bound to 0.
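
To make the kd-tree option concrete, here is a minimal hand-rolled 3D sketch with a median-split build and a pruned recursive query. This is not ANN's actual API; all names are illustrative, and a production version would use a selection algorithm instead of a full sort at each level.

import java.util.Arrays;
import java.util.Comparator;

public class KdTree {
    static final class Node {
        final double[] p;   // the point stored at this node
        Node left, right;
        Node(double[] p) { this.p = p; }
    }
    private Node root;

    public KdTree(double[][] points) { root = build(points, 0, points.length, 0); }

    private Node build(double[][] pts, int lo, int hi, int axis) {
        if (lo >= hi) return null;
        int mid = (lo + hi) / 2;
        // sorting the whole subrange at each level is wasteful but short;
        // a linear-time selection of the median would be cheaper
        Arrays.sort(pts, lo, hi, Comparator.comparingDouble((double[] a) -> a[axis]));
        Node n = new Node(pts[mid]);
        n.left = build(pts, lo, mid, (axis + 1) % 3);
        n.right = build(pts, mid + 1, hi, (axis + 1) % 3);
        return n;
    }

    private double[] best;
    private double bestSq;

    public double[] nearest(double[] q) {
        best = null; bestSq = Double.POSITIVE_INFINITY;
        search(root, q, 0);
        return best;
    }

    private void search(Node n, double[] q, int axis) {
        if (n == null) return;
        double dSq = 0;
        for (int i = 0; i < 3; i++) { double d = n.p[i] - q[i]; dSq += d * d; }
        if (dSq < bestSq) { bestSq = dSq; best = n.p; }
        double diff = q[axis] - n.p[axis];
        Node near = diff < 0 ? n.left : n.right;
        Node far  = diff < 0 ? n.right : n.left;
        search(near, q, (axis + 1) % 3);
        // only cross the splitting plane if a closer point could lie beyond it
        if (diff * diff < bestSq) search(far, q, (axis + 1) % 3);
    }
}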

moinudin
6

I suggest a KD-tree; it will work fine here, and it is also good for k-nearest-neighbour searches.

nbro
Vedang Joshi
3

I needed to do this rather heavily, for many nearest-neighbour searches in a real-time environment, and hit on an algorithm that is better in terms of both simplicity and speed.

Take all your points and put a copy into d lists, where d is the dimensionality of the space (3 in your case). Sort each list along its own dimension. This costs O(d·n·log(n)) time. And that's it for the data structure.

We maintain these sorted lists, one per dimension, for all the points in question. The trick is that, by definition, the distance along a single axis must be less than or equal to the Euclidean distance. So if the distance along one axis is greater than the distance to our current closest known point, that point cannot be closer, and more importantly no point farther along in that direction can be closer either. Once this holds for all 2·d directions, we have, by definition, found the closest point.

For a given query point we binary-search into the sorted lists to find the position where it would fall in each dimension. Mathematically, if the distance in the +x, -x, +y or -y direction (other dimensions are easy to add) exceeds the smallest known Euclidean distance m, then that point's Euclidean distance must exceed it too; and since the array is sorted, once we exceed that distance in a direction we can abort that direction, because there can be no better answer along it. Meanwhile, as we expand out in these four directions, we keep reducing m, because m is the Euclidean distance of the closest point found so far.

So we only need sorted lists for each axis sorted according to that axis. Which is pretty simple.

Then to query the list:

  • We binary search into each of the lists (dlog(n)).
  • We find our current minimum distance, m. (initially it can be infinity)
  • For each list, we travel in the positive and negative directions.
  • For each of the 2*d directions we have,
    • We traverse the lists, lowering m when we find closer points.
  • When a direction proves itself to be mathematically fruitless, we stop searching that way.
  • When no direction remains we have found our closest point.

We have sorted lists and need to find our query's position in each of them; we binary search to keep this at log(n). Then we take our best current distance m (initially infinity) and move outward in each available direction. As we find new points, we update the closest point found so far. The trick is that we quit a direction as soon as the distance along that one axis alone is farther than our current known closest distance.

So if we have a point at a known closest distance of 13, we can abort checking in the +x, -x, +y, -y directions as soon as the distance along just that axis exceeds 13. Because if a point is farther along +x than our current m, all the remaining values in the +x direction can mathematically be proven to be even farther away. As we find better and better closest points, the amount of space we need to search gets smaller and smaller.

If we run out of points in a direction, that direction is finished. If the distance to a point along just that one dimension of the line is itself greater than m, that direction is finished.

The solution is m once every direction has been proven to contain only points that must be farther than our best point so far.

Since we progressively reduce m, the region that has to be searched shrinks quickly, though like all such algorithms it shrinks less quickly in higher dimensions. But if the distance in just one dimension is greater than the best we have so far, it must necessarily be the case that all the remaining points in that direction can't be better.

In time complexity it seems on par with the better algorithms, but in simplicity of the data structures this algorithm clearly wins. It has a number of other properties that make it a serious contender. When you update points, you can re-sort the lists with really good performance, because you are very often sorting already sorted or nearly sorted lists.

You are also just iterating arrays. In terms of real performance, most tree data structures suffer: because of caching and how memory is laid out, the data right next to your current relevant data is much faster to access; we are supposed to be agnostic about such things, but it matters a lot. If we already know roughly where the query will land in the lists, we can answer even faster (skipping the binary search), and other permitted tricks reuse information from the previous iteration here and there. Additional dimensions are basically free, save that m then converges more slowly, because there are more randomly distributed points in a sphere than in a circle of the same radius.


import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.Comparator;

public class EuclideanNeighborSearch2D {
    /** Simple 2D point with double coordinates; the original answer assumes such a class exists. */
    public static class Point {
        public double x, y;
        public Point(double x, double y) { this.x = x; this.y = y; }
    }

    public static final int INVALID = -1;
    static final Comparator<Point> xsort = new Comparator<Point>() {
        @Override
        public int compare(Point o1, Point o2) {
            return Double.compare(o1.x, o2.x);
        }
    };
    static final Comparator<Point> ysort = new Comparator<Point>() {
        @Override
        public int compare(Point o1, Point o2) {
            return Double.compare(o1.y, o2.y);
        }
    };

    ArrayList<Point> xaxis = new ArrayList<>();
    ArrayList<Point> yaxis = new ArrayList<>();

    boolean dirtySortX = false;
    boolean dirtySortY = false;

    public Point findNearest(float x, float y, float minDistance, float maxDistance) {
        Point find = new Point(x,y);

        sortXAxisList();
        sortYAxisList();

        double findingDistanceMaxSq = maxDistance * maxDistance;
        double findingDistanceMinSq = minDistance * minDistance;

        Point findingIndex = null;

        int posx = Collections.binarySearch(xaxis, find, xsort);
        int posy = Collections.binarySearch(yaxis, find, ysort);
        if (posx < 0) posx = ~posx;
        if (posy < 0) posy = ~posy;

        int mask = 0b1111;

        Point v;

        double vx, vy;
        int o;
        int itr = 0;
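    // itr & 3 selects one of the four directions (+x, +y, -x, -y); itr >> 2 is
    // the number of steps taken outward from the binary-search position in that
    // direction; each bit of mask marks a direction still worth searching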
        while (mask != 0) {
            if ((mask & (1 << (itr & 3))) == 0) {
                itr++;
                continue; //if that direction is no longer used.
            }
            switch (itr & 3) {
                default:
                case 0: //+x
                    o = posx + (itr++ >> 2);
                    if (o >= xaxis.size()) {
                        mask &= 0b1110;
                        continue;
                    }
                    v = xaxis.get(o);
                    vx = x - v.x;
                    vy = y - v.y;
                    vx *= vx;
                    vy *= vy;
                    if (vx > findingDistanceMaxSq) {
                        mask &= 0b1110;
                        continue;
                    }
                    break;
                case 1: //+y
                    o = posy + (itr++ >> 2);
                    if (o >= yaxis.size()) {
                        mask &= 0b1101;
                        continue;
                    }
                    v = yaxis.get(o);
                    vx = x - v.x;
                    vy = y - v.y;
                    vx *= vx;
                    vy *= vy;
                    if (vy > findingDistanceMaxSq) {
                        mask &= 0b1101;
                        continue;
                    }
                    break;
                case 2: //-x
                    o = posx + ~(itr++ >> 2);
                    if (o < 0) {
                        mask &= 0b1011;
                        continue;
                    }
                    v = xaxis.get(o);
                    vx = x - v.x;
                    vy = y - v.y;
                    vx *= vx;
                    vy *= vy;
                    if (vx > findingDistanceMaxSq) {
                        mask &= 0b1011;
                        continue;
                    }
                    break;
                case 3: //-y
                    o = posy + ~(itr++ >> 2);
                    if (o < 0) {
                        mask = mask & 0b0111;
                        continue;
                    }
                    v = yaxis.get(o);
                    vx = x - v.x;
                    vy = y - v.y;
                    vx *= vx;
                    vy *= vy;
                    if (vy > findingDistanceMaxSq) {
                        mask = mask & 0b0111;
                        continue;
                    }
                    break;
            }
            double d = vx + vy;

            if (d <= findingDistanceMinSq) continue;

            if (d < findingDistanceMaxSq) {
                findingDistanceMaxSq = d;
                findingIndex = v;
            }

        }
        return findingIndex;
    }

    private void sortXAxisList() {
        if (!dirtySortX) return;
        Collections.sort(xaxis, xsort);
        dirtySortX = false;
    }

    private void sortYAxisList() {
        if (!dirtySortY) return;
        Collections.sort(yaxis,ysort);
        dirtySortY = false;
    }

    /**
     * Called when the points may have been invalidated for some reason,
     * such as being moved outside of this class or otherwise updated.
     */
    public void update() {
        dirtySortX = true;
        dirtySortY = true;
    }

    /**
     * Called to add a point to the sorted lists without needing to re-sort them.
     * @param p Point to add.
     */
    public final void add(Point p) {
        sortXAxisList();
        sortYAxisList();
        int posx = Collections.binarySearch(xaxis, p, xsort);
        int posy = Collections.binarySearch(yaxis, p, ysort);
        if (posx < 0) posx = ~posx;
        if (posy < 0) posy = ~posy;
        xaxis.add(posx, p);
        yaxis.add(posy, p);
    }

    /**
     * Called to remove a point from the sorted lists without needing to re-sort them.
     * @param p Point to remove.
     */
    public final void remove(Point p) {
        sortXAxisList();
        sortYAxisList();
        int posx = Collections.binarySearch(xaxis, p, xsort);
        int posy = Collections.binarySearch(yaxis, p, ysort);
        if (posx < 0) posx = ~posx;
        if (posy < 0) posy = ~posy;
        xaxis.remove(posx);
        yaxis.remove(posy);
    }
}
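
A usage sketch for the class above (using the nested Point class with the public x and y fields that the comparators require):

EuclideanNeighborSearch2D search = new EuclideanNeighborSearch2D();
search.add(new EuclideanNeighborSearch2D.Point(1.0, 2.0));
search.add(new EuclideanNeighborSearch2D.Point(4.0, 4.5));
search.add(new EuclideanNeighborSearch2D.Point(-3.0, 0.5));

// nearest point to (0,0) strictly farther away than minDistance (0 here, so an
// exact match at the query position is skipped) and within maxDistance 100
EuclideanNeighborSearch2D.Point nearest = search.findNearest(0f, 0f, 0f, 100f);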

Update: With regard to the k-points problem raised in the comments: very little changes. The only relevant difference is that when a point v is found to be closer than the current m (findingDistanceMaxSq), it is added to the heap, and m is set to the Euclidean distance between the query position and the kth element. The regular version of the algorithm can be seen as the case k = 1: we search for the single element we want, and update m to the distance of that one (k = 1) element whenever a closer v is found.

Keep in mind that I only ever compare distances in squared form, since I only ever need to know whether one is farther than another, so I don't waste clock cycles on square-root functions.

And I know the perfect data structure for storing the k elements is a size-limited heap; an array insertion is obviously not optimal for that. But, apart from some very Java-dependent APIs, there simply wasn't a standard class for it, though apparently Google Guava provides one. You won't really notice, since odds are your k is not that big, but it does make insertion into the stored points O(k). There are also tricks like caching each element's distance from the query point.

Finally, and likely most pressingly, the project I would use to test this code is in transition, so I haven't managed to test it. But it certainly shows how you do this: you store the k best results found so far and make m equal to the distance to the kth closest point. All else remains the same.

Example source (these methods extend the same class as above):

public static double distanceSq(double x0, double y0, double x1, double y1) {
    double dx = x1 - x0;
    double dy = y1 - y0;
    dx *= dx;
    dy *= dy;
    return dx + dy;
}
public Collection<Point> findNearest(int k, final float x, final float y, float minDistance, float maxDistance) {
    sortXAxisList();
    sortYAxisList();

    double findingDistanceMaxSq = maxDistance * maxDistance;
    double findingDistanceMinSq = minDistance * minDistance;
    ArrayList<Point> kpointsShouldBeHeap = new ArrayList<>(k);
    Comparator<Point> euclideanCompare = new Comparator<Point>() {
        @Override
        public int compare(Point o1, Point o2) {
            return Double.compare(distanceSq(x, y, o1.x, o1.y), distanceSq(x, y, o2.x, o2.y));
        }
    };

    Point find = new Point(x, y);
    int posx = Collections.binarySearch(xaxis, find, xsort);
    int posy = Collections.binarySearch(yaxis, find, ysort);
    if (posx < 0) posx = ~posx;
    if (posy < 0) posy = ~posy;

    int mask = 0b1111;

    Point v;

    double vx, vy;
    int o;
    int itr = 0;
    while (mask != 0) {
        if ((mask & (1 << (itr & 3))) == 0) {
            itr++;
            continue; //if that direction is no longer used.
        }
        switch (itr & 3) {
            default:
            case 0: //+x
                o = posx + (itr++ >> 2);
                if (o >= xaxis.size()) {
                    mask &= 0b1110;
                    continue;
                }
                v = xaxis.get(o);
                vx = x - v.x;
                vy = y - v.y;
                vx *= vx;
                vy *= vy;
                if (vx > findingDistanceMaxSq) {
                    mask &= 0b1110;
                    continue;
                }
                break;
            case 1: //+y
                o = posy + (itr++ >> 2);
                if (o >= yaxis.size()) {
                    mask &= 0b1101;
                    continue;
                }
                v = yaxis.get(o);
                vx = x - v.x;
                vy = y - v.y;
                vx *= vx;
                vy *= vy;
                if (vy > findingDistanceMaxSq) {
                    mask &= 0b1101;
                    continue;
                }
                break;
            case 2: //-x
                o = posx + ~(itr++ >> 2);
                if (o < 0) {
                    mask &= 0b1011;
                    continue;
                }
                v = xaxis.get(o);
                vx = x - v.x;
                vy = y - v.y;
                vx *= vx;
                vy *= vy;
                if (vx > findingDistanceMaxSq) {
                    mask &= 0b1011;
                    continue;
                }
                break;
            case 3: //-y
                o = posy + ~(itr++ >> 2);
                if (o < 0) {
                    mask = mask & 0b0111;
                    continue;
                }
                v = yaxis.get(o);
                vx = x - v.x;
                vy = y - v.y;
                vx *= vx;
                vy *= vy;
                if (vy > findingDistanceMaxSq) {
                    mask = mask & 0b0111;
                    continue;
                }
                break;
        }
        double d = vx + vy;
        if (d <= findingDistanceMinSq) continue;
        if (d < findingDistanceMaxSq) {
            int insert = Collections.binarySearch(kpointsShouldBeHeap, v, euclideanCompare);
            if (insert < 0) insert = ~insert;
            kpointsShouldBeHeap.add(insert, v);
            if (k < kpointsShouldBeHeap.size()) {
                Point kthPoint = kpointsShouldBeHeap.get(k);
                findingDistanceMaxSq = distanceSq(x, y, kthPoint.x, kthPoint.y);
            }
        }
    }
    // optionally trim the result to exactly k elements; note subList returns a
    // view, so clearing it is what actually removes the extras:
    // if (kpointsShouldBeHeap.size() > k)
    //     kpointsShouldBeHeap.subList(k, kpointsShouldBeHeap.size()).clear();
    return kpointsShouldBeHeap;
}
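
And a corresponding usage sketch for the k-variant (continuing the `search` instance from the earlier example):

// the five nearest neighbours of (0,0), no minimum distance, search radius 100;
// the returned collection may hold a few extra candidates unless trimmed to k
// as discussed in the commented-out lines above
Collection<EuclideanNeighborSearch2D.Point> five = search.findNearest(5, 0f, 0f, 0f, 100f);
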
Tatarize
  • "We binary search into each of the lists (dlog(n)). We find our current minimum distance, m." Could you please elaborate on this phrase? What are we binary searching for, and how do we find the current minimum distance? – MyStackRunnethOver Aug 02 '17 at 22:35
  • Thank you! I know it's a stretch, but do you think it's feasible to extend this algorithm to K-nearest-neighbors? Seems like it would have to be done by finding the 1st nearest neighbor, saving it, then excluding it from consideration and repeating the algorithm to find the 2nd nearest neighbor, etc. Unless I'm missing something? – MyStackRunnethOver Aug 04 '17 at 18:41
  • 1
    Hm. It's an interesting idea, obviously we could store the k items in a heap (or priority queue) though, we just change the definition of m such that rather than the distance of the best currently found point, m is the farthest distance of the k-points in the heap. So the same tricks apply. We just keep a heap of the best points we found thus far, and m is the distance of the worst best points we've found. For this a heap is going to give us the best results as we need to keep kicking out the worst point in the k-items. – Tatarize Aug 05 '17 at 07:22
  • 1
    The algorithm runs as a single pass, which will be much faster than the k·n·log(n) exclusion method you suggest. The hit we take is only that the k-value of those items causes our value of m to converge more slowly, so we end up considering points we wouldn't have otherwise, which is ultimately and fundamentally correct. This is much like what happens when we consider this algorithm in more dimensions: the algorithm remains exactly the same, but the trick of cutting off a direction doesn't come into effect as quickly. – Tatarize Aug 05 '17 at 07:27
  • 1
    @MyStackRunnethOver See the proof-of-concept code. Yeah, it's a trivial change: same core algorithm, but we keep track of the top-k elements and keep m equal to the distance of the most distant of those k elements. The same trick applies: if the distance in any single direction is more than the distance we'd have to beat to make the current list of best elements, we know that that point cannot make the list of top-k elements, and furthermore that no point in that direction could. – Tatarize Aug 05 '17 at 08:46
  • Wow you really went above and beyond on this, thank you! Changing the distance cutoff to the most distant point is a really nifty solution - the fact that the original algorithm would halt searching too early is exactly what tripped me up. As to the heap - I see what you're saying - I'll have to see whether linear `k` is acceptable for my application – MyStackRunnethOver Aug 06 '17 at 03:32
  • 1
    Much of the trick is to simply mathematically prove as many points as you can, can't be closer than the points you already have. This is also why the algorithm will end up checking most points in 3d like all other NNS algorithms, even the less elegant space partitioning ones. A lot of other related algorithms could be made this way using a different metric or a slight modification. – Tatarize Aug 06 '17 at 04:26
  • 1
    In practice since the k-points are likely tiny compared to the number of n points you have it's doubtful a k-time insertion is going to take longer. Saying it should be a static array heap with cached euclidean distance is just hyper-optimizing to keep the theoretical time complexity down, and might not have any real impact on actual runtimes. – Tatarize Aug 06 '17 at 04:32
  • 1
    I'll edit it again when I get to modifying my production code. There are a number of other speed tricks (even though this was much faster than all other NNS I used for my purpose) that aren't currently there. For example, it is actually the case that when both directions on the same axis are proven fruitless, we can stop. If there are no more numbers to check in +x and -x, or +y and -y, or +z and -z, there are no more numbers to check, period. That's the *entire* number line. That, and since each list has its own copy of the same point, we could mark each point as checked and avoid reprocessing it. – Tatarize Aug 06 '17 at 04:45
2

I would use a KD-tree to do this in O(log(n)) time per query, assuming the points are randomly distributed or you have a way to keep the tree balanced.

http://en.wikipedia.org/wiki/Kd-tree

KD trees are excellent for this kind of spatial query, and even allow you to retrieve the nearest k neighbors to a query point.

Tom
1

It's my understanding that a quadtree is for 2D, but you can build something very similar for 3D (an octree, as noted in the comments). This will speed up your search, but it will take much more time to compute the index if done on the fly, so I would suggest calculating the index once and then storing it. On every lookup you figure out all of the exterior cells, then work your way inward looking for hits; it would look like peeling an orange. The speed greatly increases as the cells get smaller. Everything has a trade-off.

CrazyDart
  • BTW if you really have tons of points in the same quad, it's common to do a quad within a quad... and keep nesting to whatever resolution makes sense. For 3D this can cost a lot... 2D is usually not too bad. – CrazyDart Dec 03 '10 at 22:12
  • 1
    The 3d structure is called an octree. – moinudin Dec 03 '10 at 23:16
1

Unless they are organized in a proper data structure, the only way is a linear search.

Smi
ruslik
-1

The "Fastest" way to do it, considering the search ONLY, would be to use voxels. With a 1:1 point-voxel map, the access time is constant and really fast, just shift the coordinates to center your point origin at the voxel origin(if needed) and then just round-down the position and access the voxel array with that value. For some cases, this is a good choice. As explained before me, octrees are better when a 1:1 map is hard to get( too much points, too little voxel resolution, too much free space).

Santiago Pacheco
  • What if your input data has a cluster of points that are really close together? You would then need a very fine grid. To me this doesn't seem like a sensible approach. "For some cases, this is a good choice" is so vague that it doesn't say anything at all. Do you have any link to a paper or article that uses such a voxel map? Have you ever used such an approach yourself? – Andreas Haferburg Aug 13 '19 at 20:22
-2

Check this out; you can also consult the computational geometry chapter of CLRS: http://www.cs.ucsb.edu/~suri/cs235/ClosestPair.pdf

Anwit
  • Finding the nearest point to the current point is a different problem than finding which two points in a dataset are the closest to each other. – Tatarize Jan 08 '17 at 12:09