
We are given a huge set of points in the 2D plane. We need to find, for each point, the closest other point within the set. For instance, suppose the initial set is as follows:

 foo <- data.frame(x=c(1,2,4,4,10),y=c(1,2,4,4,10))

The output should be like this:

 ClosesPair(foo)
 2
 1
 4
 3
 3 # (could be 4 also)

Any ideas?

madth3
Ali
  • Similar question: http://stackoverflow.com/questions/16474179/how-to-calculate-euclidean-distance-and-save-only-summaries-for-large-data-fra/16474415#16474415 – flodel May 21 '13 at 10:43

3 Answers


The traditional approach is to preprocess the data and put it in a data structure, often a K-d tree, for which the "nearest point" query is very fast.

There is an implementation in the nnclust package.

library(nnclust)
foo <- cbind(x=c(1,2,4,4,10), y=c(1,2,4,4,10))
i <- nnfind(foo)$neighbour   # index of each point's nearest neighbour
plot(foo)
arrows(foo[,1], foo[,2], foo[i,1], foo[i,2])
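
If nnclust proves hard to install, the RANN package offers a comparable k-d tree search through its nn2() function; this alternative is a suggestion of mine, not part of the original answer. With k = 2, the nearest neighbour of each point is the point itself, so the second column gives the closest other point:

library(RANN)
foo <- cbind(x=c(1,2,4,4,10), y=c(1,2,4,4,10))
# column 1 of nn.idx is each point itself; column 2 is its nearest other point
nn <- nn2(foo, k = 2)
nn$nn.idx[, 2]
# [1] 2 1 4 3 3   (the tie between points 3 and 4 may resolve either way)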
Vincent Zoonekynd

Here is an example, all wrapped into a single function. You might want to split it up a bit for optimization.

ClosesPair <- function(foo) {
  # Euclidean distance from point i to point(s) j
  dist <- function(i, j) {
    sqrt((foo[i,1]-foo[j,1])**2 + (foo[i,2]-foo[j,2])**2)
  }

  foo <- as.matrix(foo)

  # Index of the point closest to point i, among all other points
  ClosestPoint <- function(i) {
    indices <- 1:nrow(foo)
    indices <- indices[-i]

    distances <- sapply(indices, dist, i=i, USE.NAMES=TRUE)

    closest <- indices[which.min(distances)]
  }

  sapply(1:nrow(foo), ClosestPoint)
}
ClosesPair(foo)
# [1] 2 1 4 3 3

Of course, it does not handle ties very well.

MrGumble
  • Thanks, but will it work for a huge set of points, I mean, for example, 1M points? – Ali May 21 '13 at 10:28
  • 1
    Yes, but efficiently? Not sure. It will run in O(n^2) as for each point it needs to re-calculate all distances, so if you want the optimum solution, there is no way to get around this. Space-wise, which is usually the more pressing issue, it only uses O(n). If you need to speed things up, you can look into parallel processing, which this function easily can be rewritten as, or a heuristic search algorithm, which heavily depends on your data. – MrGumble May 21 '13 at 10:30
  • 2
    You are not taking advantage of the fact your `dist` function is already vectorized... Replace `sapply(indices, dist, i=i, USE.NAMES=TRUE)` with `dist(i, indices)` and you will see a huge improvement. You can also make things faster by using indices more efficiently, for example, `indices <- 1:nrow(foo)` should not be computed inside the loop. And replace `sapply` with `vapply`. – flodel May 21 '13 at 10:50
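
A sketch of the vectorized version flodel describes (my own rendering of those suggestions; the function name ClosesPairVec is hypothetical):

ClosesPairVec <- function(foo) {
  foo <- as.matrix(foo)
  indices <- 1:nrow(foo)          # computed once, outside the loop
  # dist() is already vectorized over j
  dist <- function(i, j) {
    sqrt((foo[i,1]-foo[j,1])^2 + (foo[i,2]-foo[j,2])^2)
  }
  vapply(indices, function(i) {
    others <- indices[-i]
    others[which.min(dist(i, others))]   # one vectorized call per point
  }, integer(1))
}
ClosesPairVec(foo)
# [1] 2 1 4 3 3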

Use the package spatstat. It has built-in functions to do exactly this sort of thing.
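
The answer does not name a function; presumably nnwhich() is meant (my guess), since it returns the index of each point's nearest neighbour directly:

library(spatstat)   # in recent versions, nnwhich() is in spatstat.geom
foo <- data.frame(x=c(1,2,4,4,10), y=c(1,2,4,4,10))
# for each point, the index of its nearest neighbour among the others
nnwhich(foo$x, foo$y)
# [1] 2 1 4 3 3   (ties may resolve either way)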

Carl Witthoft