
We are given a huge set of points in the 2D plane. We need to find, for each point, the closest other point within the set. For instance, suppose the initial set is as follows:

 foo <- data.frame(x=c(1,2,4,4,10),y=c(1,2,4,4,10))

The output should be like this:

 ClosesPair(foo)
 2
 1
 4
 3
 3 # (could be 4 also)

Any ideas?

madth3
Ali
  • Similar question: http://stackoverflow.com/questions/16474179/how-to-calculate-euclidean-distance-and-save-only-summaries-for-large-data-fra/16474415#16474415 – flodel May 21 '13 at 10:43

3 Answers


The traditional approach is to preprocess the data and put it in a data structure, often a K-d tree, for which the "nearest point" query is very fast.

There is an implementation in the nnclust package.

library(nnclust)
foo <- cbind(x=c(1,2,4,4,10), y=c(1,2,4,4,10))
i <- nnfind(foo)$neighbour   # index of each point's nearest neighbour
plot(foo)
arrows(foo[,1], foo[,2], foo[i,1], foo[i,2])
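
If nnclust proves hard to install, the RANN package offers a comparable k-d tree search through its nn2() function; this alternative is a suggestion of mine, not part of the original answer. With k = 2, the nearest neighbour of each point is the point itself, so the second column gives the closest other point:

library(RANN)
foo <- cbind(x=c(1,2,4,4,10), y=c(1,2,4,4,10))
# column 1 of nn.idx is each point itself; column 2 is its nearest other point
nn <- nn2(foo, k = 2)
nn$nn.idx[, 2]
# [1] 2 1 4 3 3   (the tie between points 3 and 4 may resolve either way)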
Vincent Zoonekynd

Here is an example, all wrapped into a single function. You might want to split it up a bit for optimization.

ClosesPair <- function(foo) {
  # Euclidean distance from point i to point(s) j
  dist <- function(i, j) {
    sqrt((foo[i,1]-foo[j,1])**2 + (foo[i,2]-foo[j,2])**2)
  }

  foo <- as.matrix(foo)

  # Index of the point closest to point i, among all other points
  ClosestPoint <- function(i) {
    indices <- 1:nrow(foo)
    indices <- indices[-i]

    distances <- sapply(indices, dist, i=i, USE.NAMES=TRUE)

    closest <- indices[which.min(distances)]
  }

  sapply(1:nrow(foo), ClosestPoint)
}
ClosesPair(foo)
# [1] 2 1 4 3 3

Of course, it does not handle ties very well.

MrGumble
  • Thanks, but will it work for a huge set of points, I mean, for example, 1M points? – Ali May 21 '13 at 10:28
  • 1
    Yes, but efficiently? Not sure. It will run in O(n^2) as for each point it needs to re-calculate all distances, so if you want the optimum solution, there is no way to get around this. Space-wise, which is usually the more pressing issue, it only uses O(n). If you need to speed things up, you can look into parallel processing, which this function easily can be rewritten as, or a heuristic search algorithm, which heavily depends on your data. – MrGumble May 21 '13 at 10:30
  • 2
    You are not taking advantage of the fact your `dist` function is already vectorized... Replace `sapply(indices, dist, i=i, USE.NAMES=TRUE)` with `dist(i, indices)` and you will see a huge improvement. You can also make things faster by using indices more efficiently, for example, `indices <- 1:nrow(foo)` should not be computed inside the loop. And replace `sapply` with `vapply`. – flodel May 21 '13 at 10:50
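
A sketch of the vectorized version flodel describes (my own rendering of those suggestions; the function name ClosesPairVec is hypothetical):

ClosesPairVec <- function(foo) {
  foo <- as.matrix(foo)
  indices <- 1:nrow(foo)          # computed once, outside the loop
  # dist() is already vectorized over j
  dist <- function(i, j) {
    sqrt((foo[i,1]-foo[j,1])^2 + (foo[i,2]-foo[j,2])^2)
  }
  vapply(indices, function(i) {
    others <- indices[-i]
    others[which.min(dist(i, others))]   # one vectorized call per point
  }, integer(1))
}
ClosesPairVec(foo)
# [1] 2 1 4 3 3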

Use the package spatstat. It has built-in functions to do exactly this sort of thing.
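
The answer does not name a function; presumably nnwhich() is meant (my guess), since it returns the index of each point's nearest neighbour directly:

library(spatstat)   # in recent versions, nnwhich() is in spatstat.geom
foo <- data.frame(x=c(1,2,4,4,10), y=c(1,2,4,4,10))
# for each point, the index of its nearest neighbour among the others
nnwhich(foo$x, foo$y)
# [1] 2 1 4 3 3   (ties may resolve either way)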

Carl Witthoft