
Given a set of xy coordinates, how can I choose n points such that those n points are most distant from each other?

An inefficient method that probably wouldn't scale to a big dataset would be the following (identify the 20 points out of 1000 that are most distant from each other):

xy <- cbind(rnorm(1000), rnorm(1000))   # 1000 random 2-D points

n <- 20
bestavg <- 0
bestSet <- NA
for (i in 1:1000){
    # score a random subset of n points by its mean inter-point distance
    subset <- xy[sample(1:nrow(xy), n), ]
    avg <- mean(dist(subset))
    if (avg > bestavg) {
        # keep the best-scoring subset seen so far
        bestavg <- avg
        bestSet <- subset
    }
}
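The answers below call this search as a function, m1(xy, 20), so here it is wrapped up; this is a light restatement of the code above, with the number of restarts exposed as a parameter (the default of 1000 matches the loop above):

m1 <- function(xy, n, iters = 1000) {
    # random-restart search: try iters random subsets of n points and
    # keep the one with the largest mean inter-point distance
    bestavg <- 0
    bestSet <- NULL
    for (i in 1:iters) {
        candidate <- xy[sample(nrow(xy), n), ]
        avg <- mean(dist(candidate))
        if (avg > bestavg) {
            bestavg <- avg
            bestSet <- candidate
        }
    }
    bestSet
}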
Pascal
  • So suppose you have 10 points and you want to find the subset of, say, 4 points that maximises the sum of the 6 inter-point distances? – Spacedman Mar 03 '14 at 16:58
  • Yes, I think that would get at the result I'm looking for... – Pascal Mar 03 '14 at 17:02
  • The combinatorics are against you for 1000 points and a subset of 20. How about: compute all 1000x1000 distances, drop one of the two closest points, recompute the distances, repeat 980 times. Quicker than iterating over the roughly 10^41 possible combinations (see the quick check after these comments). – Spacedman Mar 03 '14 at 17:06
  • I agree with @Spacedman. But perhaps this is a question better asked on the [Computer Science SO](http://cs.stackexchange.com/)? An efficient algorithm will likely not be specific to R, and the users there may already know of the best algorithm. – nograpes Mar 03 '14 at 17:19
  • You are recalculating the distance matrix at each step. Why? Just subset it based on the remaining points at that step. – jlhoward Mar 03 '14 at 17:57
  • Well, what are you going to do with these points? No doubt there's a plethora of literature on distance selection in random sets, but depending on what your ultimate goal is, maybe you don't even need to go through this step. – Carl Witthoft Mar 03 '14 at 19:39
  • Related to [Quora: Johnny Ho's answer to How do you calculate the largest quadrilateral area inside a convex polygon?](http://www.quora.com/How-do-you-calculate-the-largest-quadrilateral-area-inside-a-convex-polygon/answer/Johnny-Ho) – smci Mar 04 '14 at 10:37
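As a quick check on the combinatorial claim in the comments above (a one-liner, not part of the original thread):

choose(1000, 20)   # about 3.4e41 candidate subsets, far too many to enumerate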

2 Answers


This code, based on Pascal's code, repeatedly drops the point with the smallest row sum in the distance matrix (the point that is, in total, closest to all the others) until only n points remain.

m2 <- function(xy, n){
    subset <- xy
    # full distance matrix, computed once and then shrunk in the loop
    alldist <- as.matrix(dist(subset))
    while (nrow(subset) > n) {
        # the point with the smallest row sum is, in total, closest to all
        # the others, so it contributes least to the spread: drop it
        cdists <- rowSums(alldist)
        closest <- which(cdists == min(cdists))[1]
        subset <- subset[-closest, , drop = FALSE]   # drop = FALSE keeps the matrix shape
        alldist <- alldist[-closest, -closest, drop = FALSE]
    }
    return(subset)
}

Run on a Gaussian cloud, where m1 is @pascal's function:

> set.seed(310366)
> xy <- cbind(rnorm(1000),rnorm(1000))
> m1s = m1(xy,20)
> m2s = m2(xy,20)

See who did best by looking at the sum of the interpoint distances:

> sum(dist(m1s))
[1] 646.0357
> sum(dist(m2s))
[1] 811.7975

Method 2 wins! And compare with a random sample of 20 points:

> sum(dist(xy[sample(1000,20),]))
[1] 349.3905

which does pretty poorly as expected.

So what's going on? Let's plot:

> plot(xy,asp=1)
> points(m2s,col="blue",pch=19)
> points(m1s,col="red",pch=19,cex=0.8)

(Plot: the Gaussian point cloud with method 2's selection in blue and method 1's selection in red.)

Method 1 generates the red points, which are spread fairly evenly over the space. Method 2 picks the blue points, which almost define the perimeter. I suspect the reason for this is easy to work out (and even easier to see in one dimension, as sketched below...).
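A small one-dimensional sketch (not from the original answer) makes this concrete: run m2 on points along a line and the survivors should pile up at the two ends instead of spreading evenly.

# points on a line, embedded in 2-D so m2 applies unchanged
x <- cbind(sort(runif(100)), 0)
sel <- m2(x, 10)
sort(sel[, 1])   # expect the selected coordinates to cluster near 0 and 1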

Using a bimodal pattern of initial points also illustrates this:

(Plot: the same comparison on a bimodal point cloud; method 2's blue points sit on the outer fringes, method 1's red points spread across both modes.)
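The objects xy2, m1s2 and m2s2 below are not defined in the answer; one plausible construction (a sketch, where the cluster offsets of ±3 are assumptions) is:

# two Gaussian clusters of 500 points each; the offsets are guesses
xy2 <- rbind(cbind(rnorm(500) - 3, rnorm(500)),
             cbind(rnorm(500) + 3, rnorm(500)))
m1s2 <- m1(xy2, 20)   # random-restart search from the question
m2s2 <- m2(xy2, 20)   # greedy smallest-row-sum drop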

Again method 2 produces a much larger total inter-point distance than method 1, and both do better than random sampling:

> sum(dist(m1s2))
[1] 958.3518
> sum(dist(m2s2))
[1] 1206.439
> sum(dist(xy2[sample(1000,20),]))
[1] 574.34
Spacedman
  • Though the results that come out of method m1 are more in line with what I was looking for, technically your solution does a better job of answering the question. – Pascal Mar 06 '14 at 04:01
  • Then I think you need to think carefully about what it is you are looking for, because it's not the set of points that have the maximum sum interpoint distance! Could it be the set of points A that minimises the sum of distances to the points *not* in A? That might give you something like `m1`, since it will try to evenly spread selected points among the unselected (see the sketch after these comments)... – Spacedman Mar 06 '14 at 08:27
  • Yes, I think what you just described is exactly what I am looking for. – Pascal Mar 06 '14 at 14:12
  • I think you need to formalise that and start a new question! – Spacedman Mar 06 '14 at 14:21
  • http://stackoverflow.com/questions/22228946/choose-n-most-evenly-spread-points-across-point-dataset-in-r – Pascal Mar 06 '14 at 15:33
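One reading of the objective described in the comments above (charge each unselected point its distance to the nearest selected point, and minimise the total) is the k-medoids criterion, which pam from the cluster package optimises. A minimal sketch, not part of the original thread:

library(cluster)             # pam() = partitioning around medoids
m3s <- pam(xy, 20)$medoids   # 20 actual data points chosen as medoids
sum(dist(m3s))               # compare with m1 and m2 above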

Following @Spacedman's suggestion, I have written some code that repeatedly drops one point of the closest pair until the desired number of points remains. It seems to work well; however, it slows down pretty quickly as the number of points grows.

xy <- cbind(rnorm(1000), rnorm(1000))

n <- 20

subset <- xy

# distance matrix with the diagonal and upper triangle masked,
# so each pair of points appears exactly once
alldist <- as.matrix(dist(subset))
diag(alldist) <- NA
alldist[upper.tri(alldist)] <- NA

while (nrow(subset) > n) {
    # locate the closest remaining pair and drop one point of it
    closest <- which(alldist == min(alldist, na.rm = TRUE), arr.ind = TRUE)
    subset <- subset[-closest[1, 1], ]
    alldist <- alldist[-closest[1, 1], -closest[1, 1]]
}
Pascal
  • A better approach might be to drop the point that has the smallest row sum (over the full distance matrix) - that would be the point contributing most to the quantity we are trying to minimise... – Spacedman Mar 04 '14 at 10:02