
I've written a short 'for' loop to find the minimum Euclidean distance between each row in a dataframe and all the other rows (and to record which row is closest). In theory this avoids the errors associated with trying to calculate distance measures for very large matrices. However, although not much is being held in memory, it is very slow for large matrices (my use case of ~150K rows is still running).

I'm wondering whether anyone can advise or point me in the right direction in terms of vectorising my function, using apply or similar. Apologies for what may seem a simple question, but I'm still struggling to think in a vectorised way.

Thanks in advance (and for your patience).

require(proxy)

df <- data.frame(matrix(runif(10 * 10), nrow = 10, ncol = 10),
                 row.names = paste("site", 1:10))

min.dist <- function(df) {
  # data frame for results
  all.min.dist <- data.frame()
  # loop over rows
  for (k in 1:nrow(df)) {
    # calculate dissimilarity between row k and all other rows
    df.dist <- dist(df[k, ], df[-k, ])
    # find minimum distance
    min.dist <- min(df.dist)
    # get rowname for minimum distance (id of nearest point)
    closest.row <- row.names(df)[-k][which.min(df.dist)]
    # combine outputs
    all.min.dist <- rbind(all.min.dist,
                          data.frame(orig_row = row.names(df)[k],
                                     dist = min.dist,
                                     closest_row = closest.row))
  }
  # return results
  return(all.min.dist)
}

# example
min.dist(df)
nickb
  • I'm not qualified to comment on vectorization, but you'll get some benefit by finding the minimum of the squares of distances, then taking the square root only on return. – Mike Woolf May 10 '13 at 02:26
  • have you checked http://stackoverflow.com/questions/3029639/calculating-all-distances-between-one-point-and-a-group-of-points-efficiently-in?rq=1 ? – Ferdinand.kraft May 10 '13 at 02:29
  • `all.min.dist <- rbind(all.min.dist, ...)` inside your loop is very bad, as it is creating a larger object at every iteration. Read about *pre-allocating* (see the sketch after these comments). – flodel May 10 '13 at 02:33
  • Thanks for the suggestion re pre-allocating. Have read up on that and also see how it's incorporated into the answer from @flodel – nickb May 10 '13 at 04:26
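
To make the two suggestions above concrete (pre-allocating the result instead of growing it with `rbind`, and working with squared distances until the very end), here is a minimal sketch of the original loop restructured along those lines; the name `min.dist.prealloc` is only for illustration:

min.dist.prealloc <- function(df) {
  m <- as.matrix(df)
  n <- nrow(m)
  # pre-allocate the result vectors once, instead of growing a data frame
  dist_sq     <- numeric(n)
  closest_row <- character(n)
  for (k in 1:n) {
    # squared Euclidean distances from row k to every other row
    d2 <- rowSums(sweep(m[-k, , drop = FALSE], 2, m[k, ])^2)
    dist_sq[k]     <- min(d2)
    closest_row[k] <- row.names(df)[-k][which.min(d2)]
  }
  # take the square root only once, at the end
  data.frame(orig_row = row.names(df), dist = sqrt(dist_sq),
             closest_row = closest_row)
}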

2 Answers


This should be a good start. It uses fast matrix operations and avoids the growing object construct, both suggested in the comments.

min.dist <- function(df) {

  which.closest <- function(k, mat) {
    # squared Euclidean distances from column k (site k) to every other column
    d <- colSums((mat[, -k] - mat[, k]) ^ 2)
    m <- which.min(d)
    data.frame(orig_row    = colnames(mat)[k],
               dist        = sqrt(d[m]),
               closest_row = colnames(mat)[-k][m])
  }

  # transpose so each site is a column; the column names then carry the site ids
  do.call(rbind, lapply(1:nrow(df), which.closest, t(as.matrix(df))))
}

If this is still too slow, a possible improvement is to compute the distances for a block of points at a time instead of a single one. The block size will need to be a compromise between speed and memory usage.
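
For illustration, here is one way that blocked computation might look (a sketch only, not benchmarked; the name `min.dist.chunked` and the `chunk` argument are made up for the example, and `proxy::dist` is used for the block-versus-all cross-distances):

min.dist.chunked <- function(df, chunk = 1000) {
  m <- as.matrix(df)
  n <- nrow(m)
  res <- vector("list", ceiling(n / chunk))
  for (i in seq_along(res)) {
    rows <- ((i - 1) * chunk + 1):min(i * chunk, n)
    # cross-distances between this block of rows and all rows: chunk x n
    d <- as.matrix(proxy::dist(m[rows, , drop = FALSE], m))
    # blank out each row's distance to itself so it is never the minimum
    d[cbind(seq_along(rows), rows)] <- Inf
    j <- apply(d, 1, which.min)
    res[[i]] <- data.frame(orig_row    = row.names(df)[rows],
                           dist        = d[cbind(seq_along(rows), j)],
                           closest_row = row.names(df)[j])
  }
  do.call(rbind, res)
}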

Edit: Also read https://stackoverflow.com/a/16670220/1201032

flodel
  • Thanks flodel, this picks up on both the vectorisation and the fact that calculating Euclidean distances directly is also straightforward. It is still fairly slow so I might explore computing multiple points at a time. – nickb May 10 '13 at 06:25

Usually, built-in functions are faster than coding it yourself (because they are coded in Fortran or C/C++ and optimised).

It seems that the function dist {stats} answers your question spot on:

Description: This function computes and returns the distance matrix computed by using the specified distance measure to compute the distances between the rows of a data matrix.
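
For a small data frame this is straightforward; a minimal sketch (calling `stats::dist` explicitly, since the question loads proxy, which masks `dist`):

# full n x n distance matrix; fine for small n, but see the comment below
d <- as.matrix(stats::dist(df))
diag(d) <- Inf   # ignore each row's zero distance to itself
nearest <- data.frame(orig_row    = row.names(df),
                      dist        = apply(d, 1, min),
                      closest_row = row.names(df)[apply(d, 1, which.min)])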

Guillaume
  • OP said his use case is a matrix with 150k rows. `dist` would try to return a full matrix of 150k-by-150k and choke... Also the OP does not care for that much data, he only wants the closest point for each point. Hence the one-row-at-a-time approach which is much more memory-efficient. – flodel May 10 '13 at 03:21
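
A rough back-of-the-envelope check of that memory point (assuming 8-byte doubles):

n <- 150000
n * (n - 1) / 2 * 8 / 2^30   # lower-triangle dist object: ~84 GiB
n^2 * 8 / 2^30               # full n x n matrix: ~168 GiB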