
Here's a simplified version of the code I'm working on:

# Randomly populated matrix as example/toy data.

adata <- sample(1:25,2500,replace=T)
xdim <- 50
ydim <- 50
dim(adata) <- c(xdim,ydim)

# All possible categories
cats <- 1:25

# Retrieve the coordinates of the members of each class
# These subsets are used to calculate the minimum distances.

subA <- vector("list", length(cats))

for(i in 1:length(cats)){
  subA[[i]] <- which(adata==cats[i],arr.ind=T)
}


# Lookup table of Euclidean distances: eucdist[dx+1, dy+1] is the distance for row/column offsets dx and dy
eucdist <- matrix(nrow=xdim,ncol=ydim)
eucdist <- sqrt(((row(eucdist)-1)^2)+((col(eucdist)-1)^2))

# Create an array containing the x,y coordinates of each cell (in pixels). Used to calculate distances.

xycoor <- array(dim=c(xdim,ydim,2))
xycoor[,,1] <- row(xycoor[,,1])
xycoor[,,2] <- col(xycoor[,,2])


# For each cell, find the minimum distance to a cell belonging to each category

mindistA <- array(dim=c(xdim,ydim,length(cats)))

for(i in 1:length(cats)){
  # For class i: the minimum distance from each cell of the map to the nearest cell of that class,
  # found by looking up eucdist at the absolute row/column offsets

  mindistA[,,i] <- apply(xycoor, 1:2,
                         function(x) min(eucdist[abs(sweep(subA[[i]], 2, x, "-")) + 1], na.rm=T))
}

The aim of this is to:

  • create a list containing, for each class contained in cats, the row and column numbers of the cells corresponding to that class
  • for each cell of my matrix, calculate the distance to the closest cell corresponding to each class.

This works well with my toy data, but I'm now running it on much larger datasets (matrices of roughly 1000*2000 elements instead of the 50*50 in the example here), and it is very slow; it would probably take days to complete. Based on this question, I suspect that accessing elements of eucdist using an expression as the index is what is slowing down the execution.

A possible solution would be to 1) create an array of dimensions length(subA[[i]]) * 50 * 50 * 2, containing the absolute differences between the X and Y coordinates (respectively) of each grid cell and each cell corresponding to class i, and 2) use the contents of this array to look up the desired values in 'eucdist' (possibly using rownames and colnames, as suggested in the comments to the question linked above). However, I'm afraid that with my real data this would generate arrays that are too large for R to handle.
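For illustration, a rough sketch of this idea for a single class on the toy 50*50 grid could look like the following (n_i, dx, dy, d and mindist_i are just placeholder names, and I haven't tested this on the full-sized data):

# Sketch of the lookup-array idea for one class i on the toy grid
i <- 1
n_i <- nrow(subA[[i]])
# Absolute row/column offsets between every class-i cell and every grid cell
dx <- abs(outer(subA[[i]][,1], as.vector(xycoor[,,1]), "-"))   # n_i x (xdim*ydim)
dy <- abs(outer(subA[[i]][,2], as.vector(xycoor[,,2]), "-"))
# Look the offsets up in eucdist and take the minimum over all class-i cells
d <- matrix(eucdist[cbind(as.vector(dx)+1, as.vector(dy)+1)], nrow=n_i)
mindist_i <- matrix(apply(d, 2, min), nrow=xdim, ncol=ydim)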

Does anyone have a suggestion for speeding this up? Any advice would be greatly appreciated!

Thanks!

PS: I previously tried calculating the Euclidean distances directly from within the apply function using rdist from the fields package, but this was also very slow, which is why I decided to use a lookup table of Euclidean distances instead.
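For reference, that earlier attempt looked roughly like this (reconstructed from memory, so treat it as a sketch rather than the exact code):

# Sketch of the earlier per-cell fields::rdist approach
library(fields)
for(i in 1:length(cats)){
  mindistA[,,i] <- apply(xycoor, 1:2,
                         function(x) min(rdist(matrix(x, nrow=1), subA[[i]])))
}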

m.chips
  • If you need help on improving working code, http://codereview.stackexchange.com is the right place. – Sven Hohenstein Jan 22 '14 at 15:04
  • I'd look around for a Euclidean distance function in `R` packages (use, e.g. `sos::???`). You also could investigate `parallel::mclapply` to do multicore operations. – Carl Witthoft Jan 22 '14 at 15:48
  • Any approach that relies on looking up the distance to all points in a category inherently needs to compute (or look up) the distance between all pairs of points. In your 1000 x 2000 example, you're talking about 4 trillion lookups, which will never be fast. There are data structures for speeding up nearest-neighbor computations -- I would check out k-d trees and quadtrees. – josliber Jan 22 '14 at 16:40
  • @SvenHohenstein Thanks for the tip, that's good to know, I wasn't aware of that page. – m.chips Jan 23 '14 at 10:15
  • @CarlWitthoft With the number of cores that I have I doubt that parallelizing the job would make a big difference, but thanks. – m.chips Jan 23 '14 at 10:19
  • @josilber Looks like I was indeed overly optimistic with my approach. I'll have a look at those data structures, thanks. – m.chips Jan 23 '14 at 10:20

0 Answers