sapply, mclapply, or nested loop? Objective: fastest processing time

Question

Hello and thank you all for looking at my question.

My goal is to find the fastest way to copy distance values from a small symmetric data frame (dist.data) into a large symmetric data frame (final.data). The row and column names of dist.data identify spatial locations, while the row and column names of final.data identify individual observations. Some observations share a location, which is why the two data frames have different dimensions. I am considering sapply, mclapply, and a nested for loop, but I am open to all suggestions.

I got the sapply and nested for loop versions to work and found that the nested loop was about twice as fast. However, I could not get the mclapply version to work.

#preliminary set up for reproducible example
set.seed(41)

# final data (a matrix here); filled in by the nested for loop
final.data <- matrix(NA, nrow = 100, ncol = 100)
rownames(final.data) <- 1:100
colnames(final.data) <- rownames(final.data)


# make a symmetric 100 x 100 distance matrix
# choose(100, 2) = 4950 values, one for each cell below the diagonal
dist.data <- matrix(0, nrow = 100, ncol = 100)
dist.data[lower.tri(dist.data)] <- seq(from = 1, to = choose(100, 2), by = 1)
dist.data <- t(dist.data)
dist.data[lower.tri(dist.data)] <- seq(from = 1, to = choose(100, 2), by = 1)
rownames(dist.data) <- 1:100
colnames(dist.data) <- rownames(dist.data)


# spatial ID of each person; duplicates are allowed
spat.ID.test <- sample(1:100, 100, replace = TRUE)

using sapply

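# look up the distance between the locations of two observations
dummy <- function(row, column) {
  dist.data[spat.ID.test[row], spat.ID.test[column]]
}
ptm <- proc.time()
# each outer sapply call returns one column of the result; because the
# result is symmetric, naming the outer index "row" does not change it
final.data <- as.data.frame(sapply(1:100, function(row) sapply(1:100, function(column) dummy(row, column))))
proc.time() - ptm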

using mclapply

library(parallel)  # provides detectCores() and mclapply()

numCores <- detectCores()
dummy <- function(row, column) {
  dist.data[spat.ID.test[row], spat.ID.test[column]]
}
ptm <- proc.time()
final.data <- as.data.frame(mclapply(1:100, function(row) mclapply(1:100, function(column) dummy(row, column), mc.cores = numCores), mc.cores = numCores))
proc.time() - ptm
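
One likely reason the nested mclapply call does not give the expected result is that it returns a list of lists, which as.data.frame does not reshape into a 100 x 100 table; forking at both levels also requests more workers than there are cores. The following is a minimal sketch, not a definitive fix, that parallelises only the outer loop and binds the rows afterwards. The names `n`, `rows`, and `final.par`, the compact data setup, and the choice of 2 cores are assumptions made for illustration.

```r
library(parallel)

set.seed(41)
n <- 100

# Toy symmetric distance matrix and spatial IDs (compact stand-in for the setup above)
dist.data <- matrix(0, n, n)
dist.data[lower.tri(dist.data)] <- seq_len(choose(n, 2))
dist.data <- dist.data + t(dist.data)
spat.ID.test <- sample(1:n, n, replace = TRUE)

# mclapply relies on forking, which is unavailable on Windows, so use 1 core there.
# The value 2 is an assumed placeholder; tune it to the machine.
numCores <- if (.Platform$OS.type == "windows") 1L else 2L

# Parallelise the outer loop only: each worker returns one full row as a numeric vector
rows <- mclapply(seq_len(n), function(row) {
  dist.data[spat.ID.test[row], spat.ID.test]
}, mc.cores = numCores)

# Stack the per-row vectors into an n x n matrix
final.par <- do.call(rbind, rows)
```

For a problem this small, the cost of starting worker processes may well outweigh the work itself, so it is worth timing this against the serial versions before adopting it.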

using a nested for loop

ptm <- proc.time()
for (row in 1:100) {
  for (column in 1:100) {
    # 270 is the column for spatialID
    # spatialID, in df.full, of the row's observation (max of 7079, i.e. the number of unique spatialIDs)
    y1 <- spat.ID.test[row]
    # spatialID of the column's observation
    x1 <- spat.ID.test[column]
    final.data[row, column] <- dist.data[y1, x1]
  }
}
proc.time() - ptm
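
The nested loop looks up `dist.data[y1, x1]` for every pair of observations, and R's matrix indexing can do the same lookup in one call when given the whole ID vector for both rows and columns. A minimal sketch under the same toy setup (the names `n`, `loop.out`, and `final.vec` are assumptions for illustration):

```r
set.seed(41)
n <- 100

# Toy symmetric distance matrix and spatial IDs (compact stand-in for the setup above)
dist.data <- matrix(0, n, n)
dist.data[lower.tri(dist.data)] <- seq_len(choose(n, 2))
dist.data <- dist.data + t(dist.data)
spat.ID.test <- sample(1:n, n, replace = TRUE)

# Reference result from the nested loop
loop.out <- matrix(NA_real_, n, n)
for (row in seq_len(n)) {
  for (column in seq_len(n)) {
    loop.out[row, column] <- dist.data[spat.ID.test[row], spat.ID.test[column]]
  }
}

# Same lookup as a single vectorised indexing call:
# element [i, j] is dist.data[spat.ID.test[i], spat.ID.test[j]]
final.vec <- dist.data[spat.ID.test, spat.ID.test]

# Both approaches should agree
stopifnot(isTRUE(all.equal(loop.out, final.vec)))
```

This indexes by position. If the spatial IDs are labels rather than positions, indexing with `as.character(spat.ID.test)` would match against the row and column names instead.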

Thank you!!

Note: since the output will also be a symmetric matrix, it is possible to compute only the lower (or upper) triangle and then mirror it onto the other triangle. To do this I set the upper limit of column to row. However, I am not sure of the best way to transpose it.
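
One way to do the mirroring described in the note is to fill the lower triangle (including the diagonal) and then copy it across with `upper.tri()` and `t()`. This is a sketch under the same toy setup, with `n` and `final.sym` as assumed names:

```r
set.seed(41)
n <- 100

# Toy symmetric distance matrix and spatial IDs (compact stand-in for the setup above)
dist.data <- matrix(0, n, n)
dist.data[lower.tri(dist.data)] <- seq_len(choose(n, 2))
dist.data <- dist.data + t(dist.data)
spat.ID.test <- sample(1:n, n, replace = TRUE)

# Fill only the lower triangle and the diagonal: column runs from 1 to row
final.sym <- matrix(0, n, n)
for (row in seq_len(n)) {
  for (column in seq_len(row)) {
    final.sym[row, column] <- dist.data[spat.ID.test[row], spat.ID.test[column]]
  }
}

# Mirror the lower triangle into the upper triangle.
# t(final.sym) carries the lower-triangle values in its upper triangle, and
# upper.tri() selects the same positions, in the same order, on both sides.
final.sym[upper.tri(final.sym)] <- t(final.sym)[upper.tri(final.sym)]

stopifnot(isSymmetric(final.sym))
```

This roughly halves the number of loop iterations, at the cost of one extra transpose of the full matrix.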

If you're looking for the fastest processing time, nested loop using `Rcpp` is probably unbeatable, see https://stackoverflow.com/a/62841712/13513328 — Waldi, Sep 05 '20 at 21:21
Not really sure about the unbeatable ( nested loop in C and .Call() is probably faster). But I'd also suggest the Rcpp solution, as this is quite convenient. But just do some tests with "microbenchmark" to find a winner :) — Steffen Moritz, Sep 06 '20 at 01:06
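
Following the suggestion to benchmark, here is a minimal timing sketch. It wraps the loop and sapply versions in two hypothetical helper functions (`loop_fill` and `sapply_fill`, names assumed for illustration) and uses the microbenchmark package only if it is installed, falling back to base R's `system.time()` otherwise. The repetition count of 20 is an arbitrary choice.

```r
set.seed(41)
n <- 100

# Toy symmetric distance matrix and spatial IDs (compact stand-in for the setup above)
dist.data <- matrix(0, n, n)
dist.data[lower.tri(dist.data)] <- seq_len(choose(n, 2))
dist.data <- dist.data + t(dist.data)
spat.ID.test <- sample(1:n, n, replace = TRUE)

# Nested for loop writing into a preallocated matrix
loop_fill <- function() {
  out <- matrix(NA_real_, n, n)
  for (row in seq_len(n)) {
    for (column in seq_len(n)) {
      out[row, column] <- dist.data[spat.ID.test[row], spat.ID.test[column]]
    }
  }
  out
}

# Nested sapply: the outer call builds one column of the result at a time
sapply_fill <- function() {
  sapply(seq_len(n), function(column) {
    sapply(seq_len(n), function(row) {
      dist.data[spat.ID.test[row], spat.ID.test[column]]
    })
  })
}

if (requireNamespace("microbenchmark", quietly = TRUE)) {
  # Repeats each expression 20 times and summarises the timing distribution
  timings <- microbenchmark::microbenchmark(
    loop = loop_fill(),
    sapply = sapply_fill(),
    times = 20L
  )
  print(timings)
} else {
  # Base R fallback: total elapsed time over 20 repetitions of each version
  print(system.time(for (i in 1:20) loop_fill()))
  print(system.time(for (i in 1:20) sapply_fill()))
}
```

Timings depend on the machine and on the matrix size, so the comparison is best rerun at the real problem dimensions.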
