
I'm working with a large matrix (1.5 GB and growing). One particular function takes a lot of time and seems like a good candidate for parallelization with the foreach package. I'm running this with registerDoParallel(cores=4) on Ubuntu with 4 cores and 8 GB of RAM. My understanding is that with foreach, a copy of the big matrix is made for each of the 4 processes, and indeed memory usage quickly reaches 100%. I read another post suggesting bigmemory and attach.big.matrix() so the processes can share the same matrix; I definitely have enough RAM to hold one copy in memory. But when I do this, the time taken to execute actually increased:

    user   system  elapsed
9889.944  185.590 2670.001   - doParallel with 4 cores
8931.887   92.214 4526.306   - doParallel with 2 cores
9320.523  150.122 9473.165   - doParallel with 1 core
1314.037    6.236 1320.290   - serial execution, without foreach and without big.matrix
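
For context, here is a minimal, self-contained sketch of the sharing pattern I just described, on a toy 4x4 matrix (the variable names are made up for illustration): each worker is sent only the small descriptor and maps the shared memory with attach.big.matrix(), instead of receiving a full copy of the data.

library(bigmemory)
library(doParallel)   # also loads foreach

toy  <- as.big.matrix(matrix(rnorm(16), 4, 4))
desc <- describe(toy)                 # small descriptor, cheap to send to workers

registerDoParallel(cores = 2)
rowsums <- foreach(i = 1:4, .combine = c, .packages = "bigmemory") %dopar% {
  shared <- attach.big.matrix(desc)   # maps the same memory; no copy is made
  sum(shared[i, ])
}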

I was not able to come up with an explanation for this. Below is my code. I've tried a few other things, like sending a block of rows to each process (foreach seems to do something like that by default anyway); a sketch of that chunked variant follows the code below. Nothing got execution faster than serial. I do see all 4 cores at 100% when I set cores to 4. There does seem to be an improvement over 1 core when using big.matrix, but no improvement over serial execution, which never used big.matrix at all.

library(bigmemory)
library(doParallel)

calcQIDiffForRow <- function(row, Desc){
  mat <- attach.big.matrix(Desc)   # re-attach the shared matrix inside the worker
  x <- mat[row, ]
  for(j in seq_len(row - 1)) {     # seq_len() avoids the 1:0 trap when row == 1
    y <- mat[j, ]
    ...
  }
  return(val)                      # val is computed in the elided loop body
}

calcQIDiff <- function(mat){
  registerDoParallel(cores = 4)
  desc <- describe(mat)            # small descriptor; only this is sent to workers
  ret <- foreach(i = 1:nrow(mat), .combine = rbind, .multicombine = TRUE,
                 .noexport = c("mat")) %dopar% calcQIDiffForRow(i, desc)
  return(ret)
}

system.time(QIdiff.parallel <- calcQIDiff(as.big.matrix(bigmatrix)))
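
For completeness, this is roughly what the "send a block to each process" variant I tried looks like. It's only a sketch: calcQIDiffChunked is a name I made up here, and the per-row work stays elided just as in calcQIDiffForRow above. The idea is to split the rows into one chunk per worker, so each task attaches the matrix once per block and foreach only combines 4 results instead of nrow(mat):

calcQIDiffChunked <- function(mat, cores = 4) {
  registerDoParallel(cores = cores)
  desc   <- describe(mat)
  rows   <- seq_len(nrow(mat))
  chunks <- split(rows, cut(rows, cores, labels = FALSE))   # one block per worker
  foreach(block = chunks, .combine = rbind,
          .packages = "bigmemory", .noexport = "mat") %dopar% {
    m <- attach.big.matrix(desc)    # attach once per block, not once per row
    do.call(rbind, lapply(block, function(row) {
      x <- m[row, ]
      # ... same per-row work as in calcQIDiffForRow, elided here too
    }))
  }
}

system.time(QIdiff.chunked <- calcQIDiffChunked(as.big.matrix(bigmatrix)))

Even structured this way, I saw no improvement over serial execution.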
  • In general, if the parallelized task is relatively fast, parallelizing it will not give you any performance enhancement. – agstudy Jun 09 '14 at 20:15
  • And, as you see, it can often lead to performance losses. – Joshua Ulrich Jun 09 '14 at 21:16
  • I'm doing the same task with different values of i, and each is independent, which should generally make it a good parallel candidate. BTW, I tried the same code on a Windows box that had 24 GB of RAM, so I was able to let each process have its own copy of the matrix. Elapsed time came down to 412s. I don't know where I'm losing time with Ubuntu + bigmemory. – Cheeko Jun 09 '14 at 21:29
