
I've been working on a small collection of functions for my supervisor that compute some simple initial genome-scale stats: quick checks that give my team an early indication of which more time-consuming analyses to pursue next, for example RDP4 or BioC (which is why I haven't just gone straight to BioConductor). I'd like to speed some things up to handle larger contig sizes, so I've decided to use doParallel and foreach to rewrite some of my for loops. Below is one simple function which identifies sites in a set of sequences (stored as a matrix) where all bases are identical, and removes them.

strip.invar <- function(x) {
  cat("\nNow removing invariant sites from DNA data matrix, this may take some time...\n")
  prog <- txtProgressBar(min = 0, max = ncol(x), style = 3)
  removals <- c()
  for (i in 1:ncol(x)) {
    setTxtProgressBar(prog, i)
    if (length(unique(x[, i])) == 1) {
      removals <- append(removals, i)
    }
  }
  newDnaMatrix <- x[, -removals]
  return(newDnaMatrix)
}

After reading the introductions to doParallel and foreach, I tried to write a version that can use more cores; on my Mac that is 8: a quad core with two threads per core, giving 8 virtual cores:

strip.invar <- function(x, coresnum = detectCores()) {
  cat("\nNow removing invariant sites from DNA data matrix, this may take some time...\n")
  prog <- txtProgressBar(min = 0, max = ncol(x), style = 3)
  removals <- c()
  if (coresnum > 1) {
    cl <- makeCluster(coresnum)
    registerDoParallel(cl)
    foreach(i = 1:ncol(x)) %dopar% {
      setTxtProgressBar(prog, i)
      if (all(x[, i] == x[[1, i]])) {
        removals <- append(removals, i)
      }
    }
  } else {
    for (i in 1:ncol(x)) {
      setTxtProgressBar(prog, i)
      if (length(unique(x[, i])) == 1) {
        removals <- append(removals, i)
      }
    }
  }
  newDnaMatrix <- x[, -removals]
  return(newDnaMatrix)
}

However, if I run this with the number of cores set to 8, I'm not entirely sure it works - I can't see the progress bar doing anything, though I've heard that printing to screen and anything involving graphics devices is tricky with parallel computing in R. It still seems to take some time and my laptop gets very hot, so I'm not sure whether I've done this correctly. I wrote it after seeing a few examples (I successfully ran a nice bootstrap example in the vignette), but I'm bound to hit learning bumps. As an aside, I'd also like to ask people's opinion: what is the best speed-up for R code bottlenecks where loops or apply are involved - parallelising, or Rcpp?

Thanks.

SJWard
  • What about the Activity Monitor on your Mac? That can show you the processes that are running and the amount of usage each CPU is giving. How long does it take your non-parallelised code to run? – Davy Kavanagh Mar 02 '13 at 13:16
  • 1
    Hey, I have been doing some more `foreach` coding. And I have learnt that you can redirect the output of each slave node back to the master node by using `makeCluster( (coresnum-1) , outfile = "" )`. Beware, if you have say 8 salve nodes, you will get 8 copies of output, one from each node! I use (coresnum-1) so that you take account of the fact that the master node needs one of your cores. For debugging purposes you can try doing this with 1 salve node at first. HTH. – Simon O'Hanlon Mar 07 '13 at 16:00
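A minimal sketch of the redirection trick described in the comment above (the foreach body here is purely illustrative):

library(doParallel)

# outfile = "" sends each worker's console output back to the master's console;
# using one fewer worker than cores leaves a core free for the master process.
cl <- makeCluster(detectCores() - 1, outfile = "")
registerDoParallel(cl)

res <- foreach(i = 1:4) %dopar% {
  cat("message from worker handling iteration", i, "\n")
  i^2
}

stopCluster(cl)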

2 Answers


Firstly, try running `cl <- makeCluster( coresnum - 1 )`. The master R process is already using one of your cores to dispatch and receive results from the slave jobs, so you have 7 free cores for the slave jobs. Otherwise, I think you will effectively be queuing one of your foreach loops to wait until one of the previous loops finishes, and the job will therefore take longer to complete.

Secondly, what you would normally see on the console when running this function in a non-parallel environment is still printed, it's just that each job's output goes to its slave process's console, so you won't see it. You can, however, save the output from the different foreach loops to a text file to examine them. Here is an example of how to save console output; stick that code inside your foreach statement.
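As a hedged sketch of that idea (the log file names and loop body are illustrative, not from the linked example):

library(doParallel)

cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)

res <- foreach(i = 1:8) %dopar% {
  # Divert this iteration's console output to its own log file
  sink(paste0("foreach_log_", i, ".txt"))
  cat("processing iteration", i, "\n")
  sink()  # restore normal output before returning the result
  i
}

stopCluster(cl)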

Your laptop will get very hot because all of your cores are working at 100% capacity while you are running this job.

I have found the foreach package to provide an excellent set of functions for simple parallel processing. Rcpp may (will?!) give you much greater performance, but how comfortable are you writing C++ code, what is the runtime of this function, and how often will it be used? I always think about these things first.

Simon O'Hanlon
  • Thanks for your reply and the explanation. To answer your question about how I am with C++ code - I basically came to R as an undergrad biologist to use it for stats, then began treating R more like a language as we needed to develop our own stuff: simple functions, but also a quite complex individual-based model project I'm developing. I am in the process of learning C++ and I'm fairly comfortable with what I've covered so far. If possible I want to get both these functions (easier) and the simulation (harder) to run on a normal computer. The simulation at the minute needs a cluster. – SJWard Mar 02 '13 at 14:08
  • @Axolotl9250 A pleasure. The questions I posed were more rhetorical in the sense that I think that these are things you need to think about before you decide how to proceed. As Roland pointed out in his answer which has now disappeared (?!) you should also think about how you can vectorise your code first to make it more efficient. `for` loops are generally vilified by the R community for being inefficient. Try one of the `apply` functions or `[` subsetting instead if you can. – Simon O'Hanlon Mar 02 '13 at 14:46
  • I realise I'm approaching this having already made my best/first shot at solving a problem with code, and now coming back to it trying to speed it up - especially with the simulation. This is my first serious coding project beyond loading in some csv files from experiments and doing stats etc. We never thought the simulation project would evolve as it has - adaptable to many different questions. Some stuff I knew how to vectorise or do as simply as possible straight away, but with growth came performance costs - there are two main bottlenecks now. I digress though - thank you for the advice! – SJWard Mar 02 '13 at 19:50

My other answer was not correct, since the column mean being equal to the first value is not sufficient as a test for the number of unique values. So here is another answer:

You can optimize the loop by using apply.

set.seed(42)
dat <- matrix(sample(1:5,1e5,replace=TRUE),nrow=2)
res1 <- strip.invar(dat)


strip.invar2 <- function(dat) {
  # Keep only the columns with more than one distinct value
  ix <- apply(dat, 2, function(x) length(unique(x)) > 1)
  dat[, ix]
}

res2 <- strip.invar2(dat)

all.equal(res1,res2)
#TRUE
library(microbenchmark)
microbenchmark(strip.invar(dat),strip.invar2(dat),times=10)
#Unit: milliseconds
#             expr       min        lq    median       uq      max neval
#strip.invar(dat)  2514.7995 2529.2827 2547.6751 2678.464 2792.405    10
#strip.invar2(dat)  933.3701  945.5689  974.7564 1008.589 1018.400    10

This improves performance quite a bit, though not as much as you could achieve if full vectorization were possible.

Parallelization won't give better performance here, since each iteration does not require much computation on its own, so the parallelization overhead will actually increase the time needed. However, you could split the data and process chunks in parallel, as sketched below.
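A rough sketch of that chunked approach (assuming doParallel is loaded; splitIndices comes from the parallel package, and strip.invar.par is an illustrative name):

strip.invar.par <- function(dat, ncores = detectCores() - 1) {
  cl <- makeCluster(ncores)
  registerDoParallel(cl)
  on.exit(stopCluster(cl))
  # One contiguous block of column indices per worker
  chunks <- parallel::splitIndices(ncol(dat), ncores)
  keep <- foreach(ix = chunks, .combine = c) %dopar% {
    # Each worker flags the variable columns in its own block
    apply(dat[, ix, drop = FALSE], 2, function(col) length(unique(col)) > 1)
  }
  dat[, keep, drop = FALSE]
}

Note the per-chunk test is the same apply() call as in strip.invar2, so the chunking only pays off once the matrix is large enough to amortise the cluster start-up cost.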

Roland
  • I see, I was thinking of ways to vectorise this and I kept thinking apply was the way to go. Playing about and asking mailing lists, I got this: `newDnaMatrix <- dnaMatrix[,apply(dnaMatrix,2,function(x) any(c(FALSE,x[-length(x)]!=x[-1])))]`, which to my understanding subsets the data by checking whether all values in a column are the same: it compares the column minus its last element against the column minus its first element. If all values are identical, the comparisons are all FALSE and the column is dropped; if some differ, there will be some TRUEs and the column is kept. – SJWard Mar 02 '13 at 17:44
  • Running the above line on a matrix of 3 x 398508 long sequences, the system time is now down to `user system elapsed 6.222 0.014 6.234`; my original loop clocked `user system elapsed 565.041 70.534 657.006`. – SJWard Mar 02 '13 at 17:47
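Unpacked for readability, the comment's one-liner amounts to the following (a sketch reusing the comment's own dnaMatrix name; is.variable and strip.invar3 are illustrative names):

# A column is variable if any adjacent pair of entries differs;
# comparing the vector against itself shifted by one avoids unique().
is.variable <- function(col) any(col[-length(col)] != col[-1])

strip.invar3 <- function(dnaMatrix) {
  dnaMatrix[, apply(dnaMatrix, 2, is.variable), drop = FALSE]
}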