I've been working on a simple collection of functions for my supervisor that will do some simple initial genome scale stats, that is easy to do to give my team a quick indication as to future analyses which may make more time - for example RDP4 or BioC (just to explain why I haven't just gone straight to BioConductor). I'd like to speed some things up to allow larger contig sizes so I've decided to use doParallel and foreach to edit some for loops to allow this. Below is one simple function which identifies bases in some sequences (stored as a matrix) which are identical and removes them.
strip.invar <- function(x) {
cat("
Now removing invariant sites from DNA data matrix, this may take some time...
")
prog <- txtProgressBar(min=0, max=ncol(x), style=3)
removals<-c()
for(i in 1:ncol(x)){
setTxtProgressBar(prog, i)
if(length(unique(x[,i])) == 1) {
removals <- append(removals, i)
}
}
newDnaMatrix <- x[,-removals]
return(newDnaMatrix)
}
After reading the introduction to doParallel and foreach I tried to make a version to accommodate for more cores - on my mac this is 8 - a quad core with two threads per core - 8 virtual cores:
strip.invar <- function(x, coresnum=detectCores()){
cat("
Now removing invariant sites from DNA data matrix, this may take some time...
")
prog <- txtProgressBar(min=0, max=ncol(x), style=3)
removals<-c()
if(coresnum > 1) {
cl <- makeCluster(coresnum)
registerDoParallel(cl)
foreach(i=1:ncol(x)) %dopar% {
setTxtProgressBar(prog, i)
if(all(x[,i] == x[[1,i]])){
removals <- append(removals, i)
}
}
} else {
for(i in 1:ncol(x)){
setTxtProgressBar(prog, i)
if(length(unique(x[,i])) == 1) {
removals <- append(removals, i)
}
}
}
newDnaMatrix <- x[,-removals]
return(newDnaMatrix)
}
However if I run this and have the number of cores set to 8 I'm not entirely sure it works - I can't see the progress bar doing anything but then I've heard that printing to screen and stuff involving graphic devices is tricky with parallel computing in R. But it still seems to take some time and my laptop get's 'very' hot so I'm not sure if I've done this correctly, I've tried after seeing a few examples (I successfully ran a nice bootstrap example in the vignette), but I'm bound to hit learning bumps. As an aside, I thought I'd also ask people's opinion, what is the best speed up for R code bottlenecks where loops or apply is involved - parallelising it, or Rcpp?
Thanks.