I'm trying to figure out how to pass functions and packages to the boot() function when running parallel computations. It seems very expensive to load a package or define functions inside a loop. The foreach() function that I often use for other parallel tasks has .packages and .export arguments that handle this nicely (see this SO question), but I can't figure out how to do the same with the boot package.
Below is a meaningless example that shows what happens when switching to parallel:
library(boot)
myMean <- function(x) mean(x)
meaninglessTest <- function(x, i){
return(myMean(x[i]))
}
x <- runif(1000)
bootTest <- function(){
out <- boot(data=x, statistic=meaninglessTest, R=10000, parallel="snow", ncpus=4)
return(boot.ci(out, type="perc"))
}
bootTest()
As expected, this complains that it can't find myMean.
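One possible workaround, sketched here under assumptions rather than offered as a confirmed answer: boot() accepts a cl argument for a pre-built cluster when parallel="snow", so the helper function could be exported and packages loaded on the workers once, before the call. The worker count of 2, R=1000, and loading stats as a stand-in package are arbitrary choices for illustration.

```r
library(boot)
library(parallel)

myMean <- function(x) mean(x)
meaninglessTest <- function(x, i) myMean(x[i])

set.seed(10)
x <- runif(1000)

# Assumption: 2 local workers; adjust to the machine at hand.
cl <- makeCluster(2)

# Copy the helper into each worker's global environment (roughly what
# foreach's .export does).
clusterExport(cl, "myMean", envir = environment())

# Load packages once per worker (roughly what foreach's .packages does).
# 'stats' is only a placeholder for whatever the statistic really needs.
invisible(clusterEvalQ(cl, library(stats)))

# Hand the prepared cluster to boot() instead of letting it create its own.
out <- boot(data = x, statistic = meaninglessTest, R = 1000,
            parallel = "snow", ncpus = 2, cl = cl)
stopCluster(cl)

ci <- boot.ci(out, type = "perc")
```

Because the cluster is created outside boot(), the export and package loading happen once per worker rather than once per replicate.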
Sidenote: This example runs slower in parallel than on one core, probably because splitting such a simple task over multiple cores costs more time than the task itself. Is there a reason why the default behavior isn't to split the work into even job batches of R/ncpus?
Update on the sidenote: As Steve Weston noted, the parLapply that boot() uses actually splits the job into even batches/chunks. The function is a neat wrapper for clusterApply:
docall(c, clusterApply(cl, splitList(x, length(cl)), lapply,
fun, ...))
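To see the chunking in action, here is a small sketch using the parallel package, where splitIndices() plays the role of the index splitting and base do.call() stands in for docall(). The input vector, the squaring function, and the worker count of 2 are made up for illustration.

```r
library(parallel)

x <- seq_len(10)
cl <- makeCluster(2)  # assumption: 2 local workers

# Split the input into one contiguous chunk per worker.
chunks <- lapply(splitIndices(length(x), length(cl)), function(i) x[i])

# Each worker runs lapply() over its own chunk; the per-worker lists are
# then concatenated back into a single list.
res_manual <- do.call(c, clusterApply(cl, chunks, lapply, function(v) v^2))

# The same computation through the wrapper.
res_parLapply <- parLapply(cl, x, function(v) v^2)

stopCluster(cl)

same_result <- identical(res_manual, res_parLapply)
```

With this scheme each worker receives a single task per call, so the number of round trips depends on the number of workers and not on the number of elements.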
I'm a little surprised that this doesn't perform better when I scale up the number of repetitions:
> library(boot)
> set.seed(10)
> x <- runif(1000)
>
> Reps <- 10^4
> start_time <- Sys.time()
> res <- boot(data=x, statistic=function(x, i) mean(x[i]),
+ R=Reps, parallel="no")
> Sys.time()-start_time
Time difference of 0.52335 secs
>
> start_time <- Sys.time()
> res <- boot(data=x, statistic=function(x, i) mean(x[i]),
+ R=Reps, parallel="snow", ncpus=4)
> Sys.time()-start_time
Time difference of 3.539357 secs
>
> Reps <- 10^5
> start_time <- Sys.time()
> res <- boot(data=x, statistic=function(x, i) mean(x[i]),
+ R=Reps, parallel="no")
> Sys.time()-start_time
Time difference of 5.749831 secs
>
> start_time <- Sys.time()
> res <- boot(data=x, statistic=function(x, i) mean(x[i]),
+ R=Reps, parallel="snow", ncpus=4)
> Sys.time()-start_time
Time difference of 23.06837 secs
I hope that this is only due to the very simple mean function and that more complex cases behave better. I must admit that I find it a little disturbing: the cluster initialization time should be the same in the 10,000 and 100,000 cases, yet the absolute time difference increases, and the 4-core version takes roughly 4 to 7 times longer than the single-core run. I guess this must be an effect of the list merging, as I can't find any other explanation for it.
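One way to check whether cluster start-up explains the gap would be to create the cluster before starting the clock and pass it in through boot()'s cl argument. The following is only a sketch of that measurement; the worker count of 2 is an assumption, and no particular timings are implied.

```r
library(boot)
library(parallel)

set.seed(10)
x <- runif(1000)
stat <- function(x, i) mean(x[i])
Reps <- 10^4

# Start the workers outside the timed region so that start-up cost is
# excluded from the comparison.
cl <- makeCluster(2)  # assumption: 2 local workers

t_serial <- system.time(
  boot(data = x, statistic = stat, R = Reps, parallel = "no")
)
t_snow <- system.time(
  boot(data = x, statistic = stat, R = Reps,
       parallel = "snow", ncpus = 2, cl = cl)
)
stopCluster(cl)

# Elapsed wall-clock seconds for each variant.
elapsed <- c(serial = t_serial[["elapsed"]], snow = t_snow[["elapsed"]])
print(elapsed)
```

If the snow run is still much slower with start-up excluded, the overhead would have to come from elsewhere, such as shipping data to the workers or collecting the results.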