
I am using foreach for parallel processing, which requires manually exporting functions and variables (as a list of names) to the environments of the worker processes. I want to automate this and cover all use cases. That is easy for simple functions that use only their own local variables. Complications arise, however, as soon as the functions to be run in parallel use variables and functions defined in another environment. Consider the following case:

global.variable <- 3

global.function <-function(j){
  res <- j^2
  return(res)
}

compute.in.parallel <-function(i){
  res <- global.function(i+global.variable)
  return(res)
}

pop <- seq(10)

do <- function(pop,fun){
  require(doParallel)
  require(foreach)
  cl <- makeCluster(16)
  registerDoParallel(cl)
  clusterExport(cl,list("global.variable","global.function"),envir=globalenv())
  results <- foreach(i=pop) %dopar% fun(i)
  stopCluster(cl)
  return(results)
}

do(pop,compute.in.parallel)

This works because I also manually export global.variable and global.function to the workers (note that compute.in.parallel itself is automatically in scope): clusterExport(cl,list("global.variable","global.function"),envir=globalenv())

However, I want to do this automatically, which requires building a character vector of all variables and functions that are used (but not defined, passed, or contained) within compute.in.parallel. How do I do this?

My current workaround is to dump all available variables to the workers:

clusterExport(cl,as.list(unique(c(ls(.GlobalEnv),ls(environment())))),envir=environment())

This is unsatisfactory, however: it does not cover variables in package namespaces and other hidden environments, and it generally passes far too many variables to the workers, creating significant overhead with every parallel run.

Any suggested improvements?

user3641187

2 Answers

Just pass all arguments that are needed in do(), rather than using global variables.

compute.in.parallel <- function(i, global.variable, global.function) {
  global.function(i + global.variable)
}

do <- function(pop, fun, ncores = parallel::detectCores() - 1, ...) {
  require(foreach)
  cl <- parallel::makeCluster(ncores)
  on.exit(parallel::stopCluster(cl), add = TRUE)
  doParallel::registerDoParallel(cl)
  foreach(i = pop) %dopar% fun(i, ...)
}

do(seq(10), compute.in.parallel, 
   global.variable = 3, 
   global.function = function(j) j^2)
F. Privé
  • Thanks for your answer - however, the idea is to keep the "do"-function as generic as possible. It will be called in various different settings, possibly also by less experienced coders (I realize that many occasional users are not aware of variable scoping issues). Therefore I would like to have an automated approach. – user3641187 Nov 09 '17 at 10:18
  • This `do()` function is more generic than yours. – F. Privé Nov 09 '17 at 10:38
  • Apologies if I don't follow correctly, but your do() function requires the explicit parametrization of any out-of-scope variables. Assume we have a new "compute.in.parallel" which suddenly relies on 2 external variables. Then I'd have to rewrite do() to account for the additional global variables -> do(pop,fun,global.var.1, global.var.2, global.fun...) . A procedure which correctly identifies any out-of-scope variables inside "fun" and passes them automatically to the clusters could handle any combination and number of external elements inside fun() without the need for additional referencing. – user3641187 Nov 09 '17 at 12:27
  • Hence more generic. Wouldn't you agree..? – user3641187 Nov 09 '17 at 12:27
  • What you want to do is just a bad hack. See my edit. – F. Privé Nov 09 '17 at 13:01
  • I see what you did there, and agree that the `...` construct is indeed more generalized. However, your suggestion still runs into problems if `global.function` contained dependencies that would not be explicitly passed as parameters in any standard coding situation, for example if it used some commonly used function from an external package loaded in the global environment. I would have to explicitly pass this commonly used function as a parameter which is not really intuitive: – user3641187 Nov 10 '17 at 10:13

The future framework automatically identifies and exports globals by default. The doFuture package provides a generic future backend adaptor for foreach. If you use that, the following works:

do <- function(pop, fun) {
  library("doFuture")
  registerDoFuture()
  cl <- parallel::makeCluster(2)
  old_plan <- plan(cluster, workers = cl)
  on.exit({
    plan(old_plan)
    parallel::stopCluster(cl)
  })

  foreach(i = pop) %dopar% fun(i)
}
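
To illustrate what "identifying globals" involves, here is a minimal sketch using `codetools::findGlobals()` (codetools ships with standard R installations). It is not how the future framework is implemented internally; the helper name `find_user_globals` is hypothetical, and the sketch only looks one level deep, so globals used inside `global.function` itself would not be found.

```r
library(codetools)

# Objects from the question, defined in the calling environment
global.variable <- 3
global.function <- function(j) j^2
compute.in.parallel <- function(i) global.function(i + global.variable)

# Hypothetical helper: names used but not defined inside `fun`,
# restricted to those that live directly in `envir`
# (by default the environment in which `fun` was defined).
find_user_globals <- function(fun, envir = environment(fun)) {
  found <- codetools::findGlobals(fun, merge = FALSE)
  candidates <- unique(c(found$functions, found$variables))
  # Drop base/package names such as "+" that are not defined in `envir`
  Filter(function(nm) exists(nm, envir = envir, inherits = FALSE), candidates)
}

find_user_globals(compute.in.parallel)
```

Under these assumptions, the resulting character vector could be passed to `clusterExport(cl, find_user_globals(fun), envir = environment(fun))` in place of a hand-written list of names.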
HenrikB