
I have R code with two nested optimisations: an outer and an inner function. The outer function passes certain parameters to the inner function, which optimises over another set of parameters. These estimates are then sent back to the outer function, which minimises an objective function based on the parameters estimated in the inner function. The outer estimates are then passed to the inner function, which finds the new optimal set of inner parameters and passes them back to the outer function. These loops repeat until the objective function in the outer loop is minimised.

The code works by storing the inner parameters in global variables, so that after the optimisation step in the outer loop, the code can pass these global variables back to the inner loop.

I would like to run this procedure on different datasets in parallel. I understand that I cannot rely on global variables in parallel, so I was thinking of saving text files with different file names at each loop: I would write the parameter values to a file at the end of the outer loop and read them back in at the beginning of the next outer iteration. However, is there a more efficient way to do this? I don't think that using a plain list would work. Thank you.
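For instance, here is a rough sketch of the file-based idea I have in mind (the file-naming scheme and dataset_id are made up, one file per dataset/process):

# Hypothetical id of the dataset/process this R session works on
dataset_id <- 1
param_file <- sprintf("params_dataset_%d.rds", dataset_id)

# At the start of the outer objective: reload the last inner solution
beta0 <- if (file.exists(param_file)) readRDS(param_file) else runif(1)

# At the end of the outer objective: save the new inner solution
saveRDS(beta0, param_file)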

Example:

require(nloptr)
y = rnorm(100)
x = runif(100)*5

inner <- function(beta) mean((y-beta*x)^2)

outer <- function(alpha) {

  if (!exists("storage") | is.null(storage$solution))
    beta <- runif(1)
  else
    beta <- storage$solution

  sol.inner <- nloptr(
    x0     = beta,
    eval_f = inner,
    opts   = list(
      algorithm = "NLOPT_LN_BOBYQA",
      ftol_rel  = 1.e-6,
      ftol_abs  = 1.e-7,
      xtol_rel  = 1.e-6,
      xtol_abs  = 0,
      maxeval   = 1000000
    )
  )

  # Store the latest inner solution in the global variable
  storage <<- sol.inner
  beta    <- sol.inner$solution

  mean(x^2 - alpha * x + beta)^2

}

alpha0    <- runif(1)
storage   <- c()

sol.outer <- nloptr(
  x0     = alpha0,
  eval_f = outer,
  opts   = list(
    algorithm ="NLOPT_LN_BOBYQA",
    ftol_rel  = 1.e-6,
    ftol_abs  = 1.e-7,
    xtol_rel  = 1.e-6,
    xtol_abs  = 0,
    maxeval   = 1000000
  )
) 

sol.outer
  • It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick Nov 20 '19 at 18:18
  • Are the outer and inner functions optimizing over the same parameters? From the post it appears that outer = f(a,b,c,d,e) and inner = f(x,y,z) such that at least one of a,b,c,d,e is a function of (x,y,z)? – Gautam Nov 20 '19 at 19:12
  • @Gautam, yes you are right! The code will converge to the true a,b,c,d,e and x,y,z values. I'll try to work out an example – Andrew Nov 20 '19 at 19:56
  • @MrFlick - I added an example. – Andrew Nov 20 '19 at 20:51

1 Answer


While very neat, I don't recommend using the <<- operator in general. If you want to modify objects within a function so that they remain available once the function has exited, I recommend using environments instead.
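As a quick illustration of the idea (the object names here are just for the example):

# An environment is passed by reference, so a function can update its
# contents without resorting to <<-
counter <- new.env()
counter$n <- 0

bump <- function(e) {
  e$n <- e$n + 1
  invisible(e$n)
}

bump(counter)
bump(counter)
counter$n
#> [1] 2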

The thing with parallel processing, as implemented in the parallel package, is that each thread/offspring/child runs in its own R session, so the sessions do not interact with each other. In that scenario, you can do pretty much whatever you want within each offspring process. Here is an example of what you are trying to do:


# Simulating 4 random datasets
set.seed(131)
datasets <- replicate(4, {
  list(
    y = rnorm(100),
    x = runif(100)*5
  )
}, simplify = FALSE)


inner <- function(beta, x, y) mean((y-beta*x)^2)

outer <- function(alpha, storage, x, y) {

  if (!length(storage$solution))
    beta <- runif(1)
  else
    # Take the first value, which is the latest to be
    # stored (see below)
    beta <- storage$solution[[1]]

  sol.inner <- nloptr(
    x0     = beta,
    eval_f = inner,
    opts   = list(
      algorithm = "NLOPT_LN_BOBYQA",
      ftol_rel  = 1.e-6,
      ftol_abs  = 1.e-7,
      xtol_rel  = 1.e-6,
      xtol_abs  = 0,
      maxeval   = 1000000
    ),
    y = y,
    x = x
  ) 

  # We can append the latest beta as a list
  storage$solution <- c(list(sol.inner$solution), storage$solution)
  beta    <-  sol.inner$solution

  mean(x^2 - alpha * x + beta)^2

}

# Parallel solution with PSOCKcluster --------------------
library(parallel)

# Setting up the cluster object
cl <- makePSOCKcluster(4)

# We need to export the objects we plan to use within
# each session; this includes loading the needed packages
clusterExport(cl, c("outer", "inner"))
invisible(clusterEvalQ(cl, library(nloptr)))
invisible({
  clusterEvalQ(cl, {
    # Be careful about random numbers in parallel!
    # This example is not reproducible right now
    alpha0    <- runif(1)

    # This should be an environment, which is easier to handle
    storage   <- new.env() 
  })
})
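# (Not part of the original example.) If you need reproducibility,
# one option is to give each worker its own RNG stream, e.g.:
# clusterSetRNGStream(cl, iseed = 131)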
# You can send data to the offspring sessions and 
# these will be evaluated in separate R sessions 
ans <- parLapply(cl, datasets, function(d) {

  # Making the variables available to the program
  y <- d$y
  x <- d$x

  sol.outer <- nloptr(
    x0     = alpha0,
    eval_f = outer,
    opts   = list(
      algorithm ="NLOPT_LN_BOBYQA",
      ftol_rel  = 1.e-6,
      ftol_abs  = 1.e-7,
      xtol_rel  = 1.e-6,
      xtol_abs  = 0,
      maxeval   = 1000000
    ),
    x = d$x,
    y = d$y,
    # Passing the environment as an extra
    #  argument to the function
    storage = storage
  ) 

  list(
    sol     = sol.outer,
    storage = storage
  )

})

# Stopping the R sessions
stopCluster(cl)

# Checking out the storage vectors
lapply(ans, function(x) unlist(x$storage$solution))
#> [[1]]
#>  [1] -0.04112901 -0.04112901 -0.04112901 -0.04112901 -0.04112901
#>  [6] -0.04112901 -0.04112901 -0.04112901 -0.04112901 -0.04112901
#> [11] -0.04112901 -0.04112901 -0.04112901 -0.04112901 -0.04112901
#> [16] -0.04112901
#> 
#> [[2]]
#>  [1] -0.06877397 -0.06877397 -0.06877397 -0.06877397 -0.06877397
#>  [6] -0.06877397 -0.06877397 -0.06877397 -0.06877397 -0.06877397
#> [11] -0.06877397 -0.06877397 -0.06877397 -0.06877397 -0.06877397
#> [16] -0.06877397 -0.06877397
#> 
#> [[3]]
#>  [1] 0.004505708 0.004505708 0.004505708 0.004505708 0.004505708
#>  [6] 0.004505708 0.004505708 0.004505708 0.004505708 0.004505708
#> [11] 0.004505708 0.004505708 0.004505708 0.004505708 0.004505708
#> [16] 0.004505708 0.004505708
#> 
#> [[4]]
#>  [1] -0.02001445 -0.02001445 -0.02001445 -0.02001445 -0.02001445
#>  [6] -0.02001445 -0.02001445 -0.02001445 -0.02001445 -0.02001445
#> [11] -0.02001445 -0.02001445 -0.02001445 -0.02001445 -0.02001445
#> [16] -0.02001445

Created on 2019-11-20 by the reprex package (v0.3.0)

One thing to notice here is that I modified your functions so that arguments are passed explicitly, so we do not have to deal with scoping in this case. This is usually safer, and recent versions of R are smart enough to avoid copying objects when they are passed to functions.

One last bit to point out: if your datasets are large, it is better to load them within the offspring sessions to avoid duplicating memory (in general this is not a problem if you use makeForkCluster, but that is only available on Unix-based systems).
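For example, here is a minimal sketch of that pattern, assuming each dataset sits in its own .rds file (the paths are hypothetical) and reusing the cluster object from above:

# Hypothetical file paths, one dataset per file
paths <- sprintf("dataset_%d.rds", 1:4)

ans <- parLapply(cl, paths, function(p) {
  # The data is read inside the child session, so the master process
  # never holds all of the datasets at once
  d <- readRDS(p)
  # ... run the nested optimisation on d$x and d$y as above ...
  length(d$y)
})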

  • Thank you, your solution looks really good! Do you have a reference to `makeForkCluster` and `makePSOCKcluster`? Can I use them with unix LSF scheduler? Another question: what do you exactly mean by "This example is not reproducible right now"? – Andrew Nov 21 '19 at 09:16
  • I usually prefer Fork clusters. Certainly you should be able to use it on a Unix system. I mean that the example is not reproducible because I did not set a seed for the sessions. You should read more about it in the parallel package. – gvegayon Nov 21 '19 at 15:16
  • I've been reading on fork and socket clusters. Is there a particular reason why you set your solution up as a socket cluster? Another question I have is that, on my real data, the code only works if `cl` has the same number of cores as the number of processes (4 in the datasets you created). If # cores < # processes the console displays: `Error in checkForRemoteErrors: one node produced an error: non-conformable arguments`. Would you have any idea why that is? I can't replicate the error, but it's as if the processes starting after the first `cl` processes don't find the data. Thank you so much! – Andrew Nov 24 '19 at 14:33
  • Not sure about the second problem. Are you sure you are passing a list of data.frame objects? Check by running something like `parLapply(cl, datasets, class)` to make sure the child sessions are receiving a df. I use a socket cluster in the example b/c fork clusters don't work on every OS. – gvegayon Nov 24 '19 at 18:55