
I'm doing optimization in R. My problem involves running nlm on an objective function which loops over a large list of data. I'd like to speed up the optimization by running the objective function in parallel. How should I go about doing that?

In the example below I set up a toy problem in which the parallelized solution is slower than the original. How do I modify the code to reduce overhead and speed up the parallelized version of my nlm call?

library(parallel)

## What is the right way to do optimization when the objective function is run in parallel?
## Don't want very_big_list to be copied more than necessary

set.seed(952)

my_objfn <- function(list_element, parameter) {
    return(sum((list_element - parameter) ^ 2))  # Simple example
}

apply_my_objfn_in_parallel <- function(parameter, very_big_list, max_cores=3) {
    cluster <- makeCluster(min(max_cores, detectCores() - 1))
    objfn_values <- parLapply(cluster, very_big_list, my_objfn, parameter=parameter)
    stopCluster(cluster)
    return(Reduce("+", objfn_values))
}

apply_my_objfn <- function(parameter, very_big_list) {
    objfn_values <- lapply(very_big_list, my_objfn, parameter=parameter)
    return(Reduce("+", objfn_values))
}

my_big_list <- replicate(2 * 10^6, sample(seq_len(100), size=5), simplify=FALSE)
parameter_guess <- 20
mean(c(my_big_list, recursive=TRUE))  # Should be close to 50
system.time(test_parallel <- nlm(apply_my_objfn_in_parallel, parameter_guess,
                                 very_big_list=my_big_list, print.level=0))  # 84.2 elapsed
system.time(test_regular <- nlm(apply_my_objfn, parameter_guess,
                                very_big_list=my_big_list, print.level=0))  # 63.6 elapsed

I ran this on my laptop (4 CPUs, so the cluster returned by makeCluster(min(max_cores, detectCores() - 1)) has 3 cores). In the last lines above, apply_my_objfn_in_parallel takes longer than apply_my_objfn. I think this is because (1) I only have 3 cores and (2) each time nlm calls the parallelized objective function, it sets up a new cluster and breaks up and copies all of my_big_list. That seems wasteful -- would I get better results if I somehow set up the cluster and copied the list only once per nlm call? If so, how do I do that?
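
For example, would something like the following help? The idea is to create the cluster once, export very_big_list to the workers once, and then ship only the scalar parameter (plus an index vector) on each evaluation. This is just a sketch I haven't benchmarked -- the names objfn_on_exported_element and apply_my_objfn_preexported are made up, and note that it keeps a full copy of the list in every worker's memory.

## Sketch: export the data to the workers once, then reuse it in every evaluation
objfn_on_exported_element <- function(i, parameter) {
    ## Runs on a worker; my_big_list and my_objfn were exported once up front
    my_objfn(my_big_list[[i]], parameter)
}

apply_my_objfn_preexported <- function(parameter, cluster, n) {
    ## Only the parameter and an index vector travel to the workers here
    objfn_values <- parLapply(cluster, seq_len(n), objfn_on_exported_element,
                              parameter=parameter)
    return(Reduce("+", objfn_values))
}

cluster <- makeCluster(min(3, detectCores() - 1))
clusterExport(cluster, c("my_big_list", "my_objfn"))  # one-time copy per worker
test_preexported <- nlm(apply_my_objfn_preexported, parameter_guess,
                        cluster=cluster, n=length(my_big_list), print.level=0)
stopCluster(cluster)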


Edit after Erwin's answer ("consider creating and stopping the cluster once instead of in each evaluation"):

## Modify function to use single cluster per nlm call
apply_my_objfn_in_parallel_single_cluster <- function(parameter, very_big_list, my_cluster) {
    objfn_values <- parLapply(my_cluster, very_big_list, my_objfn, parameter=parameter)
    return(Reduce("+", objfn_values))
}

run_nlm_single_cluster <- function(very_big_list, parameter_guess, max_cores=3) {
    cluster <- makeCluster(min(max_cores, detectCores() - 1))
    nlm_result <- nlm(apply_my_objfn_in_parallel_single_cluster, parameter_guess,
                      very_big_list=very_big_list, my_cluster=cluster, print.level=0)
    stopCluster(cluster)
    return(nlm_result)
}

system.time(test_parallel <- nlm(apply_my_objfn_in_parallel, parameter_guess,
                                 very_big_list=my_big_list, print.level=0))  # 49.0 elapsed
system.time(test_regular <- nlm(apply_my_objfn, parameter_guess,
                                very_big_list=my_big_list, print.level=0))  # 36.8 elapsed
system.time(test_single_cluster <- run_nlm_single_cluster(my_big_list,
                                                          parameter_guess))  # 38.4 elapsed

In addition to my laptop (elapsed times in the comments above), I ran the code on a server with 30 cores. There my elapsed times were 107 seconds for apply_my_objfn and 74 seconds for run_nlm_single_cluster. I'm surprised that both times were longer than on my puny little laptop, but it makes sense that the single-cluster parallel optimization beats the regular non-parallel version when you have more cores.


Another edit, for completeness (see comments under Erwin's answer): here is a non-parallel solution using analytical gradients. Surprisingly, it is slower than with numerical gradients.

## Add gradients
my_objfn_value_and_gradient <- function(list_element, parameter) {
    return(c(sum((list_element - parameter) ^ 2), -2*sum(list_element - parameter)))
}

apply_my_objfn_with_gradient <- function(parameter, very_big_list) {
    ## Returns objfn value with gradient attribute, see ?nlm
    objfn_values_and_grads <- lapply(very_big_list, my_objfn_value_and_gradient, parameter=parameter)
    objfn_value_and_grad <- Reduce("+", objfn_values_and_grads)
    stopifnot(length(objfn_value_and_grad) == 2)  # First is objfn value, second is gradient
    objfn_value <- objfn_value_and_grad[1]
    attr(objfn_value, "gradient") <- objfn_value_and_grad[2]
    return(objfn_value)
}

system.time(test_regular <- nlm(apply_my_objfn, parameter_guess,
                                very_big_list=my_big_list, print.level=0))  # 37.4 elapsed
system.time(test_regular_grad <- nlm(apply_my_objfn_with_gradient, parameter_guess,
                                     very_big_list=my_big_list, print.level=0,
                                     check.analyticals=FALSE))  # 45.0 elapsed
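
One cheap diagnostic (only a sketch, not a conclusive explanation) is to time a single pass over the list with each objective. With a single parameter, nlm only needs a couple of extra evaluations per iteration for the finite-difference gradient, so if the value-plus-gradient pass costs much more than the plain pass, the analytical version can lose overall:

## Time one full pass over the list with each objective (diagnostic only)
system.time(invisible(lapply(my_big_list, my_objfn, parameter=parameter_guess)))
system.time(invisible(lapply(my_big_list, my_objfn_value_and_gradient,
                             parameter=parameter_guess)))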

I'd be curious to know what's going on here. That said, my question is still: how can I speed up this sort of optimization problem using parallelization?

Adrian
  • I think what I'm looking for is related to the comments in https://stackoverflow.com/questions/6689937/r-problem-with-foreach-dopar-inside-function-called-by-optim -- particularly wanting to " load chunks of data individually in each node" – Adrian Oct 09 '16 at 19:12
  • https://github.com/hadley/multidplyr/blob/master/vignettes/multidplyr.md looks relevant, I'll have a look at that – Adrian Oct 09 '16 at 19:12
  • The "Example: Cost of data movement" slide at http://www.labs.hpe.com/research/systems-research/R-workshop/Indrajit-talk5.pdf is what I had in mind when I wrote "don't want very_big_list to be copied more than necessary" in my example code above – Adrian Oct 09 '16 at 19:40
  • Maybe https://github.com/vertica/DistributedR is what I'm looking for – Adrian Oct 09 '16 at 19:42
  • The following answer could help: https://stackoverflow.com/questions/3757321/moving-beyond-rs-optim-function/50163077#50163077 – Nairolf May 03 '18 at 20:36

1 Answer


It looks to me like there is too much overhead in the parallel function evaluation to make it worthwhile. Consider creating and stopping the cluster once instead of in each evaluation. Also, I believe you are not providing gradients, so the solver will likely fall back on finite differences, which can lead to a large number of function evaluations. You may want to consider providing gradients.
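
For instance, a rough way to see the cost of the missing gradient is to count objective evaluations on a small subset of the data (just a sketch; the counter, wrapper name and subset size below are arbitrary):

## Count how many times nlm evaluates the objective without a gradient
eval_count <- 0
counted_objfn <- function(parameter, very_big_list) {
    eval_count <<- eval_count + 1
    apply_my_objfn(parameter, very_big_list)
}
small_list <- my_big_list[1:10000]
res <- nlm(counted_objfn, parameter_guess, very_big_list=small_list)
c(iterations = res$iterations, evaluations = eval_count)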

Erwin Kalvelagen
  • Thank you (+1). I added an edit where I create and stop the cluster once per nlm call as you suggested -- it does speed things up, but the non-parallel version is still faster. I agree with you that providing gradients would probably help, but I'm interested in the performance of the parallelized version relative to the non-parallel version. Are there any other obvious ways to improve the parallelized version? Even with `apply_my_objfn_in_parallel_single_cluster` I think the data in `very_big_list` is copied more than necessary. Is that correct? – Adrian Oct 09 '16 at 12:25
  • I believe that `parLapply` shuffles a lot of data around, and if you just make one cheap pass over the data in the parallel step the total overhead may be prohibitive. One way to overcome this is to use some shared memory construct, like in package `bigmemory`. – Erwin Kalvelagen Oct 09 '16 at 19:54
  • the fourth slide of http://www.labs.hpe.com/research/systems-research/R-workshop/Indrajit-talk5.pdf looks relevant. Is there a way to (1) partition `my_big_list` into N pieces (where N = number of cores on my machine), (2) send a piece to each worker (once, and then the workers have a permanent reference to their piece of data), and (3) have the workers compute the objective function value on their piece of the list each time `nlm` considers a new parameter value? – Adrian Oct 13 '16 at 09:51
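
A rough sketch of that chunk-per-worker idea, under the assumption that the data fits comfortably in memory once split (the names chunk_objfn_sum, apply_my_objfn_chunked and local_chunk are made up, and I have not benchmarked this): split the list once, push one chunk to each worker, and then ship only the scalar parameter on each evaluation.

library(parallel)

n_workers <- min(3, detectCores() - 1)
cluster <- makeCluster(n_workers)

## Send each worker its chunk once; the chunk is stored on the worker as local_chunk
chunks <- split(my_big_list, cut(seq_along(my_big_list), n_workers, labels=FALSE))
invisible(clusterApply(cluster, chunks, function(chunk) {
    assign("local_chunk", chunk, envir=.GlobalEnv)
    NULL  # avoid shipping the chunk back to the master
}))
clusterExport(cluster, "my_objfn")

chunk_objfn_sum <- function(parameter) {
    ## Runs on a worker against its own local_chunk
    sum(vapply(local_chunk, my_objfn, numeric(1), parameter=parameter))
}

apply_my_objfn_chunked <- function(parameter, cluster) {
    ## Only the scalar parameter travels to the workers on each call
    partial_sums <- clusterCall(cluster, chunk_objfn_sum, parameter)
    return(Reduce("+", partial_sums))
}

test_chunked <- nlm(apply_my_objfn_chunked, parameter_guess, cluster=cluster)
stopCluster(cluster)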