
This question is related to this one, where I was asking how to replicate a user-defined function. Now I would like to parallelize the operations in order to save time. What I have done preliminarily is:

  1. I have defined a custom function my.fun(), which returns output, a matrix with 1000 rows and 20 columns.

  2. I replicate output, say, 5 times and store the results in a single matrix called final through: final <- do.call(rbind, replicate(5, my.fun(), simplify=FALSE)). Hence, in this example final is a 5000-row matrix.

What I would like to do now is to parallelize the 5 (or even more..) output replications before binding the results in the final matrix.
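For concreteness, a minimal self-contained version of the serial setup above (my.fun here is just a dummy stand-in for the real function):

```r
# dummy stand-in for the real my.fun: returns a 1000 x 20 matrix
my.fun <- function() matrix(rnorm(1000 * 20), nrow = 1000, ncol = 20)

# step 2: replicate 5 times and row-bind the results
final <- do.call(rbind, replicate(5, my.fun(), simplify = FALSE))
dim(final)  # [1] 5000   20
```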

How would you do that? What I have (wrongly) done so far is:

    library(snowfall)

    sfInit(parallel = TRUE, cpus = 4, type = "SOCK")

    # previously defined objects manipulated within my.fun
    sfExport(...)

    my.fun = function() {
       ...
       return(output)
    }

    final <- do.call(rbind, sfSapply(1:5, fun=my.fun(), simplify=FALSE))

    sfStop()

but it returns:

Error in get(as.character(FUN), mode = "function", envir = envir) : 
  object 'fun' of mode 'function' was not found

Any help would be greatly appreciated! Please consider that I do not necessarily want to use snowfall: the final goal is to parallelize the computation of final in an efficient way (in reality I have to make a lot of replications).
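One route that is not snowfall-specific would be the base parallel package. A sketch, again with a dummy my.fun (clusterExport would need to cover whatever objects the real my.fun manipulates):

```r
library(parallel)

# dummy my.fun; the index argument i is required by parLapply but unused here
my.fun <- function(i) matrix(rnorm(1000 * 20), nrow = 1000, ncol = 20)

cl <- makeCluster(4)          # 4 worker processes
clusterExport(cl, "my.fun")   # plus any other objects my.fun needs
final <- do.call(rbind, parLapply(cl, 1:5, my.fun))
stopCluster(cl)
```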


2 Answers


sfSapply expects fun to be a function, but you hand over the result of one call to my.fun. That is, you want to hand over my.fun, not my.fun().
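Applied to the snippet in the question, the corrected call would look roughly like this (my.fun is a dummy here; it takes a dummy index argument so that sfSapply has something to pass it):

```r
library(snowfall)
sfInit(parallel = TRUE, cpus = 4, type = "SOCK")

# dummy my.fun; the argument i is ignored but lets sfSapply pass each index
my.fun <- function(i) matrix(rnorm(1000 * 20), nrow = 1000, ncol = 20)
sfExport("my.fun")   # plus any other objects my.fun manipulates

# pass the function itself, not the result of calling it
final <- do.call(rbind, sfSapply(1:5, fun = my.fun, simplify = FALSE))
sfStop()
```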


I don't have any experience with parallel computing in R. I had to add a dummy argument to the function my.fun, otherwise sfSapply complains with this error:

 first error: unused argument(s) (X[[1]])

So I added x as an argument:

  my.fun <- function(x) matrix(1:4, 2,2)

Then I tried to benchmark the parallel solution against the plain sapply one:

  sfInit(parallel = TRUE, cpus = 4)
  library(rbenchmark)
  benchmark(
    pp   = sfSapply(1:20000, fun = my.fun, simplify = FALSE),
    nopp = sapply(1:20000, FUN = my.fun, simplify = FALSE))

The parallel solution is slower than the classic one!! I am really confused; maybe others more experienced with parallel computing in R can give us a logical explanation.

 test replications elapsed relative user.self sys.self user.child sys.child
2 nopp          100   15.22    1.000     13.90     0.02         NA        NA
1   pp          100   27.28    1.792     11.95     2.04         NA        NA
  • Parallelization always incurs some overhead. Only if the typical job executed in parallel takes a significant amount of time, say at least a few seconds, does parallel processing really provide an advantage. If the typical job takes milliseconds, then constantly launching jobs will incur so much overhead that the total processing time increases. Just add a few seconds of sleep to myfun to see the difference. – Paul Hiemstra Jan 28 '13 at 19:05
  • @agstudy the same error for me, I had to add the x argument too, thanks! – Stefano Lombardi Jan 28 '13 at 19:14
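Following the overhead comment above, a quick sketch to see the effect: with a sleep inside the function, each job is expensive enough that the per-job dispatch cost becomes negligible (timings are rough expectations, not measured output):

```r
library(snowfall)
sfInit(parallel = TRUE, cpus = 4)

# each call now takes ~1 s, so dispatch overhead no longer dominates
my.slow.fun <- function(x) { Sys.sleep(1); matrix(1:4, 2, 2) }
sfExport("my.slow.fun")

system.time(sfSapply(1:8, fun = my.slow.fun, simplify = FALSE))  # roughly 2 s on 4 cpus
system.time(sapply(1:8, FUN = my.slow.fun, simplify = FALSE))    # roughly 8 s
sfStop()
```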