
I am creating a package that I hope to eventually put onto CRAN. I have coded much of the package in C++ with the help of Rcpp and would now like to enable parallelization of this C++ code. I am using the foreach package; however, I am open to switching to snow or a different library if that would work better.

I started by trying to parallelize a simple function:

#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;


// [[Rcpp::export]]
arma::vec rNorm_c(int length) {
  return arma::vec(length, arma::fill::randn);
}


/*** R

n_workers <- parallel::detectCores(logical = FALSE)
cl <- parallel::makeCluster(n_workers)
doParallel::registerDoParallel(cl)

n <- 10
library(foreach)

foreach(j = rep(n, n), 
        .noexport = c("rNorm_c"), 
        .packages = "Rcpp") %dopar% {rNorm_c(j)}
*/

I added the `.noexport` argument because without it, I get the error `Error in { : task 1 failed - "NULL value passed as symbol address"`. This led me to this SO post, which suggested doing so.

However, I now receive the error `Error in { : task 1 failed - "could not find function "rNorm_c""`, presumably because I have not followed the top answer's instructions to load the function separately on each node. I am unsure of how to do this.

This SO post demonstrates how to do this by writing the C++ code inline; however, since the C++ code for my package spans multiple functions, this is likely not the best solution. Another SO post advises creating a local package for the workers to load and call into. Since I am hoping to make this code available in a CRAN package, a local package does not seem possible unless I wanted to attempt to publish two CRAN packages.
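If I understand the package-based approach correctly, once the compiled function lives in an installed package, each worker could load it via foreach's `.packages` argument instead of exporting it. A sketch of what I have in mind (using the hypothetical package name `mypkg`):

```r
library(foreach)

cl <- parallel::makeCluster(parallel::detectCores(logical = FALSE))
doParallel::registerDoParallel(cl)

n <- 10
# .packages attaches the package on each worker, so the compiled
# rNorm_c is registered in every worker process; .noexport is then
# no longer needed, since nothing is serialized from the master
res <- foreach(j = rep(n, n), .packages = "mypkg") %dopar% {
  rNorm_c(j)
}

parallel::stopCluster(cl)
```

Is this the intended pattern, or is there a better way to make the compiled code visible to the workers?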

Any suggestions for how to approach this or references to resources for parallelization of Rcpp code would be appreciated.

EDIT:

I used the above function to create a package called rnormParallelization. In this package, I also included a couple of R functions, one of which uses the snow package to parallelize the loop over rNorm_c calls:

rNorm_samples_for <- function(num_samples, length){
  sample_mat <- matrix(NA, length, num_samples)
  for (j in 1:num_samples){
    sample_mat[ , j] <- rNorm_c(length)
  }
  return(sample_mat)
}

rNorm_samples_snow1 <- function(num_samples, length){
  clus <- snow::makeCluster(3)
  # The workers need the compiled function; exporting it by its
  # unqualified name works, whereas exporting
  # "rnormParallelization::rNorm_c" does not
  snow::clusterExport(clus, "rNorm_c")
  out <- snow::parSapply(clus, rep(length, num_samples), rNorm_c)
  snow::stopCluster(clus)
  return(out)
}
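In case the slowdown reported below comes from per-task dispatch overhead (each of the `num_samples` tasks is tiny), I also considered a chunked variant that sends one batch of samples to each worker, so the scheduling cost is paid once per worker rather than once per sample. This is only a sketch; `rNorm_samples_snow_chunked` is a hypothetical name and I have not benchmarked it:

```r
rNorm_samples_snow_chunked <- function(num_samples, length, n_workers = 3){
  clus <- snow::makeCluster(n_workers)
  snow::clusterExport(clus, "rNorm_c")
  # Split the sample indices into one chunk per worker
  chunks <- snow::clusterSplit(clus, seq_len(num_samples))
  # Each worker draws all of its samples in a single task
  out <- snow::parLapply(clus, chunks, function(idx, len){
    sapply(idx, function(i) rNorm_c(len))
  }, len = length)
  snow::stopCluster(clus)
  do.call(cbind, out)
}
```

Would this kind of chunking be the standard way to amortize the overhead, or is there a built-in mechanism for it?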

Both functions work as expected:

> rNorm_samples_for(2, 3)
            [,1]       [,2]
[1,] -0.82040308 -0.3284849
[2,] -0.05169948  1.7402912
[3,]  0.32073516  0.5439799

> rNorm_samples_snow1(2, 3)
            [,1]       [,2]
[1,] -0.07483493  1.3028315
[2,]  1.28361663 -0.4360829
[3,]  1.09040771 -0.6469646

However, the parallelized version runs considerably slower:

> microbenchmark::microbenchmark(
+   rnormParallelization::rNorm_samples_for(1e3, 1e4),
+   rnormParallelization::rNorm_samples_snow1(1e3, 1e4)
+ )
Unit: milliseconds
                                                   expr       min        lq
   rnormParallelization::rNorm_samples_for(1000, 10000)  217.0871  249.3977
 rnormParallelization::rNorm_samples_snow1(1000, 10000) 1242.8315 1397.7643
      mean    median        uq       max neval
  320.5456  285.9787  325.3447  802.7488   100
 1527.0406 1482.5867 1563.0916 3411.5774   100

Here is my session info:

> sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] rnormParallelization_1.0

loaded via a namespace (and not attached):
[1] microbenchmark_1.4-7 compiler_4.1.1       snow_0.4-4          
[4] parallel_4.1.1       tools_4.1.1          Rcpp_1.0.7   

GitHub repo with both of these scripts

Jacob Helwig
  • I think you may have gotten carried away over interpretation of the word "local" -- in this context it just means "any odd R package, be it from CRAN or not". The key point is that parallel workers are in different _processes_ or even _on different machines_ so they need the code. Easiest way to give it to them in a reliable and predictable manner is via a package. – Dirk Eddelbuettel Oct 31 '21 at 19:32
  • Have both in the same package, and refer to it using `::rNorm_c(j)`. – F. Privé Oct 31 '21 at 21:04
  • @F.Privé I am not entirely sure where you are suggesting to do this, but I believe in my updates above, I have incorporated this idea - however, `snow::clusterExport(clus, "rnormParallelization::rNorm_c")` did not work, and so I omitted the package name. Any thoughts you might have as to the cause of the decrease in speed are appreciated. – Jacob Helwig Oct 31 '21 at 23:57
  • @DirkEddelbuettel Thanks, Dirk. I believe that I have attempted what you are suggesting, as can be seen in my update above, and although my code is now functioning, I observed a drastic decrease in speed compared to the non-parallelized version- any thoughts? Do I perhaps need a more complex use case to observe benefits of parallelization? – Jacob Helwig Nov 01 '21 at 00:06
  • Try it with `parallel` package, which might have better data splitting/less overhead. Alternatively, explicitly split your data into 3 chunks (one per thread) and send one chunk to each thread whole. – thc Nov 12 '21 at 21:22
