
I have an Rcpp function that outputs a large matrix which I want to save as an R object. My idea was to speed things up by running the Rcpp function in parallel with the package foreach.

For the same matrix size, using foreach takes more than five times as long on my Windows machine as just running the function without foreach (excluding the setup of the workers). I am aware of the issues surrounding the execution of very small tasks in parallel (e.g. Why is the parallel package slower than just using apply?). I am also willing to set aside the theoretical problems with running random number generators in parallel, namely that the results might no longer be truly random.

As my subtasks should be large enough, the Rcpp function I wrote apparently does not work well in parallel, but I do not know why. Is using an RNG in an Rcpp function simply a task that cannot be parallelized? Apart from that: is there an optimal i, and with it an optimal ncol (here n_bootstrap), for my submatrices in foreach? Any help is much appreciated. Please also feel free to comment on the code in general.

Clarification: I compiled a package and use mypackage::funC within foreach.

Here is some example code in R:

y <- funC(n_bootstrap = 250, n_obs_censusdata = 300000,
          locationeffects = as.numeric(1:200), 
          residuals = as.numeric(1:20000),
          X = matrix(as.numeric(1:3000000), ncol = 10), 
          beta_sample = matrix(as.numeric(1:2500), ncol = 250))

in parallel:

no_cores <- parallel::detectCores() - 2
cl <- parallel::makeCluster(no_cores)
doParallel::registerDoParallel(cl)

library(foreach)  # %dopar% is provided by the foreach package

y <- foreach(i = 1:5, .combine = "cbind") %dopar% {
  funC(n_bootstrap = 50,
       n_obs_censusdata = 300000, locationeffects = as.numeric(1:200),
       residuals = as.numeric(1:20000),
       X = matrix(as.numeric(1:3000000), ncol = 10),
       beta_sample = matrix(as.numeric(1:2500), ncol = 250))
}
parallel::stopCluster(cl)
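
added: with doRNG (see the comments) — a minimal sketch only. registerDoRNG seeds R's RNG reproducibly across the workers, but it has no effect on the C++ <random> engine inside funC, so funC would first have to be rewritten to draw through R's RNG for this to matter:

cl <- parallel::makeCluster(no_cores)
doParallel::registerDoParallel(cl)
doRNG::registerDoRNG(123)  # %dopar% loops now use reproducible L'Ecuyer streams

y <- foreach(i = 1:5, .combine = "cbind") %dopar% {
  # hypothetical: only meaningful once funC draws from R's RNG,
  # not from the C++ <random> engine it currently uses
  funC(n_bootstrap = 50,
       n_obs_censusdata = 300000, locationeffects = as.numeric(1:200),
       residuals = as.numeric(1:20000),
       X = matrix(as.numeric(1:3000000), ncol = 10),
       beta_sample = matrix(as.numeric(1:2500), ncol = 250))
}
parallel::stopCluster(cl)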

added: with bigstatsr

# the FBM is shared between the workers: each worker writes its block of
# columns directly into y, so no matrices have to be combined afterwards
y <- bigstatsr::FBM(nrow = 300000, ncol = 250, type = "double")
bigstatsr::big_apply(y, a.FUN = function(y, ind, fun) {
  y[, ind] <- fun(n_bootstrap = length(ind),
                  n_obs_censusdata = 300000,
                  locationeffects = as.numeric(1:200),
                  residuals = as.numeric(1:20000),
                  X = matrix(as.numeric(1:3000000), ncol = 10),
                  beta_sample = matrix(as.numeric(1:2500), ncol = 250))
  NULL
}, a.combine = 'c', ncores = bigstatsr::nb_cores(), fun = funC)

Here is the Rcpp code:

// -*- mode: C++; c-indent-level: 4; c-basic-offset: 4; indent-tabs-mode: nil; -*-

#include <RcppEigen.h>
#include <random>

using namespace Rcpp;
// [[Rcpp::depends(RcppEigen)]]
// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::export]]
SEXP funC(const int n_bootstrap,
          const int n_obs_censusdata, 
          const Eigen::Map<Eigen::VectorXd> locationeffects, 
          const Eigen::Map<Eigen::VectorXd> residuals,
          const Eigen::Map<Eigen::MatrixXd> X, 
          const Eigen::Map<Eigen::MatrixXd> beta_sample)
{

  // --------- create random sample of locations and of residuals --------- //

    // initialise random seeds 
  std::random_device rd; // used to obtain a seed for the number engine
  std::mt19937 gen(rd()); // Mersenne Twister engine 

  // initialize distributions for random locations and residuals
  const int upperlocation = locationeffects.size();
  const int upperresiduals = residuals.size();

  std::uniform_int_distribution<> distrloc(1, upperlocation);
  std::uniform_int_distribution<> distrres(1, upperresiduals);

  // initialize and fill matrix for random locations and residuals 
  Eigen::MatrixXd LocationEffectResiduals(n_obs_censusdata, n_bootstrap);

  for (int i=0; i<n_obs_censusdata; ++i)
    for (int j=0; j<n_bootstrap; j++)
      LocationEffectResiduals(i,j) = locationeffects[distrloc(gen)-1] + residuals[distrres(gen)-1]; // subtract 1 because in C++ indices start with 0

  // ----- create Xbeta ------- //
    Eigen::MatrixXd Xbeta = X * beta_sample;

  // ----- combine results ------- //
    Eigen::MatrixXd returnmatrix = Xbeta + LocationEffectResiduals;

  return Rcpp::wrap(returnmatrix);
}
Nikos Bosse
    Not directly related: with mingw `std::random_device` is actually deterministic, giving you five instances that do the same thing, c.f. https://sourceforge.net/p/mingw-w64/bugs/338/ – Ralf Stubner Aug 21 '18 at 14:39
  • ah that is a very good point, thank you. Is there a way I can change the function within C++ to get around that? Or would I have to resort to other R packages like doRNG? – Nikos Bosse Aug 21 '18 at 14:49
  • and would that also be true for R functions like sample()? Obviously I could achieve the same result using R functions instead of C++, but at least sequentially I found the C++ version to be much faster. Maybe the switch back might be worth it if I can execute the tasks in parallel – Nikos Bosse Aug 21 '18 at 15:06
  • How do you execute this in parallel? Your code looks like you are using `sourceCpp()`, but that is not compatible with parallel usage, since `funC` will not be available on the workers. What is `no_cores` on your machine? `doRNG` should work when you switch to using R's RNG. Otherwise use an RNG designed for parallel work (e.g. https://cran.r-project.org/package=sitmo or https://cran.r-project.org/package=dqrng) with a combination of user supplied seed and worker number. – Ralf Stubner Aug 21 '18 at 15:46
  • I compile a package and use it with mypackage::funC. parallel::detectCores() yields 8, so no_cores = 6. OK. So I could for example create an int seed input in the C++ function and then make sure that every instance of funC that is called gets a different input for seed, right? – Nikos Bosse Aug 21 '18 at 16:07
  • Ok. With a package it works, but it is of course more difficult for us to test. Anyway, my guess is that the overhead of combining the resulting matrices is too costly. For creating a large random matrix you are probably better off using "shared memory parallelism". See https://cran.r-project.org/web/packages/dqrng/vignettes/parallel.html for starting points. – Ralf Stubner Aug 21 '18 at 17:26
  • I also used the same function in combination with bigstatsr::FBM and bigstatsr::big_apply. In this approach I never combined the matrices, but merely changed subsets of the FBM directly. Performance wise the result was no different. I will add the code later to the original post. – Nikos Bosse Aug 21 '18 at 18:05

1 Answer


Here you want to create one large matrix. Distributing this to several processes is possible in principle, but bears the cost of combining the results at the end. I suggest using "shared memory parallelism" here. I am using the OpenMP code from here as a starting point for a parallel version of your algorithm:

// [[Rcpp::depends(RcppEigen)]]
#include <RcppEigen.h>
// [[Rcpp::depends(dqrng)]]
#include <xoshiro.h>
// [[Rcpp::plugins(openmp)]]
#include <omp.h>
// [[Rcpp::plugins(cpp11)]]
#include <random>

// [[Rcpp::export]]
Eigen::MatrixXd funD(const int n_bootstrap,
                     const int n_obs_censusdata, 
                     const Eigen::Map<Eigen::VectorXd> locationeffects, 
                     const Eigen::Map<Eigen::VectorXd> residuals,
                     const Eigen::Map<Eigen::MatrixXd> X, 
                     const Eigen::Map<Eigen::MatrixXd> beta_sample,
                     int ncores) {

  // --------- create random sample of locations and of residuals --------- //

  // initialise random seeds 
  std::random_device rd; // used to obtain a seed for the number engine
  dqrng::xoshiro256plus gen(rd());

  // initialize distributions for random locations and residuals
  const int upperlocation = locationeffects.size();
  const int upperresiduals = residuals.size();

   // subtract 1 because in C++ indices start with 0
  std::uniform_int_distribution<> distrloc(0, upperlocation - 1);
  std::uniform_int_distribution<> distrres(0, upperresiduals - 1);

  // initialize and fill matrix for random locations and residuals 
  Eigen::MatrixXd LocationEffectResiduals(n_obs_censusdata, n_bootstrap);

  #pragma omp parallel num_threads(ncores)
  {
    dqrng::xoshiro256plus lgen(gen);      // make thread local copy of rng 
    lgen.jump(omp_get_thread_num() + 1);  // advance rng by 1 ... ncores jumps 

    #pragma omp for
    for (int i=0; i<n_obs_censusdata; ++i)
      for (int j=0; j<n_bootstrap; j++)
        LocationEffectResiduals(i,j) = locationeffects[distrloc(lgen)] + residuals[distrres(lgen)];
  }  

  // ----- create Xbeta ------- //
  Eigen::MatrixXd Xbeta = X * beta_sample;

  // ----- combine results ------- //
  Eigen::MatrixXd returnmatrix = Xbeta + LocationEffectResiduals;

  return returnmatrix;
}
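
You call it just like your funC, with ncores as the only extra argument. For example (assuming the code above were saved as funD.cpp; the file name is only for illustration):

Rcpp::sourceCpp("funD.cpp")  # hypothetical file name

y <- funD(n_bootstrap = 250, n_obs_censusdata = 300000,
          locationeffects = as.numeric(1:200),
          residuals = as.numeric(1:20000),
          X = matrix(as.numeric(1:3000000), ncol = 10),
          beta_sample = matrix(as.numeric(1:2500), ncol = 250),
          ncores = 2)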

On my dual-core Linux system, my funD with ncores = 1 is slightly faster than your funC, probably because the RNG used is faster. With ncores = 2 it gains another 30-40%. Not bad, given that not all of the code is executed in parallel. I don't know how good OpenMP performance is on Windows these days. It might make sense to use RcppParallel instead, but that would require more changes to your code.
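
If you want to measure this on your own machine, a quick comparison can be set up along these lines (a sketch only; the inputs are scaled down from your example so the runs stay short, and the microbenchmark package is assumed to be installed):

library(microbenchmark)

# smaller inputs than in the question, same shapes relative to each other
loc  <- as.numeric(1:200)
res  <- as.numeric(1:20000)
X    <- matrix(as.numeric(1:300000), ncol = 10)  # 30000 x 10
beta <- matrix(as.numeric(1:2500), ncol = 250)   # 10 x 250

microbenchmark(
  serial = funC(250, 30000, loc, res, X, beta),
  omp1   = funD(250, 30000, loc, res, X, beta, ncores = 1),
  omp2   = funD(250, 30000, loc, res, X, beta, ncores = 2),
  times = 10
)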

The above code is meant to be sourced with Rcpp::sourceCpp(). When you put this into a package, you should use

# enable C++11 and compile/link with OpenMP
CXX_STD = CXX11
PKG_CXXFLAGS = $(SHLIB_OPENMP_CXXFLAGS)
PKG_LIBS = $(SHLIB_OPENMP_CXXFLAGS)

in Makevars(.win). Note that according to WRE, this might not work as expected if a different compiler is used for C++11 than for C++98. IIRC, Solaris is the only platform where this is the case in the default configuration, so for an internal package you should be fine.

Ralf Stubner
  • You are marvelous, thank you. I will try to implement this later today. In this example I don't have to worry about thread safety, as I am only ever writing to one element at a time, right? Also is it correct that I don't have to do any setting up of workers in R outside Rcpp and just call X<- funD? – Nikos Bosse Aug 22 '18 at 05:18
  • Also just out of curiosity, had you also run funC and encountered the same problem? – Nikos Bosse Aug 22 '18 at 05:20
  • 1
    @NikosBosse Thread safety is an issue. You must use a thread safe data type (Eigen is fine AFAIK, the Rcpp types are not), and you must not write to the same element of the data structure for different indices in the loop. Yes, no need to setup a cluster or use foreach. No, I have not tried funC in parallel due to the need to setup a package. – Ralf Stubner Aug 22 '18 at 07:15
  • OK, perfect. Then if I leave the code like that it should be fine. And I suppose using bigstatsr::FBM and Xptr and the bigstatsr matrix Accessor should also be fine. Would there be a better way for me to post code examples next time? Thank you again very much for your help!! – Nikos Bosse Aug 22 '18 at 09:59
  • 1
    @NikosBosse Tell us how it worked out in the end. Concerning future questions: It makes sense to note if code has to be part of a package. This makes it more difficult to make a [mcve], though. One approach would be a repository on GitHub or similar, since that can be installed easily. – Ralf Stubner Aug 22 '18 at 14:52
  • hmm unfortunately I get the error message "ignoring #pragma omp parallel [-Wunknown-pragmas]". The compiler is c:/Rtools/mingw_64/bin/g++ -std=gnu++11 -I. Do you have an idea about that? I left the Makevars.win as what was produced by the RcppEigen skeleton: ## With Rcpp 0.11.0 and later, we no longer need to set PKG_LIBS as there is ## no user-facing library. The include path to headers is already set by R. #PKG_LIBS = ## With R 3.1.0 or later, you can uncomment the following line to tell R to ## enable compilation with C++11 (or even C++14) where available #CXX_STD = CXX11 – Nikos Bosse Aug 22 '18 at 16:15
  • 1
    @NikosBosse My code is meant to be used with `sourceCpp()`. In a package, the `Rcpp::dependes` and `Rcpp::plugins` attributes have no effect. See [WRE](https://cran.r-project.org/doc/manuals/r-release/R-exts.html#OpenMP-support) for the full documentation. In short: Add `SHLIB_OPENMP_CXXFLAGS` to `PKG_CXXFLAGS` and `PKG_LIBS`. – Ralf Stubner Aug 22 '18 at 19:37
  • 1
    @NikosBosse In addition you should uncomment the line `#CXX_STD = CXX11` to make it clear that you are using C++11. – Ralf Stubner Aug 22 '18 at 20:03
  • Unfortunately I have trouble grasping the ideas behind makefiles and also find the introductions hard to understand. I am therefore not even sure what qualifies as a comment, what is a line break, and how the different elements of the makefile are separated. Apparently I cannot just delete the above and write CXX_STD = CXX11 SHLIB_OPENMP_CXXFLAGS, so I am unsure where to add SHLIB_OPENMP_CXXFLAGS. The WRE manual also says this would not necessarily work for C++11; is that an issue? – Nikos Bosse Aug 22 '18 at 20:22
  • I can't thank you enough, that did the trick and also taught me something about makefiles. On my laptop I had a 30-40 % speed increase as well. Thank you very much! – Nikos Bosse Aug 23 '18 at 11:06