I am running code that uses the parallel package, and I would like to include a C++ implementation of part of this code using Rcpp. It seems that the question has been partially addressed:

  1. here: Using Rcpp function in parLapply on Windows
  2. and here: Using Rcpp functions inside of R's par*apply functions from the parallel package
  3. and in many other forums

Apparently, the recommended solution is to put the C++ implementation into a package, build it, and export the function from the package.

The following are the steps I took, presented as a simplified version of the original code so that anyone can reproduce them. I am working on Windows.

First, let's define the C++ implementation.

#include <Rcpp.h>
#include <numeric>
#include <math.h>
using namespace Rcpp;

NumericVector add(NumericVector x, NumericVector y)
{
  return (x+y);
}

// [[Rcpp::export]]
NumericVector f_sample_p(
    NumericVector x, 
    int nb = 5
)
{
  NumericVector x1;
  NumericVector x2;
  
  // new sample 
  x1 = sample(x, nb, true);
  x2 = sample(x, nb, true);
  
  return add(x1, x2);
}

The code contains two functions: add, which is not exported to R, and f_sample_p, which uses add and is exported to R. Given an input vector, this example draws two random samples with replacement and returns their element-wise sum. The original code has a similar structure: I cannot pack everything into a single function.

As a second step, I create the package skeleton with Rcpp.package.skeleton in a user-defined path.

library(Rcpp)

cpp_src_path <- "~/R/rcpp/sample_p.cpp" # where I store the c++ implementation

dest_path_of_pkg <- "~/R/rpkg/" # where I want to build the skeleton

Rcpp.package.skeleton(name = "mypkg", # the name of the package
                      list = character(), # I suppose I have to leave it like this?
                      path = dest_path_of_pkg, # here I set the path
                      force = TRUE, 
                      code_files = character(), # because I don't have any R code
                      cpp_files = cpp_src_path, # set the cpp source
                      example_code = FALSE, attributes = FALSE, module = FALSE
                      )

# if I set attributes = FALSE then I have to compile the attributes
compileAttributes("~/R/rpkg/mypkg", TRUE)

I left the list parameter empty because setting it to "f_sample_p" does not work. I am also unsure about the other parameters. Still, the call works and creates the skeleton. However, I have to manually edit the NAMESPACE file and change export("Rcpp.fake.fun") to exportPattern("^[[:alpha:]]+") to prevent the following error:

Error: package or namespace load failed for 'mypkg' in namespaceExport(ns, exports):
   undefined exports: Rcpp.fake.fun

Alternatively, one could export the function directly with export("f_sample_p"), since it is the only function; exportPattern is just a shortcut. Is there a way to set this through the parameters of Rcpp.package.skeleton instead of manually editing the NAMESPACE file?
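The manual NAMESPACE edit could also be scripted from R. The following is only a sketch: fix_namespace is a hypothetical helper (not part of Rcpp), and it assumes the generated file contains the literal line export("Rcpp.fake.fun") as described above. It does not settle whether Rcpp.package.skeleton itself offers a parameter for this.

```r
# Hypothetical helper (not an Rcpp function): swap the placeholder export
# for the real function name and leave every other NAMESPACE line untouched.
fix_namespace <- function(ns_file, fun_name) {
  ns <- readLines(ns_file)
  ns <- sub('export("Rcpp.fake.fun")',
            sprintf('export("%s")', fun_name),
            ns, fixed = TRUE)
  writeLines(ns, ns_file)
  invisible(ns)
}

# Assumed location, taken from the skeleton call above:
# fix_namespace("~/R/rpkg/mypkg/NAMESPACE", "f_sample_p")
```

Run it after compileAttributes() and before install.packages(), so the installed package picks up the corrected export.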

As a third step, I install the package into a user-defined library path.

lib_path <- "~/R/rlib/"

install.packages("~/R/rpkg/mypkg", # where is the package
                 lib=lib_path, # where I want to build it
                 repos=NULL, # NULL because I install from local files
                 type = "source") # from the skeleton and is not a zipped tarball

Now the package is ready to use.

require("mypkg", "~/R/rlib/")
f_sample_p(1:15, 8)
# [1] 16  8  5 21  6 17 21 24

This returns 8 random values, each the sum of two elements sampled from the vector 1 to 15.

As a final step, I am ready to use this function in my R code. The following is also a simplification of my actual code. It applies the function sim_function, which calls f_sample_p, many times (say 100) using lapply.

library(parallel)

sim_function <- function(n){
  # define the simulation function to be used in lapply
  z <- f_sample_p(n, 8) # samples 8 elements from the vector n
  sum(z) # return the sum
}

# prepare the input: this will loop 100 times in the simulation function
x <- runif(100, 1, 10) 

# this works well in classic lapply
result <- lapply(x, sim_function)

# but it won't with parLapply
cl <- makeCluster(2)
clusterExport(cl = cl, varlist = c(
  "f_sample_p" # the variable list
))

result <- parLapply(cl, x, sim_function) # this generates the error

This last line generates the following error:

Error in checkForRemoteErrors(val) : 
  2 nodes produced errors; first error: object '_mypkg_f_sample_p' not found

It seems that parLapply looks for _mypkg_f_sample_p instead of f_sample_p. In fact, printing f_sample_p shows that it is an R wrapper that calls _mypkg_f_sample_p through .Call:

f_sample_p

function (x, nb = 5L) 
{
    .Call(`_mypkg_f_sample_p`, x, nb)
}
<bytecode: 0x0000020e09d70448>
<environment: namespace:mypkg>

So the compiled function generated at build time has a different name from the R function. If you look into RcppExports.cpp (in the src folder), you find that it contains:

...
// f_sample_p
NumericVector f_sample_p(NumericVector x, int nb);
RcppExport SEXP _mypkg_f_sample_p(SEXP xSEXP, SEXP nbSEXP) {
...
}

Is there a way to change the export name?

If I try to include _mypkg_f_sample_p in the clusterExport() call, R cannot find it in the environment:

clusterExport(cl = cl, varlist = c(
  "f_sample_p", "_mypkg_f_sample_p" # the variable list
))

Giving the error:

Error in get(name, envir = envir) : object '_mypkg_f_sample_p' not found
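One way to narrow this down is to check whether the workers can see the package at all, since each worker is a separate R session with its own library paths. The following is a diagnostic sketch only; worker_can_load is a hypothetical helper name, and "mypkg" is the package name used above.

```r
library(parallel)

# Hypothetical helper, defined at top level so that only the function
# itself is shipped to the workers.
worker_can_load <- function(pkg) requireNamespace(pkg, quietly = TRUE)

cl <- makeCluster(2)
master_paths <- .libPaths()                             # library paths of this session
worker_paths <- clusterEvalQ(cl, .libPaths())           # library paths of each worker
can_load <- clusterCall(cl, worker_can_load, "mypkg")   # TRUE/FALSE per worker
stopCluster(cl)

print(master_paths)
print(worker_paths)
print(unlist(can_load))
```

If the workers report FALSE, they cannot find the package in their library paths, which would be consistent with the missing _mypkg_f_sample_p symbol.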

As I said, I have checked many other posts and docs but without success. I am out of ammo. Any idea?

Edit about a month later: Thinking about the function passed to parLapply(), I imagined (wrongly, as I show below) that I could work around the problem through the list of variables made available to sim_function. All variables used explicitly within sim_function are listed in the varlist parameter of clusterExport.

For example, this works:

sim_function <- function(n){
  # define the simulation function to be used in lapply
  z <- rep(n, 8) # <= I have replaced f_sample_p() with rep() 
  sum(z) # return the sum
}

The result will obviously not be the same, but technically it works: the workers accept rep() with no questions asked. Maybe, if I pass a wrapper for my f_sample_p function, I can get around the variable defined within the scope of sim_function().

wrap_sample_p <- function(n1, n2){
  f_sample_p(n1, n2) # this is now defined outside the scope of sim_function
}
sim_function <- function(n){
  # define the simulation function to be used in lapply
  z <- wrap_sample_p(n, 8) # I pass the wrapper function
  sum(z) # return the sum
}

Then I call:

clusterExport(cl = cl, varlist = c(
  "wrap_sample_p" # the variable list
))
result <- parLapply(cl, x, sim_function) # this generates the following error
Error in checkForRemoteErrors(val) : 
  2 nodes produced errors; first error: could not find function "f_sample_p"

So it seems to care not only about the scope of sim_function but also about the environment. I then tried exporting both functions:

clusterExport(cl = cl, varlist = c(
  "f_sample_p", "wrap_sample_p" # the variable list
))
result <- parLapply(cl, x, sim_function) # this generates the following error
Error in checkForRemoteErrors(val) : 
  2 nodes produced errors; first error: object '_mypkg_f_sample_p' not found

Still, it is looking for _mypkg_f_sample_p.

  • You need @Dirk Eddelbuettel. – David J. Bosak Oct 28 '21 at 03:46
  • This might be simple: parallel R instances probably don't know about your R package, because they don't know the library path "~/R/rlib/". You could install your package in the default library path. Alternatively `require` the R package within the loop (or use `clusterEvalQ`). – thc Oct 29 '21 at 17:39
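The suggestion in the comment above can be sketched as follows. This is a minimal sketch under assumptions, not a confirmed fix: load_on_worker and par_with_pkg are hypothetical helper names, and the usage line assumes mypkg is installed in "~/R/rlib/" as in the question.

```r
library(parallel)

# Hypothetical helper, defined at top level so that only the function
# itself is shipped to the workers.
load_on_worker <- function(pkg, lib_path = NULL) {
  # lib.loc = NULL makes library() search the worker's default .libPaths()
  library(pkg, character.only = TRUE, lib.loc = lib_path)
  TRUE
}

# Hypothetical wrapper: start workers, load `pkg` on each, run `fun` over `x`.
par_with_pkg <- function(x, fun, pkg, lib_path = NULL, n_workers = 2) {
  cl <- makeCluster(n_workers)
  on.exit(stopCluster(cl), add = TRUE)   # always shut the workers down
  clusterCall(cl, load_on_worker, pkg, lib_path)
  parLapply(cl, x, fun)
}

# Assumed usage with the names from the question (requires mypkg in ~/R/rlib/):
# result <- par_with_pkg(x, sim_function, pkg = "mypkg", lib_path = "~/R/rlib/")
```

With the package attached on each worker, f_sample_p should not need to be listed in clusterExport at all. On an existing cluster, the more direct form would be clusterEvalQ(cl, library(mypkg, lib.loc = "~/R/rlib/")).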