I am running code that uses the parallel package, and I would like to implement part of it in C++ using Rcpp. The question seems to have been partially addressed:
- here: Using Rcpp function in parLapply on Windows
- and here: Using Rcpp functions inside of R's par*apply functions from the parallel package
- and in many other forums
Apparently, the recommended solution is to put the C++ implementation into a package, build it, and export the function from the package.
Below are the steps I took, presented with a simplified version of the original code so that anyone can reproduce them. I am working on Windows.
First, let's define the C++ implementation.
#include <Rcpp.h>
#include <numeric>
#include <math.h>
using namespace Rcpp;
NumericVector add(NumericVector x, NumericVector y)
{
    return (x + y);
}

// [[Rcpp::export]]
NumericVector f_sample_p(
    NumericVector x,
    int nb = 5
)
{
    NumericVector x1;
    NumericVector x2;
    // new sample
    x1 = sample(x, nb, true);
    x2 = sample(x, nb, true);
    return add(x1, x2);
}
The code contains two functions: add, which is not exported to R, and f_sample_p, which uses add and is exported to R. This example generates a random sample in a fancy way from an input vector. The original code has a similar structure, in which I cannot pack everything into a single function.
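For comparison, the logic of f_sample_p can be approximated in plain R. This is only a sketch: f_sample_p_r is a hypothetical name that is not part of the package. It indexes with sample.int() because base sample() treats a single number x as the range 1:x, whereas the Rcpp version samples from the elements of the vector it receives.

```r
# Plain-R approximation of f_sample_p (hypothetical helper, not part of mypkg)
f_sample_p_r <- function(x, nb = 5L) {
  # draw nb positions with replacement, twice, and index into x
  x1 <- x[sample.int(length(x), nb, replace = TRUE)]
  x2 <- x[sample.int(length(x), nb, replace = TRUE)]
  x1 + x2  # element-wise sum, like add() in the C++ code
}

set.seed(1)                  # only to make the draw repeatable
z <- f_sample_p_r(1:15, 8)
length(z)                    # 8 values, each between 2 and 30
```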
As a second step, I build the package skeleton using Rcpp.package.skeleton in a user-defined path.
library(Rcpp)
cpp_src_path <- "~/R/rcpp/sample_p.cpp" # where I store the c++ implementation
dest_path_of_pkg <- "~/R/rpkg/" # where I want to build the skeleton
Rcpp.package.skeleton(name = "mypkg",            # the name of the package
                      list = character(),        # I suppose I have to leave it like this?
                      path = dest_path_of_pkg,   # here I set the path
                      force = TRUE,
                      code_files = character(),  # because I don't have any R code
                      cpp_files = cpp_src_path,  # set the cpp source
                      example_code = FALSE, attributes = FALSE, module = FALSE
)
# if I set attributes = FALSE then I have to compile the attributes
compileAttributes("~/R/rpkg/mypkg", TRUE)
I left the parameter list empty because setting it to "f_sample_p" does not work. I am also unsure about the other parameters. Still, the call works and creates the skeleton. However, I have to manually edit the NAMESPACE file and change export("Rcpp.fake.fun") to exportPattern("^[[:alpha:]]+") to prevent the following error:
Error: package or namespace load failed for 'mypkg' in namespaceExport(ns, exports):
undefined exports: Rcpp.fake.fun
Alternatively, one could directly use export("f_sample_p"), since it is the only function; the exportPattern method is a shortcut. Is there a way to set this through the parameters of Rcpp.package.skeleton instead of manually editing the NAMESPACE file?
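The manual edit could also be scripted so that the whole build stays reproducible. The sketch below is not an option of Rcpp.package.skeleton: fix_namespace is a hypothetical helper that rewrites the placeholder export line in place and leaves every other line of NAMESPACE untouched.

```r
# Hypothetical helper: swap the placeholder export for the real function name
fix_namespace <- function(pkg_dir, fun_name = "f_sample_p") {
  ns_file <- file.path(pkg_dir, "NAMESPACE")
  ns <- readLines(ns_file)
  # replace only the line that exports the placeholder function
  ns <- sub('export("Rcpp.fake.fun")',
            sprintf('export("%s")', fun_name),
            ns, fixed = TRUE)
  writeLines(ns, ns_file)
  invisible(ns)
}

# intended to run after Rcpp.package.skeleton() and before install.packages():
# fix_namespace("~/R/rpkg/mypkg")
```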
As a third step, I install the package into a user-defined library path.
lib_path <- "~/R/rlib/"
install.packages("~/R/rpkg/mypkg",  # where the package is
                 lib = lib_path,    # where I want to install it
                 repos = NULL,      # NULL because I install from local files
                 type = "source")   # from the skeleton, not a zipped tarball
Now the package is ready to use.
require("mypkg", "~/R/rlib/")
f_sample_p(1:15, 8)
# [1] 16 8 5 21 6 17 21 24
This generates 8 random numbers from the vector 1 to 15.
As a final step, I am ready to use this function in my R code. The following is again a simplification of my actual code. It runs sim_function, which calls f_sample_p, many times (say 100), and does so with lapply.
library(parallel)
sim_function <- function(n){
  # define the simulation function to be used in lapply
  z <- f_sample_p(n, 8) # samples 8 elements from the vector n
  sum(z)                # return the sum
}
# prepare the input: this will loop 100 times in the simulation function
x <- runif(100, 1, 10)
# this works well in classic lapply
result <- lapply(x, sim_function)
# but it won't with parLapply
cl <- makeCluster(2)
clusterExport(cl = cl, varlist = c(
  "f_sample_p" # the variable list
))
result <- parLapply(cl, x, sim_function) # this generates the error
This last line generates the following error:
Error in checkForRemoteErrors(val) :
2 nodes produced errors; first error: object '_mypkg_f_sample_p' not found
It seems that parLapply expects the function _mypkg_f_sample_p instead of f_sample_p. In fact, printing f_sample_p shows the following:
f_sample_p
function (x, nb = 5L)
{
.Call(`_mypkg_f_sample_p`, x, nb)
}
<bytecode: 0x0000020e09d70448>
<environment: namespace:mypkg>
In fact, the exported function is given a different name at build time. If you look into RcppExports.cpp (in the src folder), you find that it contains:
...
// f_sample_p
NumericVector f_sample_p(NumericVector x, int nb);
RcppExport SEXP _mypkg_f_sample_p(SEXP xSEXP, SEXP nbSEXP) {
...
}
Is there a way to change the export name?
If I try to include _mypkg_f_sample_p in clusterExport(), R does not find it in the environment:
clusterExport(cl = cl, varlist = c(
"f_sample_p", "_mypkg_f_sample_p" # the variable list
))
Giving the error:
Error in get(name, envir = envir) : object '_mypkg_f_sample_p' not found
As I said, I have checked many other posts and docs, but without success. I am out of ammo. Any ideas?
Edit, about a month later: thinking about the function passed to parLapply(), I imagined (wrongly, as I show below) that I could trick it through the list of variables made available to sim_function. All variables used explicitly within sim_function are listed in the varlist parameter of clusterExport.
For example, this works:
sim_function <- function(n){
  # define the simulation function to be used in lapply
  z <- rep(n, 8) # <= I have replaced f_sample_p() with rep()
  sum(z)         # return the sum
}
The result will obviously not be the same, but technically it works: clusterExport will accept rep(), no questions asked. Maybe, if I pass a wrapper for my f_sample_p function, I can get around the variable defined within the scope of sim_function().
wrap_sample_p <- function(n1, n2){
  f_sample_p(n1, n2) # this is now defined outside the scope of sim_function
}
sim_function <- function(n){
  # define the simulation function to be used in lapply
  z <- wrap_sample_p(n, 8) # I pass the wrapper function
  sum(z)                   # return the sum
}
Then I call:
clusterExport(cl = cl, varlist = c(
"wrap_sample_p" # the variable list
))
result <- parLapply(cl, x, sim_function) # this generates the following error
Error in checkForRemoteErrors(val) :
2 nodes produced errors; first error: could not find function "f_sample_p"
Yet, it seems to care not only about the scope of sim_function but also about the environment. And the following:
clusterExport(cl = cl, varlist = c(
"f_sample_p", "wrap_sample_p" # the variable list
))
result <- parLapply(cl, x, sim_function) # this generates the following error
Error in checkForRemoteErrors(val) :
2 nodes produced errors; first error: object '_mypkg_f_sample_p' not found
Still, it is looking for _mypkg_f_sample_p.
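One explanation that would fit both errors is that the worker processes never load mypkg, because it is installed in a library (~/R/rlib/) that is not on their default library path, so the exported copy of f_sample_p arrives without its namespace. This is an assumption and has not been verified here. Under that assumption, a minimal sketch is to attach the package on every worker with clusterEvalQ() instead of exporting the function. It requires mypkg to be installed as in the third step.

```r
library(parallel)

# same simulation function and input as above
sim_function <- function(n) {
  z <- f_sample_p(n, 8)
  sum(z)
}
x <- runif(100, 1, 10)

lib_path <- "~/R/rlib/"   # library where mypkg was installed
cl <- makeCluster(2)

# make the library path known on the workers, then attach mypkg there
clusterExport(cl, "lib_path")
clusterEvalQ(cl, library("mypkg", lib.loc = lib_path))

# f_sample_p should now be found through the attached package on each worker,
# so it is no longer listed in clusterExport()
result <- parLapply(cl, x, sim_function)

stopCluster(cl)           # release the worker processes
```

If this runs, result should be a list of 100 sums, as with the lapply version.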