I have an Rcpp function that reads a large BAM file (1-20 GB, using htslib) and creates several very long std::vector objects (up to 80M elements each). The number of elements is not known before reading, so I cannot use Rcpp::IntegerVector and Rcpp::CharacterVector. As far as I understand, when I Rcpp::wrap them for further use, a copy is created. Is there a way to speed up the transfer of data from C++ to R in this situation? Is there a data structure that can be created within an Rcpp function, is as quick to push_back elements into as std::vector, and can be passed to R by reference?
Just in case, here's how I create them currently:
std::vector<std::string> seq, xm;
std::vector<int> rname, strand, start;
And here's how I wrap and return them:
Rcpp::IntegerVector w_rname = Rcpp::wrap(rname);
w_rname.attr("class") = "factor";
w_rname.attr("levels") = chromosomes; // chromosomes contain names of the reference sequences from BAM
Rcpp::IntegerVector w_strand = Rcpp::wrap(strand);
w_strand.attr("class") = "factor";
w_strand.attr("levels") = strands; // std::vector<std::string> strands = {"+", "-"};
Rcpp::DataFrame res = Rcpp::DataFrame::create(
  Rcpp::Named("rname") = w_rname,
  Rcpp::Named("strand") = w_strand,
  Rcpp::Named("start") = start,
  Rcpp::Named("seq") = seq,
  Rcpp::Named("XM") = xm
);
return res;
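A side note on the factor construction above: R factor codes are 1-based and must lie between 1 and the number of levels, with NA_integer_ (stored as INT_MIN) for missing values. The sketch below is plain C++ with no Rcpp dependency. It assumes rname was filled with htslib's 0-based core.tid, where -1 marks an unmapped read; the code above doesn't show how the vector is filled, so this is only an assumption. The names to_factor_codes and kNaInteger are hypothetical.

```cpp
#include <cassert>
#include <climits>
#include <vector>

// R stores NA_integer_ as INT_MIN; defined locally so the sketch compiles without R headers.
constexpr int kNaInteger = INT_MIN;

// Convert 0-based reference ids (assumed input: htslib's core.tid, -1 = unmapped)
// into 1-based R factor codes, in place.
void to_factor_codes(std::vector<int>& ids, int n_levels) {
    for (int& id : ids) {
        if (id < 0) {
            id = kNaInteger;        // unmapped read -> NA in R
        } else {
            assert(id < n_levels);  // a code must not exceed the number of levels
            id += 1;                // 0-based id -> 1-based factor code
        }
    }
}
```

If the ids are already shifted while reading, nothing needs to change. The same consideration would apply to strand if it is stored as 0/1.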
Edit 1 (2021.10.19):
Thanks to everyone for the comments. I need more time to check whether stringfish can be used, but I ran a slightly modified test from the cpp11 package vignettes to compare cpp11 with std::vector. Here are the code and the results (showing that std::vector<int> is still faster even though it must be Rcpp::wrap-ped upon return):
Rcpp::cppFunction('
#include <Rcpp.h>
using namespace Rcpp;
//[[Rcpp::export]]
std::vector<int> stdint_grow_(SEXP n_sxp) {
  R_xlen_t n = REAL(n_sxp)[0];
  std::vector<int> x;
  R_xlen_t i = 0;
  while (i < n) {
    x.push_back(i++);
  }
  return x;
}')
library(cpp11test)
grid <- expand.grid(len = 10 ^ (0:7), pkg = c("cpp11", "stdint"), stringsAsFactors = FALSE)
b_grow <- bench::press(.grid = grid,
  {
    fun = match.fun(sprintf("%sgrow_", ifelse(pkg == "cpp11", "", paste0(pkg, "_"))))
    bench::mark(
      fun(len)
    )
  }
)[c("len", "pkg", "min", "mem_alloc", "n_itr", "n_gc")]
print(b_grow, n=Inf)
# A tibble: 12 × 6
        len pkg         min mem_alloc n_itr  n_gc
      <dbl> <chr>  <bch:tm> <bch:byt> <int> <dbl>
 1      100 cpp11     1.9µs    1.89KB  9999     1
 2     1000 cpp11     6.1µs   16.03KB  9999     1
 3    10000 cpp11   58.11µs  256.22KB  7267    12
 4   100000 cpp11  488.15µs       2MB   815    11
 5  1000000 cpp11    4.34ms      16MB    88    14
 6 10000000 cpp11   97.39ms     256MB     4     5
 7      100 stdint    1.6µs    2.93KB 10000     0
 8     1000 stdint   3.36µs    6.45KB  9998     2
 9    10000 stdint  19.87µs    41.6KB  9998     2
10   100000 stdint 181.88µs  393.16KB  2571     4
11  1000000 stdint   1.91ms    3.82MB   213     3
12 10000000 stdint  36.09ms   38.15MB     9     1
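Edit 2:
std::vector<std::string> is marginally slower than cpp11::writable::strings at the larger sizes in these test conditions, but looks more memory-efficient by mem_alloc (keep in mind that bench only tracks allocations made on the R heap, so memory that std::vector obtains through new/malloc is not counted in that column):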
Rcpp::cppFunction('
#include <Rcpp.h>
using namespace Rcpp;
//[[Rcpp::export]]
std::vector<std::string> stdstr_grow_(SEXP n_sxp) {
  R_xlen_t n = REAL(n_sxp)[0];
  std::vector<std::string> x;
  R_xlen_t i = 0;
  while (i++ < n) {
    std::string s (i, 33);
    x.push_back(s);
  }
  return x;
}')
cpp11::cpp_source(code='
#include "cpp11/strings.hpp"
[[cpp11::register]] cpp11::writable::strings cpp11str_grow_(R_xlen_t n) {
  cpp11::writable::strings x;
  R_xlen_t i = 0;
  while (i++ < n) {
    std::string s (i, 33);
    x.push_back(s);
  }
  return x;
}
')
library(cpp11test)
grid <- expand.grid(len = 10 ^ (0:5), pkg = c("cpp11str", "stdstr"), stringsAsFactors = FALSE)
b_grow <- bench::press(.grid = grid,
  {
    fun = match.fun(sprintf("%sgrow_", ifelse(pkg == "cpp11", "", paste0(pkg, "_"))))
    bench::mark(
      fun(len)
    )
  }
)[c("len", "pkg", "min", "mem_alloc", "n_itr", "n_gc")]
print(b_grow, n=Inf)
# A tibble: 12 × 6
      len pkg           min mem_alloc n_itr  n_gc
    <dbl> <chr>    <bch:tm> <bch:byt> <int> <dbl>
 1      1 cpp11str   1.22µs        0B 10000     0
 2     10 cpp11str   3.02µs        0B  9999     1
 3    100 cpp11str     22µs    1.89KB  9997     3
 4   1000 cpp11str 765.28µs  541.62KB   602     2
 5  10000 cpp11str  66.69ms   47.91MB     8     0
 6 100000 cpp11str    6.83s    4.62GB     1     0
 7      1 stdstr     1.38µs    2.49KB 10000     0
 8     10 stdstr     1.86µs    2.49KB 10000     0
 9    100 stdstr    16.44µs    3.32KB 10000     0
10   1000 stdstr   898.23µs   10.35KB   511     0
11  10000 stdstr    73.55ms   80.66KB     7     0
12 100000 stdstr      7.54s  783.79KB     1     0
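A caveat for reading these string timings: in both functions string number i has length i, so the total number of characters grows quadratically, n(n+1)/2, which is about 5×10^9 characters for n = 10^5. The multi-second timings at that size are therefore probably dominated by filling and copying gigabytes of characters rather than by container overhead. Below is a minimal standalone sketch in plain C++ (no Rcpp or cpp11) that computes the payload and runs the same growth loop, constructing each string in place instead of copying a temporary. The function names payload_chars, grow_in_place and stored_chars are hypothetical and not part of the benchmark.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Closed form for 1 + 2 + ... + n: characters stored when string i has length i.
std::uint64_t payload_chars(std::uint64_t n) {
    return n * (n + 1) / 2;
}

// Same growth pattern as the benchmark loop, but each string is constructed
// in place rather than built as a temporary and then copied by push_back.
std::vector<std::string> grow_in_place(std::size_t n) {
    std::vector<std::string> x;
    for (std::size_t i = 1; i <= n; ++i) {
        x.emplace_back(i, '!');  // '!' is character code 33, as in the benchmark
    }
    assert(x.size() == n);       // one string per iteration
    return x;
}

// Count the characters actually held by the vector.
std::uint64_t stored_chars(const std::vector<std::string>& x) {
    std::uint64_t total = 0;
    for (const std::string& s : x) total += s.size();
    return total;
}
```

For example, payload_chars(100000) evaluates to 5000050000, and stored_chars(grow_in_place(100)) equals payload_chars(100). Fixed-length strings closer to real read lengths would likely give a more representative comparison for the BAM use case.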
Solution (2022.01.12):
... for those who have a similar question. In this particular case I didn't need to use the std::vector data within R, so XPtr easily solved my problem, cutting BAM loading time nearly in half. The pointers are created:
std::vector<std::string>* seq = new std::vector<std::string>;
std::vector<std::string>* xm = new std::vector<std::string>;
and then stored as data.frame attributes:
Rcpp::DataFrame res = Rcpp::DataFrame::create(
  Rcpp::Named("rname") = w_rname,
  Rcpp::Named("strand") = w_strand,
  Rcpp::Named("start") = start
);
Rcpp::XPtr<std::vector<std::string>> seq_xptr(seq, true);
res.attr("seq_xptr") = seq_xptr;
Rcpp::XPtr<std::vector<std::string>> xm_xptr(xm, true);
res.attr("xm_xptr") = xm_xptr;
and reused elsewhere as follows:
Rcpp::XPtr<std::vector<std::string>> seq((SEXP)df.attr("seq_xptr"));
Rcpp::XPtr<std::vector<std::string>> xm((SEXP)df.attr("xm_xptr"));
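For readers unfamiliar with the pattern: passing true as the second XPtr argument registers a finalizer, so the vector is deleted when R garbage-collects the external pointer. R only ever holds the pointer, and the strings are touched again only from C++. The sketch below is a plain C++ analogy of that idea with no Rcpp dependency: std::unique_ptr stands in for the XPtr as the single owner, and the SeqStore alias and the helpers make_store, total_length and copy_slice are hypothetical names, not part of my package.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

using SeqStore = std::vector<std::string>;

// Stand-in for the XPtr: a single owner of the heap-allocated vector,
// which is freed when the owner is destroyed.
std::unique_ptr<SeqStore> make_store() {
    return std::unique_ptr<SeqStore>(new SeqStore());
}

// A consumer that works on the stored strings by reference: no copies are made.
std::size_t total_length(const SeqStore& store, std::size_t from, std::size_t to) {
    assert(from <= to && to <= store.size());
    std::size_t total = 0;
    for (std::size_t i = from; i < to; ++i) total += store[i].size();
    return total;
}

// Materialize only the requested rows; the cost scales with (to - from),
// not with the size of the whole store.
std::vector<std::string> copy_slice(const SeqStore& store, std::size_t from, std::size_t to) {
    assert(from <= to && to <= store.size());
    return std::vector<std::string>(store.begin() + from, store.begin() + to);
}
```

With the real XPtr, the equivalent of *store is *seq (or seq->at(i) for a single element), so a later C++ function can either process the reads in place or convert just a slice to an R character vector when one is actually needed.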