Problem
I am looking for a fast (ideally constant-time) way to take a large slice of a long raw vector in R. For example:
obj <- raw(2^32)
obj[seq_len(2^31 - 1)]
Even with ALTREP, base R takes too long.
system.time(obj[seq_len(2^31 - 1)])
#> user system elapsed
#> 19.470 38.853 148.288
Why?
Because I am trying to speed up storr in order to speed up drake. I want storr to save long raw vectors more quickly. writeBin() is super fast, but it still cannot handle vectors more than 2^31 - 1 bytes long. So I want to save the data in manageable chunks as described here. This almost works, but creating the chunks is too slow, and it duplicates too much data in memory.
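To make the bottleneck concrete, here is a minimal sketch of that chunked approach (the helper name write_chunked and the default chunk size are my own, not from storr). The R-level slice x[seq(from, to)] is exactly the slow, memory-duplicating step I want to eliminate:
write_chunked <- function(x, path, chunk_size = 2^31 - 1) {
  con <- file(path, "wb")
  on.exit(close(con))
  n <- length(x)
  from <- 1
  while (from <= n) {
    to <- min(from + chunk_size - 1, n)
    writeBin(x[seq(from, to)], con) # Slow: this slice copies the whole chunk.
    from <- to + 1
  }
}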
Ideas
Let's create a function
slice_raw <- function(obj, from, to) {
# ???
}
which is essentially equivalent to
obj[seq(from, to, by = 1L)]
and which is O(1) in both time and memory. In theory, all we should need to do is:
- Pass obj to a C function.
- Create a new pointer to the first byte of obj.
- Increment the new pointer to the start of the slice.
- Create a RAWSXP at the new pointer with the appropriate length (less than 2^31 bytes).
- Return the RAWSXP.
I have a background in C, but I struggle to take full control of R's internals. I would like to access the C pointers inside SEXPs so I can do basic pointer arithmetic and create R vectors of known lengths from undecorated C pointers. The resources I found on R's C internals do not seem to explain how to wrap or unwrap pointers. Do we need Rcpp for this?
The following rough sketch gets at what I am trying to do.
library(inline)
sig <- c(
  x = "raw",          # Long raw vector with more than 2^31 - 1 bytes.
  start = "integer",  # Should probably be R_xlen_t.
  bytes = "integer"   # <= 2^31 - 1. Ideally coercible to R_xlen_t.
)
body <- "
  Rbyte* result;            // Just a reference. Want to avoid copying data.
  result = RAW(x) + start;  // Trying to do ordinary pointer arithmetic.
  return asRaw(result);     // Want to return a raw vector of length `bytes`.
"
slice_raw <- cfunction(sig = sig, body = body)
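For completeness, here is a version of that sketch which actually compiles; it is only my own rough attempt, not a definitive solution. As far as I can tell, a true zero-copy RAWSXP is impossible without ALTREP, because a raw vector's data is allocated inline right after its SEXP header, so this version settles for a single memcpy into a freshly allocated result: O(bytes) in time and memory, but with none of the overhead of obj[seq(from, to)]. Here start is a zero-based offset passed as a double so it can exceed 2^31 - 1.
library(inline)
sig <- c(
  x = "raw",         # Long raw vector.
  start = "numeric", # Zero-based offset; a double can exceed 2^31 - 1.
  bytes = "numeric"  # Number of bytes in the slice.
)
body <- "
  R_xlen_t offset = (R_xlen_t) asReal(start);
  R_xlen_t n = (R_xlen_t) asReal(bytes);
  SEXP out = PROTECT(allocVector(RAWSXP, n));
  memcpy(RAW(out), RAW(x) + offset, (size_t) n); // One bulk copy, no index vector.
  UNPROTECT(1);
  return out;
"
slice_raw <- cfunction(sig = sig, body = body, includes = "#include <string.h>")
slice_raw(as.raw(1:10), start = 2, bytes = 4)
#> [1] 03 04 05 06
A real implementation would also need a bounds check (offset + n <= XLENGTH(x)) and should reject negative values of start and bytes.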
EDIT: some more potential workarounds
Thanks to Dirk for spurring my thinking on this one. For small enough data, we can use fst to save a single-column data frame, where the column is the raw vector we actually care about. This use of fst is faster than writeBin():
library(fst)
wrapper <- data.frame(actual_data = raw(2^31 - 1))
system.time(write_fst(wrapper, tempfile()))
#> user system elapsed
#> 0.362 0.019 0.103
system.time(writeBin(wrapper$actual_data, tempfile()))
#> user system elapsed
#> 0.314 1.340 1.689
Created on 2019-06-16 by the reprex package (v0.3.0)
Unfortunately, it is difficult to create data frames with 2^31 or more rows. One hack is to convert the raw vector into a matrix first: since each dimension can be up to 2^31 - 1, we avoid the usual integer overflow until (2^31 - 1)^2 bytes, which is several exabytes.
library(fst)
x <- raw(2^32)
m <- matrix(x, nrow = 2^16, ncol = 2^16)
system.time(write_fst(as.data.frame(m), tempfile()))
#> user system elapsed
#> 8.776 1.459 9.519
Created on 2019-06-16 by the reprex package (v0.3.0)
We still leave saveRDS() in the dust, but we no longer beat writeBin(). The conversion from a matrix to a data frame is slow, and I am not sure it would scale well.
library(fst)
x <- raw(2^30)
m <- matrix(x, nrow = 2^15, ncol = 2^15)
system.time(write_fst(as.data.frame(m), tempfile()))
#> user system elapsed
#> 1.998 0.408 2.409
system.time(writeBin(as.raw(m), tempfile()))
#> user system elapsed
#> 0.329 0.839 1.397
Created on 2019-06-16 by the reprex package (v0.3.0)
If, as Dirk suggested, we can use an R_xlen_t to index the rows of a data frame, we might be able to avoid converting anything.