
Problem

I am looking for a fast (ideally constant-time) way to take a large slice of a long raw vector in R. For example:

obj <- raw(2^32)
obj[seq_len(2^31 - 1)]

Even with ALTREP, base R takes too long.

system.time(obj[seq_len(2^31 - 1)])
#>    user  system elapsed 
#>  19.470  38.853 148.288 

Why?

Because I am trying to speed up storr in order to speed up drake. I want storr to save long raw vectors more quickly. writeBin() is super fast, but it still cannot handle vectors longer than 2^31 - 1 bytes. So I want to save the data in manageable chunks as described here. This almost works, but creating the chunks is too slow, and it duplicates too much data in memory.
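
For concreteness, here is a rough sketch of that chunked approach (the helper name and chunk size are placeholders, not actual storr code). Each chunk is small enough for writeBin(), but the x[seq(from, to)] slice on every iteration is exactly the slow, duplicating step I want to eliminate.

write_chunks <- function(x, path, chunk_size = 2^30) {
  con <- file(path, "wb")
  on.exit(close(con))
  from <- 1
  while (from <= length(x)) {
    to <- min(from + chunk_size - 1, length(x))
    writeBin(x[seq(from, to)], con)  # materializes a copy of each chunk
    from <- to + 1
  }
}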

Ideas

Let's create a function

slice_raw <- function(obj, from, to) {
  # ???
}

which is essentially equivalent to

obj[seq(from, to, by = 1L)]

and which is O(1) in both time and memory. In theory, all we should need to do is

  1. Pass obj to a C function.
  2. Create a new pointer to the first byte of obj.
  3. Increment the new pointer to the start of the slice.
  4. Create a RAWSXP at the new pointer with the appropriate length (less than 2^31 bytes).
  5. Return the RAWSXP.

I have a background in C, but I struggle to take full control of R's internals. I would like to access the C pointers inside SEXPs so I can do basic pointer arithmetic and create R vectors of known lengths from undecorated C pointers. The resources I found on R's C internals do not seem to explain how to wrap or unwrap pointers. Do we need Rcpp for this?

The following rough sketch gets at what I am trying to do.

library(inline)
sig <- c(
  x = "raw",         # Long raw vector with more than 2^31 - 1 bytes.
  start = "integer", # Should probably be R_xlen_t.
  bytes = "integer"  # <= 2^31 - 1. Ideally coercible to R_xlen_t.
)
body <- "
Rbyte* result;           // Just a reference. Want to avoid copying data.
result = RAW(x) + start; // Trying to do ordinary pointer arithmetic.
return asRaw(result);    // Want to return a raw vector of length `bytes`.
"
slice_raw <- cfunction(sig = sig, body = body)

EDIT: some more potential workarounds

Thanks to Dirk for spurring my thinking on this one. For small enough data, we can use fst to save a single-column data frame, where the column is the raw vector we actually care about. This use of fst is faster than writeBin():

library(fst)
wrapper <- data.frame(actual_data = raw(2^31 - 1))
system.time(write_fst(wrapper, tempfile()))
#>    user  system elapsed 
#>   0.362   0.019   0.103
system.time(writeBin(wrapper$actual_data, tempfile()))
#>    user  system elapsed 
#>   0.314   1.340   1.689

Created on 2019-06-16 by the reprex package (v0.3.0)

Unfortunately, it is difficult to create data frames with 2^31 or more rows. One hack is to convert the raw vector into a matrix first: each dimension stays below the 2^31 integer limit, and we are unlikely to hit that ceiling because a matrix with 2^31 - 1 rows and columns would hold several exabytes.

library(fst)
x <- raw(2^32)
m <- matrix(x, nrow = 2^16, ncol = 2^16)
system.time(write_fst(as.data.frame(m), tempfile()))
#>    user  system elapsed 
#>   8.776   1.459   9.519

Created on 2019-06-16 by the reprex package (v0.3.0)

We still leave saveRDS() in the dust, but we no longer beat writeBin(). The conversion from a matrix to a data frame is slow, and I am not sure it would scale well.

library(fst)
x <- raw(2^30)
m <- matrix(x, nrow = 2^15, ncol = 2^15)
system.time(write_fst(as.data.frame(m), tempfile()))
#>    user  system elapsed 
#>   1.998   0.408   2.409
system.time(writeBin(as.raw(m), tempfile()))
#>    user  system elapsed 
#>   0.329   0.839   1.397

Created on 2019-06-16 by the reprex package (v0.3.0)

If, as Dirk suggested, we can use an R_xlen_t to index the rows of a data frame, we might be able to avoid converting anything at all.
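
For reference, the shape of the wrapper I have been experimenting with is roughly the following. It avoids copying the raw vector, but as the comments below note, read_fst() reportedly segfaults on such an object, presumably because the dimensions are missing.

x <- raw(2^32)
wrapper <- list(actual_data = x)  # no deep copy of x
class(wrapper) <- "data.frame"    # row count never set, so downstream code may break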

landau
  • You never "need" Rcpp because it is "merely" a layer on top of the _same_ C API provided by R which any C or C++ routine would face. But Rcpp may make a few things easier. In particular, we have a number of `Rcpp::XPtr` examples. As for your issue, why not just use `mmap` to read segments of a memory-mapped file? No size constraints there. – Dirk Eddelbuettel Jun 16 '19 at 00:13
  • Easier, absolutely. I just noticed that `Rcpp::RawVector`s understand ordinary pointer arithmetic, which is already half the solution. For that route, all that remains is a way to control the termination of each slice so `writeBin()` does not try to write the entire tail of the original data. (For raw vectors, we cannot use the `size` argument to tell `writeBin()` to stop early). Would `Rcpp::XPtr`s help? I do not yet see the connection to smart pointers. I would rather not free (or allocate) any dynamic memory here, automatically or otherwise. – landau Jun 16 '19 at 12:17
  • Also, I had a look at the `mmap` R package. `mmap()` would be perfect if the data were already in storage. However, for `storr` and `drake`, I am looking for a way to store data that initially only exists in memory. In `mmap`, this happens through `as.mmap()`, which in turn calls `writeBin()` with no chunking. So it suffers from the same issues I am facing here. – landau Jun 16 '19 at 12:21
  • Have you looked at [fst](https://www.fstpackage.org/) which already does multithreading and optimal chunking? It is AFAIK the fastest way to/from disk. – Dirk Eddelbuettel Jun 16 '19 at 12:55
  • Yes, and I do have plans for it: https://github.com/richfitz/storr/issues/103. But because `storr` needs to work on arbitrary data structures, I still think there is value in raising the efficiency to the level of `writeBin()` for non-`data.frame`s. – landau Jun 16 '19 at 13:04
  • Wait... the solution was right in front of my nose: use `fst` to store a single-column data frame, where the column is the raw vector. Thanks for the nudge, Dirk! – landau Jun 16 '19 at 17:33
  • Much better. I was trying to think of a way to tell you not to look into optimising `writeBin`, as Mark had already done just that with `fst`. But I didn't quite connect the dots either.... – Dirk Eddelbuettel Jun 16 '19 at 17:35
  • Actually, I think I may have spoken too soon. `data.frame(x = raw(2^32))` fails due to integer overflow. There still may be a workaround, but I will need to do some more digging. The encouraging part is that `fst` is indeed faster than `writeBin()` for small enough data for both approaches to handle. – landau Jun 16 '19 at 17:40
  • Maybe do what R does and index with a double -- the difference between `R_len_t` and `R_xlen_t`. – Dirk Eddelbuettel Jun 16 '19 at 17:42
  • I like that idea. I am having trouble with `df <- list(raw_vector); class(df) <- "data.frame"` because then `read_fst()` segfaults, presumably because the dimensions are missing. Is there a way to assign an `R_xlen_t` number of rows post hoc? `dim<-` seems to reject doubles, and I am having trouble finding an equivalent function in `Rcpp`. – landau Jun 16 '19 at 19:20

2 Answers


Although data.frames with long vector columns are currently not supported very well, you can still use fst to serialize long raw vectors:

# method for writing a raw vector to disk
write_raw <- function(x, path, compress = 50) {

  # create a list and add required attributes
  y <- list(X = x)
  attributes(y) <- c(attributes(y), class = "data.frame")

  # serialize and compress to disk
  fst::write_fst(y, path, compress)
}

# create raw vector of length >2^31
x <- rep(as.raw(0:255), 2^23 + 10)

# write raw vector
write_raw(x, "raw_vector.fst", 100)

With this scheme there is no need to split the vector into multiple parts (which, as you already indicate, would slow down serialization significantly). The raw vector can be re-read without any copying or slicing:

# method for reading a raw vector from disk
read_raw <- function(path) {

  # read from disk
  z <- fst::read_fst(path)

  z$X
}

z <- read_raw("raw_vector.fst")

fst::hash_fst(x) == fst::hash_fst(z)
#> [1] TRUE TRUE

(Note that, at the moment, you need the development version of fst to read files with long vector support.)

In your setup, you will always be serializing the complete raw vector to disk as a whole (just like saveRDS()). Because you do not need random access to the stored vector, the metadata stored in the fst file is a bit of overkill. You might also test a setup where you compress the raw vector using compress_fst() and then store the result with saveRDS(raw_vec, compress = FALSE).

The advantage of such a setup is that the compressor can use bigger chunks, which increases the compression ratio (the effect can be significant). Larger chunks can also speed up compression.

On the other hand, the disadvantage is that you are no longer compressing during the write to disk as write_fst() does, which might slow down serialization. You also lose random access, but you do not really need that anyway.

If you implement a two-step process (first compressing the data and then serializing it), you can offer different compressors when the user opts for them (for example, slower compressors with very high compression ratios for slow disks).
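
A rough sketch of that two-step setup, assuming compress_fst() accepts the vector in question (whether it handles inputs longer than 2^31 - 1 bytes should be verified):

library(fst)

save_compressed <- function(x, path, compressor = "ZSTD", level = 50) {
  blob <- compress_fst(x, compressor = compressor, compression = level)
  saveRDS(blob, path, compress = FALSE)  # the blob is already compressed
}

read_compressed <- function(path) {
  decompress_fst(readRDS(path))
}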

MarkKlik

Had the same challenge. Here is a small Rcpp function to accomplish the task:

#include <Rcpp.h>

// [[Rcpp::export]]
Rcpp::RawVector raw_slice(
  const Rcpp::RawVector &x,
  const R_xlen_t offset,  // 1-based start of the slice
  const R_xlen_t size) {  // number of bytes to copy (< 2^31)

  // Allocate an uninitialized result and copy the requested window into it.
  Rcpp::RawVector result = Rcpp::no_init(size);
  memcpy(&result[0], &x[offset - 1], size);
  return result;
}
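
A possible usage sketch (the file name and slice boundaries are illustrative): save the snippet above as a .cpp file, compile it with Rcpp::sourceCpp(), and copy out a sub-2^31 window. Note that this copies the requested bytes with a single memcpy rather than returning a zero-copy view.

Rcpp::sourceCpp("raw_slice.cpp")    # file containing the function above

x <- rep(as.raw(0:255), 2^23 + 10)  # a little over 2^31 bytes
chunk <- raw_slice(x, 2^31, 1024)   # 1024 bytes starting at 1-based position 2^31
length(chunk)                       # should be 1024
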
Dmitriy Selivanov