
I have a large raw vector in R (i.e., an array of binary data) that I want to write to disk, but I'm getting an error telling me the vector is too large. Here's a reproducible example and the error I get:

> writeBin(raw(1024 * 1024 * 1024 * 2), "test.bin")

Error in writeBin(raw(1024 * 1024 * 1024 * 2), "test.bin") : 
  long vectors not supported yet: connections.c:4147

I've noticed that this is linked to the 2 GB file limit. If I try to write a single byte less (1024 * 1024 * 1024 * 2 - 1), it works just fine.
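For reference, that boundary case (the same call with one byte less) looks like this:

> writeBin(raw(1024 * 1024 * 1024 * 2 - 1), "test.bin") # one byte under the limit; no error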

I was thinking about a workaround where I write the large vector to disk in chunks, appending each chunk of binary data to the file, like this:

large_file = raw(1024 * 1024 * 1024 * 2) 
chunk_size = 1024*1024*512
n_chunks = ceiling(length(large_file)/chunk_size)

for (i in 1:n_chunks)
{
  start_byte = ((i - 1) * chunk_size) + 1
  end_byte = start_byte + chunk_size - 1
  if (i == n_chunks)
    end_byte = length(large_file)
  this_chunk = large_file[start_byte:end_byte]
  appendBin(this_chunk, "test.bin") # <-- non-existing magical formula!
}

But I can't find any function like the "appendBin" I made up above, or any documentation in R on how to append data straight to a file on disk.

So my question boils down to this: does anyone know how to append raw (binary) data to a file already on disk, without having to read the full file into memory first?

Extra details: I'm currently using R version 3.4.2 (64-bit) on a Windows 10 PC with 192 GB of RAM. I tried on another PC (R version 3.5, 64-bit, Windows 8 with 8 GB of RAM) and had the exact same problem.

Any kind of insight or workaround would be greatly appreciated!!!

Thank you!

  • This has nothing to do with a file system 2 GB limit; it has to do with the limit of many functions within R to work on vectors of at most 2 GB of elements. (Longer vectors are called `long vectors`, but apparently `writeBin` doesn't support them.) @MichaelChirico's suggestion to use a connection opened in append mode is good, but do use mode "ab", so the data isn't interpreted as text data, with line endings messed up (see the sketch after these comments). – user2554330 May 11 '18 at 17:28
  • Thanks guys! Both your comments helped a lot! I'll post the final solution below. Thanks again! – Felipe D. May 11 '18 at 17:48
  • Maybe you can work with a connection opened in mode "a"? http://astrostatistics.psu.edu/su07/R/html/base/html/connections.html – MichaelChirico Apr 18 '19 at 02:41
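For illustration, here is a minimal sketch of what the comments suggest (not from the original post; the connection name is illustrative): open a single connection in binary append mode "ab" and pass it to writeBin() for each chunk, so the data never goes through a text-mode connection.

# Sketch of the comments' suggestion: one connection opened in binary
# append mode "ab", reused for every chunk
con = file("test.bin", open = "ab")
writeBin(raw(1024 * 1024 * 512), con)  # first 512 MB chunk
writeBin(raw(1024 * 1024 * 512), con)  # next chunk is appended after it
close(con)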

1 Answer


Thanks to @MichaelChirico and @user2554330, I was able to figure out a workaround. Essentially, I just need to open the file as a new connection in "a+b" mode and feed that file connection into the writeBin function.

Here's a copy of the working code.

large_file = raw(1024 * 1024 * 1024 * 3)  # 3 GB of zeroed bytes
chunk_size = 1024 * 1024 * 512            # write in 512 MB pieces
n_chunks = ceiling(length(large_file) / chunk_size)

# Start from a clean file so the appended chunks are its only contents
if (file.exists("test.bin"))
  file.remove("test.bin")

for (i in 1:n_chunks)
{
  # Byte range covered by this chunk (the last chunk may be shorter)
  start_byte = ((i - 1) * chunk_size) + 1
  end_byte = start_byte + chunk_size - 1
  if (i == n_chunks)
    end_byte = length(large_file)
  this_chunk = large_file[start_byte:end_byte]

  # Open a connection in binary append mode, write the chunk, then close it
  output_file = file(description = "test.bin", open = "a+b")
  writeBin(this_chunk, output_file)
  close(output_file)
}
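As a quick sanity check (not part of the original answer), the number of bytes on disk can be compared against the length of the vector:

# Should be TRUE if every chunk was written (assumes the code above has run)
file.info("test.bin")$size == length(large_file)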

I know it's ugly that I'm opening and closing the file multiple times, but that kept the error from popping up with even bigger files.

Thanks again for the insights, guys! =)

Felipe D.
  • Just a suggestion: I think it would be cleaner to use a while loop here; no need to calculate the number of chunks up front, just go until end_byte > length(large_file) (see the sketch after these comments). – MichaelChirico May 12 '18 at 01:12
  • Oh wow! Thanks! I thought that if I did that, it would force me to write to disk for every single byte, making the code run slower than if I did it in chunks. I'll test it out and see what happens. Thanks for the tip! – Felipe D. May 18 '18 at 18:42
  • I am trying to borrow this answer for github.com/richfitz/storr/issues/107 (related: github.com/richfitz/storr/issues/107 and github.com/richfitz/storr/issues/103#issuecomment-502097274). Slicing (large_file[start_byte:end_byte]) is a bottleneck. Benchmarks forthcoming. – landau Jun 15 '19 at 00:26
  • Benchmarks: https://github.com/richfitz/storr/issues/107#issuecomment-502318772. For now, I think we need to find a good chunk size, which I suspect will be smaller than 2^20. – landau Jun 15 '19 at 00:40
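For completeness, here is a sketch of the while-loop variant MichaelChirico suggests in the first comment above (untested here; it reuses large_file and chunk_size from the answer and, like the answer, assumes test.bin has been removed beforehand):

# While-loop version: no n_chunks calculation, just advance start_byte
# until it moves past the end of the vector
start_byte = 1
while (start_byte <= length(large_file))
{
  end_byte = min(start_byte + chunk_size - 1, length(large_file))
  this_chunk = large_file[start_byte:end_byte]
  output_file = file(description = "test.bin", open = "a+b")
  writeBin(this_chunk, output_file)
  close(output_file)
  start_byte = end_byte + 1
}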