
I have a zipped binary file under Windows that I am trying to read with R. So far this works using the unz() function in combination with the readBin() function.

> bin.con <- unz(zip_path, file_in_zip, open = 'rb')
> readBin(bin.con,
          "double", 
          n = byte_chunk, 
          size = 8L, 
          endian = "little")
> close(bin.con)

where zip_path is the path to the zip file, file_in_zip is the name of the file inside the archive that is to be read, and byte_chunk the number of double values that I want to read (readBin's n argument counts records, not bytes).

In my use case, the readBin operation is part of a loop and gradually reads the whole binary file. However, I rarely want to read everything and often I know precisely which parts I want to read. Unfortunately, readBin doesn't have a start/skip argument to skip the first n bytes. Therefore I tried to conditionally replace readBin() with seek() in order to skip the actual reading of the unwanted parts.

When I try this, I get an error:

> bin.con <- unz(zip_path, file_in_zip, open = 'rb')
> seek(bin.con, where = bytes_to_skip, origin = 'current')
Error in seek.connection(bin.con, where = bytes_to_skip, origin = "current") : 
  seek not enabled for this connection
> close(bin.con)

So far, I haven't found a way around this error. Similar questions have been asked before, unfortunately without a satisfactory answer.

Tips all over the internet suggest adding the open = 'r' argument to unz() or dropping the open argument altogether, but that only works for text files (since 'r' is the text mode). People also suggest unzipping the files first, but since the files are quite big, this is practically impossible.
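The only generic workaround I have so far is to "seek" by reading and discarding raw bytes, which works on any connection but costs as much as a full read. A minimal sketch, reusing the variable names from the snippets above:

```r
# Workaround sketch: emulate seek() on a non-seekable unz() connection by
# reading the unwanted bytes into the void. Works, but costs a full read.
# (For very large offsets, discard in fixed-size chunks to bound memory.)
bin.con <- unz(zip_path, file_in_zip, open = "rb")
invisible(readBin(bin.con, "raw", n = bytes_to_skip))  # "seek" by discarding
wanted <- readBin(bin.con, "double", n = byte_chunk, size = 8L,
                  endian = "little")
close(bin.con)
```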

Is there any work-around to seek in a binary zipped file or read with a byte offset (potentially using C++ via the Rcpp package)?

Update:

Further research seems to indicate that seek() in zip files is not an easy problem. This question suggests a C++ library that can at best perform a coarse seek. This Python question indicates that an exact seek is impossible because of the way zip is implemented (although it doesn't contradict the coarse-seek method).

  • in the documentation for `seek`, it says that use of seek on Windows is discouraged, so be warned. just a curious question: how is this file created? do you have control over how it is created? – chinsoon12 Feb 02 '17 at 09:13
  • Are you willing to consider other languages? This seems like a problem for languages like C/C++/Java. see this http://www.phillipciske.com/blog/index.cfm/2008/10/2/Reading-Binary-Files-in-a-Zip-File-Before-CF8 – chinsoon12 Feb 02 '17 at 09:19
  • @chinsoon12, the origin of that error is dubious as mentioned here: http://stackoverflow.com/questions/32736845/is-seek-reliable-on-modern-windows/32737017 The answer on your second question is negative. I don't create the file since it is created by a third party tool. – takje Feb 02 '17 at 10:21
  • @chinsoon12 In truth, I don't expect to find an R answer. I was hoping for a C++ answer potentially since I can add that in a package using Rcpp (but I have no previous experience in using c++). – takje Feb 02 '17 at 10:23
  • Meanwhile, I did some further research into the more general problem of random-access in zips, but it is not very reassuring. This question claims that at best you can use a coarse method to achieve random access: http://stackoverflow.com/questions/429987/compression-formats-with-good-support-for-random-access-within-archives – takje Feb 02 '17 at 10:27
  • This Python thread also seems to suggest that it is not possible to seek in binary zip files: http://stackoverflow.com/questions/12821961/seek-a-file-within-a-zip-file-in-python-without-passing-it-to-memory – takje Feb 02 '17 at 10:28
  • You mention you do readBin in a loop that eventually reads the whole file. For a single targeted read, could you not split that into two readBins: one to "seek" by reading all bytes up to your starting point; then another to read what you're after? (Understood this isn't ideal, especially if the files are insanely large). – johnjps111 Feb 03 '17 at 04:03
  • @JohnP.Schneider That's what I'm currently doing but since I am still reading the binaries, it's still as slow as a normal read. That's why I was looking for a real seek method. – takje Feb 03 '17 at 08:25

1 Answer


Here's a bit of a hack that might work for you. Here's a fake binary file:

writeBin(as.raw(1:255), "file.bin")
readBin("file.bin", raw(1), n = 16)
#  [1] 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f 10

And here's the produced zip file:

zip("file.zip", "file.bin")
#   adding: file.bin (stored 0%)
readBin("file.zip", raw(1), n = 16)
#  [1] 50 4b 03 04 0a 00 02 00 00 00 7b ab 45 4a 87 1f

This uses a temporary intermediate binary file.

system('sh -c "unzip -p file.zip file.bin | dd of=tempfile.bin bs=1c skip=5c count=4c"')
# 4+0 records in
# 4+0 records out
# 4 bytes copied, 0.00044964 s, 8.9 kB/s
file.info("tempfile.bin")$size
# [1] 4
readBin("tempfile.bin", raw(1), n = 16)
# [1] 06 07 08 09

This method offsets the "expense" of dealing with the size of the stored binary data to the shell/pipe, out of R.
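Since the question ultimately wants doubles, the extracted bytes can be decoded the same way as in the question. A hedged sketch, assuming the stored data really is little-endian IEEE doubles (so skip and count should be multiples of 8):

```r
# Hypothetical: grab the 8 bytes at offset 16 of the stored file and decode
# them as one little-endian double (skip/count in multiples of 8 bytes).
system('sh -c "unzip -p file.zip file.bin | dd of=tempfile.bin bs=1c skip=16c count=8c"')
readBin("tempfile.bin", "double", n = 1, size = 8L, endian = "little")
```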

This worked on win10, R-3.3.2. I'm using dd from Git for Windows (version 2.11.0.3, though 2.11.1 is available), and unzip and sh from RTools.

Sys.which(c("dd", "unzip", "sh"))
#                                    dd 
# "C:\\PROGRA~1\\Git\\usr\\bin\\dd.exe" 
#                                 unzip 
#          "c:\\Rtools\\bin\\unzip.exe" 
#                                    sh 
#             "c:\\Rtools\\bin\\sh.exe" 
  • 1
    Very elegant solution. I did some tests and it seems that this solution doesn't keep the entire unzipped file in the memory. It does take some CPU time to unzip until the offset, but I guess there is really no way around that. One further improvement would be to stop the unzipping as soon as the end of the offset + count is reached. Would you have any idea how to do this? – takje Feb 06 '17 at 10:08
  • No, that's part of the problem: I think the finest resolution you have with `unzip` is "per-file". – r2evans Feb 06 '17 at 14:47
  • Are you forced to compress the doors with `zip`, or are you allowed to recompress with a different protocol/tool? – r2evans Feb 06 '17 at 14:48
  • The zip files come from an external source, so I don't have influence on how they're created. I could recompress them, but since I only need a tiny fraction of the binaries, that might be a loss of resources. – takje Feb 06 '17 at 14:53
  • (Gotta love auto-type ... "compress the doors"? Bitten once again ...) – r2evans Feb 08 '17 at 01:15