
I have a zipped binary file under Windows that I am trying to read with R. So far this works using the unz() function in combination with the readBin() function.

> bin.con <- unz(zip_path, file_in_zip, open = 'rb')
> readBin(bin.con,
          "double", 
          n = byte_chunk, 
          size = 8L, 
          endian = "little")
> close(bin.con)

where zip_path is the path to the zip file, file_in_zip is the name of the file inside the archive that is to be read, and byte_chunk the number of double values that I want to read (readBin's n argument counts records, not bytes).

In my use case, the readBin operation is part of a loop and gradually reads the whole binary file. However, I rarely want to read everything and often I know precisely which parts I want to read. Unfortunately, readBin doesn't have a start/skip argument to skip the first n bytes. Therefore I tried to conditionally replace readBin() with seek() in order to skip the actual reading of the unwanted parts.

When I try this, I get an error:

> bin.con <- unz(zip_path, file_in_zip, open = 'rb')
> seek(bin.con, where = bytes_to_skip, origin = 'current')
Error in seek.connection(bin.con, where = bytes_to_skip, origin = "current") : 
  seek not enabled for this connection
> close(bin.con)

So far, I haven't found a way around this error. Similar questions have been asked before, unfortunately without a satisfactory answer.

Tips all over the internet suggest adding the open = 'r' argument to unz() or dropping the open argument altogether, but that only works for text files (since 'r' is the text mode). People also suggest unzipping the files first, but since the files are quite big, this is practically impossible.
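The only generic workaround I have so far is to "seek" by reading and discarding raw bytes, which works on any connection but costs as much as a full read. A minimal sketch, reusing the variable names from the snippets above:

```r
# Workaround sketch: emulate seek() on a non-seekable unz() connection by
# reading the unwanted bytes into the void. Works, but costs a full read.
# (For very large offsets, discard in fixed-size chunks to bound memory.)
bin.con <- unz(zip_path, file_in_zip, open = "rb")
invisible(readBin(bin.con, "raw", n = bytes_to_skip))  # "seek" by discarding
wanted <- readBin(bin.con, "double", n = byte_chunk, size = 8L,
                  endian = "little")
close(bin.con)
```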

Is there any work-around to seek in a binary zipped file or read with a byte offset (potentially using C++ via the Rcpp package)?

Update:

Further research seems to indicate that seek() in zip files is not an easy problem. This question suggests a C++ library that can at best perform a coarse seek. This Python question indicates that an exact seek is impossible because of the way zip is implemented (although it doesn't contradict the coarse-seek method).

  • in the documentation for `seek`, it says that use of seek on Windows is discouraged, so be warned. just a curious question: how is this file created? do you have control over how it is created? – chinsoon12 Feb 02 '17 at 09:13
  • Are you willing to consider other languages? This seems like a problem for languages like C/C++/Java. see this http://www.phillipciske.com/blog/index.cfm/2008/10/2/Reading-Binary-Files-in-a-Zip-File-Before-CF8 – chinsoon12 Feb 02 '17 at 09:19
  • @chinsoon12, the origin of that error is dubious as mentioned here: http://stackoverflow.com/questions/32736845/is-seek-reliable-on-modern-windows/32737017 The answer on your second question is negative. I don't create the file since it is created by a third party tool. – takje Feb 02 '17 at 10:21
  • @chinsoon12 In truth, I don't expect to find an R answer. I was hoping for a C++ answer potentially since I can add that in a package using Rcpp (but I have no previous experience in using c++). – takje Feb 02 '17 at 10:23
  • Meanwhile, I did some further research into the more general problem of random-access in zips, but it is not very reassuring. This question claims that at best you can use a coarse method to achieve random access: http://stackoverflow.com/questions/429987/compression-formats-with-good-support-for-random-access-within-archives – takje Feb 02 '17 at 10:27
  • This Python thread also seems to suggest that it is not possible to seek in binary zip files: http://stackoverflow.com/questions/12821961/seek-a-file-within-a-zip-file-in-python-without-passing-it-to-memory – takje Feb 02 '17 at 10:28
  • You mention you do readBin in a loop that eventually reads the whole file. For a single targeted read, could you not split that into two readBins: one to "seek" by reading all bytes up to your starting point; then another to read what you're after? (Understood this isn't ideal, especially if the files are insanely large). – johnjps111 Feb 03 '17 at 04:03
  • @JohnP.Schneider That's what I'm currently doing but since I am still reading the binaries, it's still as slow as a normal read. That's why I was looking for a real seek method. – takje Feb 03 '17 at 08:25

1 Answer


Here's a bit of a hack that might work for you. Here's a fake binary file:

writeBin(as.raw(1:255), "file.bin")
readBin("file.bin", raw(1), n = 16)
#  [1] 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f 10

And here's the produced zip file:

zip("file.zip", "file.bin")
#   adding: file.bin (stored 0%)
readBin("file.zip", raw(1), n = 16)
#  [1] 50 4b 03 04 0a 00 02 00 00 00 7b ab 45 4a 87 1f

This uses a temporary intermediate binary file.

system('sh -c "unzip -p file.zip file.bin | dd of=tempfile.bin bs=1c skip=5c count=4c"')
# 4+0 records in
# 4+0 records out
# 4 bytes copied, 0.00044964 s, 8.9 kB/s
file.info("tempfile.bin")$size
# [1] 4
readBin("tempfile.bin", raw(1), n = 16)
# [1] 06 07 08 09

This method offsets the "expense" of dealing with the size of the stored binary data to the shell/pipe, out of R.
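Since the question ultimately wants doubles, the extracted bytes can be decoded the same way as in the question. A hedged sketch, assuming the stored data really is little-endian IEEE doubles (so skip and count should be multiples of 8):

```r
# Hypothetical: grab the 8 bytes at offset 16 of the stored file and decode
# them as one little-endian double (skip/count in multiples of 8 bytes).
system('sh -c "unzip -p file.zip file.bin | dd of=tempfile.bin bs=1c skip=16c count=8c"')
readBin("tempfile.bin", "double", n = 1, size = 8L, endian = "little")
```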

This worked on win10, R-3.3.2. I'm using dd from Git for Windows (version 2.11.0.3, though 2.11.1 is available), and unzip and sh from RTools.

Sys.which(c("dd", "unzip", "sh"))
#                                    dd 
# "C:\\PROGRA~1\\Git\\usr\\bin\\dd.exe" 
#                                 unzip 
#          "c:\\Rtools\\bin\\unzip.exe" 
#                                    sh 
#             "c:\\Rtools\\bin\\sh.exe" 
  • 1
    Very elegant solution. I did some tests and it seems that this solution doesn't keep the entire unzipped file in the memory. It does take some CPU time to unzip until the offset, but I guess there is really no way around that. One further improvement would be to stop the unzipping as soon as the end of the offset + count is reached. Would you have any idea how to do this? – takje Feb 06 '17 at 10:08
  • No, that's part of the problem: I think the finest resolution you have with `unzip` is "per-file". – r2evans Feb 06 '17 at 14:47
  • Are you forced to compress the doors with `zip`, or are you allowed to recompress with a different protocol/tool? – r2evans Feb 06 '17 at 14:48
  • The zip files come from an external source, so I don't have influence on how they're created. I could recompress them, but since I only need a tiny fraction of the binaries, that might be a loss of resources. – takje Feb 06 '17 at 14:53
  • (Gotta love auto-type ... "compress the doors"? Bitten once again ...) – r2evans Feb 08 '17 at 01:15