
I work with a lot of files that run into the tens of gigabytes. To keep things manageable, I usually gzip them and decompress the data on the fly in my programs. However, checking the content requires knowing the length of the uncompressed file contents.

When the uncompressed data is smaller than 4GB, this is easily achieved by reading the last 4 bytes of the gzip file (the ISIZE field) and interpreting them as a little-endian integer length. However, this approach falls short when the uncompressed content is larger than 4GB, because ISIZE only stores the length modulo 2^32.
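For files under the 4GB limit, the last-four-bytes trick looks like this. A minimal Python sketch (the function name is mine); note it only works reliably for single-member gzip files whose uncompressed size fits in 32 bits:

```python
import struct

def gzip_isize(path):
    """Read the ISIZE field: the last 4 bytes of a gzip file,
    a little-endian uint32 holding the uncompressed length mod 2**32."""
    with open(path, "rb") as f:
        f.seek(-4, 2)  # seek to 4 bytes before end of file
        return struct.unpack("<I", f.read(4))[0]
```

For a multi-member gzip file this only reflects the last member, which is another way the trick can silently give a wrong answer.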

Of course, there is always the possibility of decompressing the data and counting the decompressed bytes, but this is very time-consuming. Is there a faster way?
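For reference, the decompress-and-count fallback can at least be done in a streaming fashion, so it never holds more than one chunk in memory. A sketch in Python (function name and chunk size are mine):

```python
import zlib

def uncompressed_length(path, chunk_size=1 << 20):
    """Stream-decompress a gzip file and count the output bytes
    without keeping the decompressed data in memory."""
    # wbits = MAX_WBITS | 16 tells zlib to expect a gzip header/trailer
    d = zlib.decompressobj(zlib.MAX_WBITS | 16)
    total = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            total += len(d.decompress(chunk))
    total += len(d.flush())
    return total
```

This still has to run the full DEFLATE decoder over every byte, which is exactly the cost the question is trying to avoid.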

Edit 1: Further research on the file format shows that the DEFLATE block format supports stored (uncompressed) blocks, which are prefixed with the block length (which equals the uncompressed length, because those blocks aren't compressed to begin with), so at least those blocks could be skipped. However, the compressed blocks don't appear to have a length field, so their length would need to be calculated from the compressed stream (in that case, the length-counting algorithm would need to recreate the Huffman coding and use the LZ77 length codes to sum up the lengths).
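To illustrate the stored-block layout mentioned above: each stored block begins byte-aligned with a 3-bit header (BFINAL, then a 2-bit BTYPE), the remaining bits of the header byte are discarded, and LEN plus its one's complement NLEN follow as little-endian 16-bit values. A minimal Python sketch (function name is mine) that sums the LEN fields of a raw DEFLATE stream containing only stored blocks; it deliberately bails out on compressed blocks, which is exactly the limitation described in the edit:

```python
def stored_blocks_length(raw_deflate):
    """Sum the LEN fields of a raw DEFLATE stream made up
    entirely of stored (BTYPE=0) blocks."""
    pos = 0
    total = 0
    while True:
        header = raw_deflate[pos]
        bfinal = header & 1          # bit 0: last block?
        btype = (header >> 1) & 3    # bits 1-2: block type
        if btype != 0:
            raise ValueError("compressed block: length requires full decoding")
        # Stored block: LEN (2 bytes LE), NLEN (2 bytes), then LEN payload bytes
        length = int.from_bytes(raw_deflate[pos + 1:pos + 3], "little")
        total += length
        pos += 5 + length  # header byte + LEN + NLEN + payload
        if bfinal:
            return total
```

Such a stream can be produced with zlib at compression level 0 and negative wbits (raw DEFLATE), which emits only stored blocks.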

llogiq
    possible duplicate of [Find the size of the file inside a GZIP file](http://stackoverflow.com/questions/9715046/find-the-size-of-the-file-inside-a-gzip-file) – Jongware Sep 14 '14 at 11:13
  • Yes and no. The question you link is limited to java. I'm still considering deleting it anyway, though the answers of the question you linked aren't what I seek. – llogiq Sep 14 '14 at 11:16
  • Mark Adler's answer contains all you need to know. But: since you compress the data yourself, you could (1) save the original size for each file in a separate document, and/or (2) calculate a typical compression rate for your data and use that to guesstimate the size. – Jongware Sep 14 '14 at 11:18

0 Answers