I work with a lot of files in the tens of gigabytes. To keep things manageable, I usually gzip them and decompress the data on the fly in my programs. However, checking the content requires knowing the uncompressed length up front.
When files are smaller than 4GB, this is easily achieved by reading the last 4 bytes of the gzip file and interpreting them as a little-endian 32-bit integer (the ISIZE field). However, this approach falls short when the uncompressed content is larger than 4GB, because ISIZE only stores the length modulo 2^32.
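For reference, this is roughly what I do for the small-file case (a minimal Python sketch; the function name is mine):

```python
import struct

def gzip_isize(path):
    """Read the gzip ISIZE field: the last 4 bytes of the file,
    a little-endian uint32 holding the uncompressed length mod 2**32.
    Only correct for single-member gzip files under 4GB uncompressed."""
    with open(path, "rb") as f:
        f.seek(-4, 2)  # seek to 4 bytes before end-of-file
        return struct.unpack("<I", f.read(4))[0]
```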
Of course, there is always the possibility of decompressing the whole stream and counting the output bytes, but this is very time-consuming. Is there a faster way?
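The slow-but-exact fallback I mean looks like this (Python sketch using `zlib.decompressobj`; assumes a single-member gzip file):

```python
import zlib

def uncompressed_length(path, chunk_size=1 << 20):
    """Stream-decompress a gzip file and count the output bytes.
    Exact for any size, but has to inflate the entire file."""
    # wbits = 32 + 15 tells zlib to auto-detect the gzip header
    d = zlib.decompressobj(wbits=32 + 15)
    total = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            total += len(d.decompress(chunk))
    return total
```

Streaming in chunks keeps memory flat, but the runtime is still proportional to the full decompression, which is exactly the cost I want to avoid.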
Edit 1: Further research on the file format shows that DEFLATE supports uncompressed (stored) blocks, which are prefixed with their length (in that case also the uncompressed length, because the block isn't compressed to begin with), so at least those blocks could be skipped. However, the compressed blocks don't have a length field, so their size would need to be computed from the compressed stream itself (the length-counting algorithm would need to recreate the Huffman coding and sum up the literals and the LZ77 length codes).
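To illustrate the stored-block layout, here is a hypothetical helper that peeks at the first block header of a raw DEFLATE stream (only the first block, since its header starts byte-aligned; a general skipper would need a bit reader that tracks the position after each block):

```python
def first_deflate_block_info(raw):
    """Inspect the first DEFLATE block header in a raw deflate stream.
    Returns (bfinal, btype, stored_len), where stored_len is the LEN
    field for stored blocks (btype 0) and None otherwise."""
    first = raw[0]
    bfinal = first & 1             # bit 0: last-block flag
    btype = (first >> 1) & 0b11    # bits 1-2: 0=stored, 1=fixed, 2=dynamic
    if btype == 0:
        # Stored block: the stream skips to the next byte boundary,
        # then LEN (16-bit little-endian) and NLEN (its one's complement)
        length = raw[1] | (raw[2] << 8)
        return bfinal, btype, length
    return bfinal, btype, None
```

Note that LEN is only 16 bits, so even stored data is chopped into blocks of at most 65535 bytes; skipping them still means walking block to block through the whole file.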