
Using the Linux command-line tool gzip, I can get the uncompressed size of a compressed file with gzip -l.

I couldn't find any function like that in the zlib manual section "gzip File Access Functions".

I found a solution at http://www.abeel.be/content/determine-uncompressed-size-gzip-file that involves reading the last 4 bytes of the file, but I am avoiding it for now because I would prefer to use the library's own functions.

André Puel
  • Note: I know there are similar questions, but none of them answers whether there is actually a zlib function for that. – André Puel Feb 09 '12 at 10:28

1 Answer


There is no reliable way to get the uncompressed size of a gzip file without decompressing, or at least decoding the whole thing. There are three reasons.

First, the only information about the uncompressed length is four bytes at the end of the gzip file (stored in little-endian order). By necessity, that is the length modulo 2^32. So if the uncompressed length is 4 GB or more, you won't know what the length is. You can only be certain that the uncompressed length is less than 4 GB if the compressed length is less than something like 2^32 / 1032 + 18, or around 4 MB. (1032 is the maximum compression factor of deflate.)
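To make the first point concrete, here is a minimal Python sketch (Python's standard zlib bindings stand in for the C library here) that reads the four-byte little-endian length field from the end of a gzip stream. The helper name trailer_isize and the sample data are made up for illustration, and the result is only trustworthy for a single-member file with no trailing junk and less than 4 GB of content.

```python
import gzip
import struct

def trailer_isize(gz_bytes: bytes) -> int:
    """Return the last four bytes of the data as a little-endian unsigned int."""
    return struct.unpack("<I", gz_bytes[-4:])[0]

data = b"hello world" * 1000           # 11,000 bytes of sample input
gz = gzip.compress(data)

# The trailer holds the uncompressed length modulo 2**32.
print(trailer_isize(gz), len(data))

# Compressed-size bound quoted above: below this, the length cannot have wrapped.
print(2**32 // 1032 + 18)
```

For a real file the same four bytes can be read by seeking to 4 bytes before the end instead of loading everything into memory.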

Second, and this is worse, a gzip file may actually be a concatenation of multiple gzip streams. Other than decoding, there is no way to find where each gzip stream ends in order to look at the four-byte uncompressed length of that piece. (Which may be wrong anyway due to the first reason.)
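A small Python sketch of the concatenation problem, using made-up sample data: two gzip streams are joined into one valid gzip file, and the trailing four bytes describe only the last stream, while decoding sees both.

```python
import gzip
import struct

first = gzip.compress(b"a" * 5000)
second = gzip.compress(b"b" * 70)
combined = first + second              # still a valid gzip file, with two members

# The four bytes at the end of the file belong to the final member only.
isize = struct.unpack("<I", combined[-4:])[0]

# Decoding walks through both members, so it sees the full length.
total = len(gzip.decompress(combined))

print(isize, total)
```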

Third, gzip files will sometimes have junk after the end of the gzip stream (usually zeros). Then the last four bytes are not the length.

So gzip -l doesn't really work anyway. As a result, there is no point in providing that function in zlib.

pigz does have an option that decodes the entire input in order to get the actual uncompressed length: pigz -lt, which guarantees the right answer. pigz -l does what gzip -l does, which may be wrong.
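The same idea, decoding everything and counting the output, can be sketched in Python with the standard zlib module. This is not how pigz implements it; the function name uncompressed_size and the rule for recognising trailing junk (anything that does not start with the gzip magic bytes 1f 8b) are assumptions made for this example, and a truncated file is not reported as an error.

```python
import gzip
import io
import zlib

def uncompressed_size(f, chunk_size=64 * 1024):
    """Count uncompressed bytes by decoding every gzip member read from f."""
    total = 0
    d = zlib.decompressobj(wbits=31)       # 31 = 16 + 15: expect a gzip wrapper
    buf = f.read(chunk_size)
    while buf:
        total += len(d.decompress(buf))
        if not d.eof:
            buf = f.read(chunk_size)       # member continues in the next chunk
            continue
        # This member is finished; what follows is another member or junk.
        buf = d.unused_data or f.read(chunk_size)
        if len(buf) < 2:
            buf += f.read(chunk_size)      # make sure both magic bytes are visible
        if not buf.startswith(b"\x1f\x8b"):
            break                          # trailing junk or end of file
        d = zlib.decompressobj(wbits=31)
    return total

# Two members followed by zero padding: the trailing four bytes would give 0.
sample = gzip.compress(b"a" * 5000) + gzip.compress(b"b" * 70) + b"\0" * 16
print(uncompressed_size(io.BytesIO(sample)))
```

With a file on disk, pass the result of open(path, "rb"). The cost is a full pass over the compressed data, which is the trade-off described above.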

Mark Adler
  • Do you know if bzip2 has the same limitations? Since I am using the total size to measure the progress of decompression, decompressing first is not an option. – André Puel Feb 09 '12 at 16:15
  • You can simply use consumption of the compressed data for your progress indicator, instead of the generation of uncompressed data. To first order, they are proportional, so you would see the same % indication. – Mark Adler Feb 09 '12 at 16:59
  • What do you mean by "decompressing, or at least decoding"? What's the difference between "decompressing" and "decoding"? – allyourcode Aug 27 '12 at 22:53
  • You can decode the Huffman codes and count how many bytes would be generated, without actually generating them. That would be faster than complete decompression, which generates the decompressed bytes. – Mark Adler Aug 28 '12 at 01:44
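The progress suggestion from the comments (report how much compressed input has been consumed rather than how much output has been produced) might look like the following Python sketch. The function decompress_with_progress and its parameters are hypothetical names for this example, and it assumes a single-member gzip stream whose compressed size is known up front, for instance from the file size.

```python
import gzip
import io
import zlib

def decompress_with_progress(src, compressed_size, sink, report,
                             chunk_size=64 * 1024):
    """Decompress src into sink, reporting the fraction of input consumed."""
    d = zlib.decompressobj(wbits=31)       # 31 = 16 + 15: expect a gzip wrapper
    consumed = 0
    while chunk := src.read(chunk_size):
        sink.write(d.decompress(chunk))
        consumed += len(chunk)
        report(consumed / compressed_size) # progress measured on the input side
    sink.write(d.flush())

payload = bytes(range(256)) * 400          # sample data for the demonstration
gz = gzip.compress(payload)
out = io.BytesIO()
fractions = []
decompress_with_progress(io.BytesIO(gz), len(gz), out, fractions.append,
                         chunk_size=64)
print(f"{len(fractions)} updates, final fraction {fractions[-1]}")
```

Because the compressed size is always known, this sidesteps the need for the uncompressed length entirely; the indicator is only approximately linear in output produced.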