11

It's well known that GZIP or DEFLATE (or any compression mechanism) can increase file size sometimes. Is there a maximum (either percentage or constant) that a file can be increased? What is it?

If a file is X bytes, and I'm going to gzip it, and I need to budget for file space in advance - what's the worst case scenario?

UPDATE: There are two overheads: GZIP adds a header, typically 18 bytes but essentially arbitrarily long. What about DEFLATE? That can expand content by a multiplicative factor, which I don't know. Does anyone know what it is?

Thomas Tempelmann
  • 11,045
  • 8
  • 74
  • 149
SRobertJames
  • 8,210
  • 14
  • 60
  • 107
  • 1
    I guess that would be encoding each byte as a literal. Probably 2x or so. You can prefix the compressed stream with a bool indicating whether it is actually gzipped or not. That allows you to bound the maximum space to one additional byte. – usr May 09 '14 at 18:21
  • About the 18 extra bytes: Mark explains that here (http://stackoverflow.com/a/38148423/43615). 10 bytes are the shortest gzip header (no file name) and 8 bytes are a constant trailer for a checksum and the lower 4 bytes of the original file length. – Thomas Tempelmann Jul 02 '16 at 08:24

2 Answers2

11

gzip will add a header and trailer of at least 18 bytes. The header can also contain a path name, which will add that many bytes plus a trailing zero.

The deflate implementation in gzip has the option to store 16383 bytes per block, with an overhead of five bytes. It will always choose to do so if the alternative would take more bytes. So the maximum number of compressed bytes for n input bytes is:

n+5(floor(n/16383)+1)

Mark Adler
  • 101,978
  • 13
  • 118
  • 158
  • Using your formula, I often have `deflate()` return with avail_out==0 on already-compressed data (e.g. video). That should not happen, right? Before the call, avail_in is 10485760, and avail_out is 10488965, i.e. 3205 more. Using zlib v1.2.5 (default on OSX 10.10.5), compression level 9, strategy 0, wbits -15. – Thomas Tempelmann Jul 02 '16 at 09:07
  • Nevermind, I've figured it out: If the output is the exact maximum as given in Mark's formula, then avail_out==0 will end up as zero, but it still means that it was successful and there's no need to loop back and feed another buffer. However, to safely tell this state, it's smart to add one more byte to the output buffer, so that deflate, when generating the max output size, will still leave one byte left in the buffer - with that, a check on avail_out==0 will never hit, and if it would, it's a clear indication something went wrong. – Thomas Tempelmann Jul 02 '16 at 09:22
0

Compressed files always have a header indicating how to decompress them.

The size of that header represents the worst case overhead when compressing a file that cannot be compressed (because there is no order/pattern to the data; it is random).

The header varies based on the specific algorithm, and may contain variable-length information as well such as a list of files in the archive.

GZip has at least 18 bytes of overhead (header + CRC-32 in the footer), and may contain optionally a list of files in the archive.

http://en.wikipedia.org/wiki/Gzip#File_format

Note that in special situations, custom compression algorithms can reduce or eliminate the header overhead. For example, I have used a custom compression dictionary known by the compressing and decompressing software to compress short texts, so that a header was not needed. That was a rather rare use case, and probably not useful in most situations (given that storage and bandwidth are relatively cheap).

Eric J.
  • 147,927
  • 63
  • 340
  • 553
  • 1
    The blocks in GZIP can be longer as well - so, besides the +18, there's also some scale factor. Not sure what it is, though. – SRobertJames May 09 '14 at 18:58
  • That may be one of the optional headers mentioned in the Wikipedia article. Not every GZIP implementation would be obliged to include it (unless the article is wrong). – Eric J. May 09 '14 at 19:06