
Suppose we compress, for example, a .txt file that is 7 bytes in size. After compressing it into a .zip file, the size is 190 bytes.

Is there a way to estimate or compute the approximate size of “overhead”?

What factor affects the overhead size?

zlib documents its own overhead. They say: “... only expansion is an overhead of five bytes per 16 KB block (about 0.03%), plus a one-time overhead of six bytes for the entire stream.”

I only quote this to show that it is possible to estimate the "overhead" size.

Note: Overhead here means the extra data added to the compressed version of the data.

user3184352

1 Answer


From the ZIP format ..

Assuming that there is only one central directory and no comments and no extra fields, the overhead should be similar to the following. (The overhead will only go up if any additional metadata is added.)

  • Per file (Local file header) - 30+len(filename)
  • Per file (Data descriptor) - 12 (to 16)
  • Per file (Central directory header) - 46+len(filename)
  • Per archive (EOCD) - 22

So the overhead, where afn is the average length of all file names, and f is the number of files:

  f * ((30 + afn) + 12 + (46 + afn)) + 22
= f * (88 + 2 * afn) + 22

This of course makes ZIP a very poor choice for very tiny bits of compressed data where a (file) structure or metadata is not required - zlib, on the other hand, is a very thin Deflate wrapper.
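
To see the difference between the two containers in practice, here is a minimal sketch that measures the overhead directly with Python's standard `zipfile` and `zlib` modules. The 7-byte payload mirrors the question; the file name `a.txt` is a made-up example. Both are written uncompressed (stored) so that only the container overhead is measured. As far as I know, Python's `zipfile` does not write a data descriptor when the output is seekable, so the ZIP figure it reports may be 12 bytes lower than the estimate above.

```python
import io
import zipfile
import zlib

payload = b"1234567"   # 7-byte payload, as in the question
name = "a.txt"         # hypothetical file name, chosen for illustration

# Write a one-file ZIP archive into memory, storing the data uncompressed.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
    zf.writestr(name, payload)
zip_overhead = len(buf.getvalue()) - len(payload)

# Level 0 makes zlib emit a stored block: 2-byte header + 5-byte block
# header + 4-byte Adler-32 checksum around the raw data.
zlib_overhead = len(zlib.compress(payload, 0)) - len(payload)

print("ZIP overhead: ", zip_overhead, "bytes")
print("zlib overhead:", zlib_overhead, "bytes")
```

The zlib figure matches the quote in the question: five bytes for the single block plus six bytes for the stream. Other ZIP tools may add extra fields (timestamps, permissions), which is one reason the 190 bytes seen in the question can exceed the minimum.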

For small payloads, a poor Deflate implementation may also result in a significantly larger "compressed" size, such as the notorious .NET implementation ..


Examples:

  • Storing 1 file, with name "hello world note.txt" (len = 20),

    = 1 * (88 + 2 * 20) + 22 = 150 bytes overhead

  • Storing 100 files, with an average name of 14 letters,

    = 100 * (88 + 2 * 14) + 22 = 11622 bytes overhead
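
The same arithmetic can be sketched as a small Python function. The name `estimate_zip_overhead` and its parameters are made up for illustration; it uses the same assumptions as above (one central directory, no comments, no extra fields, a 12-byte data descriptor per file) and sums the actual name lengths instead of using an average.

```python
def estimate_zip_overhead(filenames, data_descriptor=12):
    """Estimate minimum ZIP overhead in bytes for the given file names.

    Assumes one central directory, no comments and no extra fields.
    """
    total = 22  # End of central directory record (EOCD), once per archive
    for name in filenames:
        n = len(name.encode("utf-8"))  # names are stored as bytes
        total += 30 + n                # local file header
        total += data_descriptor       # data descriptor (12, or 16 with signature)
        total += 46 + n                # central directory header
    return total


# One file named "hello world note.txt" (20 characters)
print(estimate_zip_overhead(["hello world note.txt"]))  # 150

# 100 files, each with a 14-character name such as "file000000.txt"
names = [f"file{i:06d}.txt" for i in range(100)]
print(estimate_zip_overhead(names))  # 11622
```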

user2864740
  • Thank you so much. Could you please explain more about each bullet parts? For example I don’t know how we could obtain (EOCD) size. And how you compute ‘afn’? Thanks again. – user3184352 Mar 12 '14 at 09:39
  • `afn = (len(filename1)+len(filename2)+len(filename3)+..)/number_of_files`. Going with the other assumptions (e.g. no appending and thus no duplicate CD entries) there is only one EOCD of 22 bytes. – user2864740 Mar 12 '14 at 09:41
  • you mean EOCD always is 22 for all files? "Data descriptor" and "Local file header" are always 16 and 30 respectively? – user3184352 Mar 12 '14 at 09:51
  • Data descriptor is a fixed size, EOCD (*without* comments) is a fixed size, but the EOCD (*and* Central directory entries) can be duplicated if the ZIP file is "appended" to. The Local file header and the Central directory header size both *depend* on the length of the filename (and are thus a variable size, even when *not* using extra comment/field features). – user2864740 Mar 12 '14 at 09:53
  • Thank you again. sorry if I ask many questions. what is "comment/field features". Is there a good article that explain overhead in detail. – user3184352 Mar 12 '14 at 09:58
  • @user3184352 The additional metadata can be set with the appropriate ZIP tool and can be pretty much whatever data (and whatever length up to 65k/entry) is desired. WinZIP says [this](http://kb.winzip.com/help/winzip/HELP_COMMENT.htm) about comments: "A comment is optional text information that is embedded in a Zip file. It can be viewed, created, edited, or deleted using the Comment window." – user2864740 Mar 12 '14 at 10:00