
I'm interested in compressing many versions of a similar file. The files are PDFs with (often minor) differences between them.

My question is: Is the zip or gzip algorithm able to use the similarity between these files to improve compression? Or does it handle each file individually?

I've looked at http://www.infinitepartitions.com/art001.html (linked from "How does the GZip algorithm work?"), which goes over the algorithms themselves but doesn't answer whether the implementations handle each file individually or not.

Follow-up question: If not, are there file compression algorithms that can leverage the similarity between files to aid in compression?

  • A concept that is relevant here is [solid compression](https://en.wikipedia.org/wiki/Solid_compression). .zip files do not employ this as far as I know, so for .zip files, each file is compressed separately and the compression encoder has no knowledge of any of the other files. To answer your last question, the 7-Zip .7z compression format *does* allow solid compression. – Lasse V. Karlsen Jan 25 '22 at 21:18
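
A minimal sketch of that effect, using Python's zlib (the same DEFLATE algorithm zip and gzip use); the two byte strings are hypothetical stand-ins for a pair of similar files:

```python
import zlib

# Two hypothetical "files" that share most of their content.
v1 = b"header " + b"shared boilerplate content " * 200 + b"version 1"
v2 = b"header " + b"shared boilerplate content " * 200 + b"version 2"

# zip-style: each file is compressed on its own.
separate = len(zlib.compress(v1)) + len(zlib.compress(v2))

# Solid-style: one stream over the concatenation, so the encoder can
# refer back to the first copy of the shared content.
solid = len(zlib.compress(v1 + v2))

print(f"compressed separately:    {separate} bytes")
print(f"compressed as one stream: {solid} bytes")
```

The solid stream comes out far smaller, because the second copy of the shared content lies within DEFLATE's 32 KiB window and is encoded as back-references to the first.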

1 Answer


zip, no. Files are compressed independently of each other. gzip by itself will only compress one file. What you want is tar combined with gzip: tar puts the files adjacent to each other (with intervening headers), and gzip then compresses the whole thing as one stream, producing a .tar.gz file.
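
Here is a rough standard-library Python sketch of that approach, with a hypothetical pdfs/ directory standing in for your set of similar files; it builds both a .tar.gz (one compressed stream over all files) and a .zip (each member compressed independently) so the sizes can be compared:

```python
import os
import tarfile
import zipfile

names = sorted(os.listdir("pdfs"))  # hypothetical input directory

# tar + gzip: members are laid end to end, then gzipped as one stream.
with tarfile.open("versions.tar.gz", "w:gz") as tar:
    for name in names:
        tar.add(os.path.join("pdfs", name))

# zip: every member is deflated on its own, so no cross-file matches.
with zipfile.ZipFile("versions.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for name in names:
        zf.write(os.path.join("pdfs", name))

print(".tar.gz:", os.path.getsize("versions.tar.gz"), "bytes")
print(".zip:   ", os.path.getsize("versions.zip"), "bytes")
```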

That will be able to take advantage of similarities that are within about 32K of each other. If your files are much larger than 32K, then you should try xz instead of gzip, producing a .tar.xz file.
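
The xz variant is a one-line change in the sketch above, since tarfile's "w:xz" mode wraps the standard-library lzma module; LZMA's dictionary is megabytes by default, far beyond DEFLATE's 32 KiB window, so it can match redundancy between files that sit far apart in the archive:

```python
import os
import tarfile

# Same hypothetical pdfs/ directory; "w:xz" compresses the tar stream
# with LZMA, whose large dictionary spans many files at once.
with tarfile.open("versions.tar.xz", "w:xz") as tar:
    for name in sorted(os.listdir("pdfs")):
        tar.add(os.path.join("pdfs", name))

print(".tar.xz:", os.path.getsize("versions.tar.xz"), "bytes")
```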

Mark Adler