I have a bunch of large HDF5 files (around 1.7 GB each) which share a lot of their content – I'd guess that more than 95% of the data in each file is repeated in every other.
I would like to compress them into a single archive.
My first attempt using GNU `tar` with the `-z` option (gzip) failed: the process was terminated when the archive reached 50 GB (probably a file size limit imposed by the sysadmin). Apparently, gzip wasn't able to take advantage of the fact that the files are near-identical in this setting; presumably its deflate window (32 KiB) is far too small to find matches between files that are gigabytes apart.
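For reference, the attempt was essentially the following (archive and file names are just illustrative):

```
tar -czf archive.tar.gz *.h5
```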
Compressing these particular files obviously doesn't call for a very fancy compression algorithm, just a veeery patient one.
Is there a way to make `gzip` (or another tool) detect these large repeated blobs and avoid repeating them in the archive?
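For illustration, what I have in mind is something along these lines: a compressor whose match window can span whole files (here zstd's long-distance matching mode, purely as an example of the kind of invocation I'm after; I haven't verified that it actually copes with this data):

```
# Untested sketch: tar the files without compression, then pipe through zstd
# with long-distance matching enabled; --long=31 gives a 2 GiB window, enough
# to cover a whole 1.7 GB file (decompression then needs `zstd -d --long=31`).
tar -cf - *.h5 | zstd --long=31 -T0 -o archive.tar.zst
```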