I have a bunch of large HDF5 files (around 1.7 GB each) which share a lot of their content – I'd guess that more than 95% of the data in each file is repeated in every other.
I would like to compress them into a single archive.
My first attempt using GNU `tar` with the `-z` option (gzip) failed: the process was terminated when the archive reached 50 GB (probably a file size limit imposed by the sysadmin). Apparently, gzip wasn't able to take advantage of the fact that the files are near-identical in this setting; presumably its deflate window (32 KiB) is far too small to find matches between files that are gigabytes apart.
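For reference, the attempt was essentially the following (archive and file names are just illustrative):

```
tar -czf archive.tar.gz *.h5
```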
Compressing these particular files obviously doesn't call for a very fancy compression algorithm, just a veeery patient one.
Is there a way to make `gzip` (or another tool) detect these large repeated blobs and avoid repeating them in the archive?
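For illustration, what I have in mind is something along these lines: a compressor whose match window can span whole files (here zstd's long-distance matching mode, purely as an example of the kind of invocation I'm after; I haven't verified that it actually copes with this data):

```
# Untested sketch: tar the files without compression, then pipe through zstd
# with long-distance matching enabled; --long=31 gives a 2 GiB window, enough
# to cover a whole 1.7 GB file (decompression then needs `zstd -d --long=31`).
tar -cf - *.h5 | zstd --long=31 -T0 -o archive.tar.zst
```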