
I'm trying to come up with a solution for compressing a few petabytes of data that will be stored in AWS S3. I was thinking of using gzip compression and was wondering whether compression could corrupt the data. I tried searching but was not able to find any specific instances where gzip compression actually corrupted data such that it was no longer recoverable.

I'm not sure if this is the correct forum for such a question, but do I need to verify that the data was compressed correctly? Also, any specific examples or data points would help.

user401445
  • No, gzip compression does not corrupt data. It is [lossless compression](https://en.wikipedia.org/wiki/Lossless_compression). – Jesper Oct 26 '17 at 20:29
  • Have you checked [this](https://superuser.com/questions/1068522/how-to-verify-whether-a-compressed-gz-is-corrupted-or-not) solution? –  Oct 26 '17 at 20:30
  • I suspect any filesystem where gzip would work can handle petabytes – A.Rashad Oct 26 '17 at 20:38
  • @Mysticate I have checked that solution, but it means writing the data to disk and running another CPU-intensive task. I would like to avoid that if possible. – user401445 Oct 26 '17 at 20:39
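A minimal sketch of the kind of check being discussed in these comments, assuming a hypothetical file name: the compressed stream is saved and, in the same pass, decompressed again and compared byte-for-byte against the original, so no uncompressed copy is ever written back to disk.

```sh
# Compress, save the .gz, and verify the round trip in one pipeline
# ("bigfile" is a placeholder name).
gzip -c bigfile | tee bigfile.gz | gzip -dc | cmp - bigfile \
  && echo "round-trip OK" || echo "MISMATCH"
```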

4 Answers


I would not recommend using gzip directly on a large block of data in one shot. Many times I have compressed entire drives using something similar to `dd if=/dev/sda conv=sync,noerror | gzip > /media/backup/sda.gz`, and the data was unusable when I tried to restore it. I have since reverted to not using compression.
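If a pipeline like that is used anyway, one way to catch this kind of failure early is to record a checksum of the raw stream while compressing, then verify the archive against it before it is ever needed for a restore. A sketch, using the same hypothetical device and backup paths:

```sh
# Checksum the raw stream while gzip compresses it
# (bash process substitution; device and paths are placeholders).
dd if=/dev/sda conv=sync,noerror bs=64K \
  | tee >(md5sum > /media/backup/sda.md5) \
  | gzip > /media/backup/sda.gz

# Before trusting the backup: decompress it and compare the digest with sda.md5.
gzip -dc /media/backup/sda.gz | md5sum
```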

  • I just ran into the same problem: I piped dd output into gzip and, upon decompressing the resulting archive `foo.gz` with gzip again, ended up with `gzip foo.gz: unexpected end of file`. Since the compression algorithm works flawlessly in webservers millions of times a day, while there are countless Google results for "gzip unexpected end of file", I suspect that there must be a bug somewhere in the gzip command line tool. … – balu Apr 08 '21 at 10:36
  • …Meanwhile the zlib library (used in most webservers; ["essentially the same algorithm as that in gzip"](https://zlib.net/) but [not quite](https://stackoverflow.com/questions/48412329/is-deflate-used-in-gzip-and-png-compression-the-same/48432456#48432456)) probably does not have that bug. – balu Apr 08 '21 at 10:36

gzip is used constantly all around the world and has earned a very strong reputation for reliability. But no software is perfect, nor is any hardware, nor is S3. Whether you need to verify the data ultimately depends on your needs, but I think a hard-disk failure is more likely than corruption introduced by gzip at this point.

Aaron Bentley
  • I'm doing the compression in memory and have an MD5 of the uncompressed data, so I'm not that worried about bit flips caused by a bad disk. Currently, I'm mostly concerned about any issues caused by the compression itself. – user401445 Oct 26 '17 at 20:38
  • user401445, gzip (deflate) is lossless, but it can be slow to compress. Check out more modern compression methods and parallel implementations (of both compression and decompression) such as zstd / pzstd. – osgx Oct 26 '17 at 20:56
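A minimal sketch of the MD5-based check user401445 describes above, assuming the data exists as hypothetical per-object files rather than in memory:

```sh
# Record a digest of the original and compress it ("chunk-0001" is a placeholder).
md5sum chunk-0001 > chunk-0001.md5
gzip -c chunk-0001 > chunk-0001.gz

# Verification: decompress and compare digests; no second uncompressed copy
# is written to disk.
[ "$(gzip -dc chunk-0001.gz | md5sum | cut -d' ' -f1)" = \
  "$(cut -d' ' -f1 chunk-0001.md5)" ] && echo OK || echo MISMATCH
```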

GZIP compression, like just about any other commonly used data compression algorithm, is lossless. That means that when you decompress the compressed data, you get back an exact copy of the original (and not something kinda sorta maybe like it, the way JPEG does for images or MP3 for audio).

As long as you use a well-known program (like, say, gzip) to do the compression, are running on reliable hardware, and don't have malware on your machine, the chances of compression introducing data corruption are basically nil.
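As a quick illustration of that lossless round trip, assuming a placeholder file name, the digest of the data after compressing and decompressing matches the digest of the original:

```sh
# The hash portion of both outputs should be identical
# ("original.dat" is a placeholder).
md5sum original.dat
gzip -c original.dat | gzip -dc | md5sum
```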

cHao

If you care about this data, then I would recommend compressing it and then comparing the decompressed result with the original before deleting the original. This checks for a whole set of possible problems, such as memory errors, mass-storage errors, CPU errors, and transmission errors, as well as the least likely of all of these, a gzip bug.

Something like `gzip -dc < petabytes.gz | cmp - petabytes` in Unix would be a way to do it without having to store the original data again.

Also, if the loss of some of the data would still leave much of the remaining data useful, I would break it up into pieces so that if one part is lost, the rest is recoverable. Any part of a gzip file requires all of what precedes it to be available and correct in order to decompress that part.
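A minimal sketch of that piecewise approach, assuming hypothetical names and a 1 GB piece size: each piece becomes an independent gzip file, so a damaged piece only affects itself.

```sh
# Split into fixed-size pieces and compress each one independently
# ("petabytes" and the 1G piece size are placeholders).
split -b 1G petabytes petabytes.part.
for p in petabytes.part.*; do
    gzip "$p"    # yields petabytes.part.aa.gz, petabytes.part.ab.gz, ...
done
```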

Mark Adler