
I have gzipped data stored in a DB. Is there a way to concatenate, say, 50 separate pieces of gzipped data into one gzipped output that can be decompressed? The result should be the same as decompressing those 50 items, concatenating them, and then gzipping them.

I would like to avoid the decompression phase. Is there also a performance benefit to merging already-gzipped data instead of gzipping the whole byte array?

Marka

3 Answers

2

I would assume that merely concatenating files in a zipped format would prove disastrous, as the zipping algorithm has been run on the specific content of each file. I think you would have to manually unzip them all, concatenate, then zip again.

Nathan White
  • Wrong. You don't have to recompress, and the question is not about the zip format, it's about the gzip format. The downside is a degradation in compression as compared to recompressing all of it together. – Mark Adler Mar 27 '13 at 13:59
  • @MarkAdler I was talking about zipping generically because, as I believe, ALL zipping - regardless of 'type' - apply a zipping algorithm. I also think you'll find that if you read the first paragraph of his question, it reads: "...the result should be same as decompressing that 50 items, concatenating them and then gzipping them." – Nathan White Mar 27 '13 at 14:07
  • It makes no sense to talk about "zipping" generically for a question about concatenation, since the gzip format explicitly permits concatenation, whereas the zip format does not. There is nothing "disastrous" about concatenating gzip streams. The answer needs to be rewritten to not be misleading. If you want to say that you can't get exactly the same _compressed_ stream by concatenating, then you should say that. You will however get the same _decompressed_ data. – Mark Adler Mar 27 '13 at 14:45
  • Also there is no "zipping algorithm" across all things that have "zip" in the name. gzip uses deflate exclusively, whereas zip can use many different compression algorithms, of which deflate is one. – Mark Adler Mar 27 '13 at 14:46
1

Yes, you can concatenate gzip streams, which, when decompressed, give you the same thing as if you had concatenated the uncompressed data and gzipped it all at once. Specifically:

gzip a
gzip b
cat a.gz b.gz > c.gz
gunzip c.gz

will give you the same c as:

cat a b > c

However, compression will be degraded compared to gzipping the whole thing at once, especially if each of your 50 pieces is small, e.g. less than several tens of kilobytes. The compressed result will always be different, and a little or a lot larger depending on the size of the pieces.
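The concatenation property above is easy to demonstrate with a short sketch using Python's gzip module (used here only to illustrate the format's behavior; the data values are made up):

```python
import gzip

a = gzip.compress(b"hello ")
b = gzip.compress(b"world")

# Two gzip members back to back form a valid multi-member gzip stream.
combined = a + b

# Decompressing the concatenation yields the concatenation of the
# original uncompressed data.
print(gzip.decompress(combined))  # b'hello world'
```

Note that `len(combined)` is larger than `len(gzip.compress(b"hello world"))`: each member carries its own header and trailer, and no dictionary is shared between members.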

The comment in another answer about GZipStream should be heeded. I also recommend that you use DotNetZip instead.

Mark Adler
  • Will this concatenation be faster than gzipping everything from the start? These files are about 5-10K compressed, but they are XML, so tags should be shared across all of them. I believe there shouldn't be much degradation of compression because of that. – Marka Mar 27 '13 at 21:57
  • Yes, concatenation is much faster than compressing. However 5-10K pieces are quite small, small enough that I expect you would see a significant improvement in compression if you recompressed everything in a single stream. – Mark Adler Mar 28 '13 at 01:47
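The degradation for small pieces is easy to measure; a minimal sketch in Python, with made-up XML-ish pieces standing in for the 5-10K DB rows:

```python
import gzip

# Hypothetical small XML-like pieces, standing in for the DB rows.
pieces = [("<item id='%d'>some repeated xml content</item>" % i).encode() * 50
          for i in range(50)]

# Total size when each piece is gzipped on its own (what concatenating
# the stored streams would give you).
separately = sum(len(gzip.compress(p)) for p in pieces)

# Size when everything is compressed as one stream, sharing the
# dictionary across pieces.
together = len(gzip.compress(b"".join(pieces)))

print(separately, together)
```

For repetitive data like XML, `together` comes out noticeably smaller than `separately`, since a single stream avoids 50 headers and reuses matches across piece boundaries.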
0

.NET's GZip support is buggy; in particular, decompressing a gzip file which itself has multiple gzip members is buggy. Not all of these bugs have been ironed out, even in .NET 4.5.

Furthermore, consider how each gzip was created, e.g. is it BGZF ("Blocked GNU Zip Format")? That complicates the issue at hand.

Furthermore, the resulting gzip file can be bigger than if you had concatenated all the uncompressed individual files and gzipped the result (gzip isn't a very good compression format).

I recommend you use DotNetZip instead if it isn't too late.

GZipStream is not really built to handle multiple files; however, you can use System.IO.BinaryWriter and System.IO.BinaryReader to gain full control, although that can get messy. DotNetZip just works, and it is designed to handle multiple files.

P.S. GZipStream works for file sizes up to 8 GB with .NET 4, although earlier versions have a lower limit, e.g. GZipStream works for file sizes up to 4 GB with .NET 3.5.

Paul Zahra
  • Is it possible to use DotNetZip to stream the data (200 concatenated files, I understand) from the [Google Freebase gz file](https://developers.google.com/freebase/data)? File size is about 25 GB. Or, is it definitively necessary to un-gzip the file into its original 250GB form before processing it? This relates to [my question on another thread](http://stackoverflow.com/questions/21868658/c-sharp-parsing-of-freebase-rdf-dump-yields-only-11-5-million-n-triples-instead). – Krishna Gupta Feb 20 '14 at 06:04
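For what it's worth, streaming a multi-member gzip file without expanding it to disk first is straightforward in principle. A minimal Python sketch (the file name and contents are made up; a tiny two-member file stands in for a large concatenated dump like the Freebase one):

```python
import gzip, os, tempfile

# Build a tiny two-member gzip file, standing in for a large
# concatenated dump.
path = os.path.join(tempfile.mkdtemp(), "dump.gz")
with open(path, "wb") as f:
    f.write(gzip.compress(b"line1\n"))
    f.write(gzip.compress(b"line2\n"))

# gzip.open streams across member boundaries transparently, so the
# file never has to be expanded to its full uncompressed size on disk.
with gzip.open(path, "rt") as f:
    lines = f.read().splitlines()

print(lines)  # ['line1', 'line2']
```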