
My problem can be described with the following statements:

  • I would like my program to be able to compress and decompress selected files
  • I have very large files (20 GB+). It is safe to assume that the files will never fit into memory
  • Even after compression, the compressed file might still not fit into memory
  • I would like to use System.IO.Compression.GZipStream from the .NET Framework
  • I would like my application to be parallel

As I am a newbie to compression/decompression, I had the following idea for how to do it:

I could split the files into chunks and compress each of them separately, then merge them back into a whole compressed file. (Picture showing my idea.)

Question 1 about this approach: will compressing multiple chunks and then merging them back together give me the proper result? That is, if I reverse the process (starting from the compressed file and going back to decompressed), will I get the same original input back?

Question 2 about this approach: does this approach make sense to you? Perhaps you could point me towards some good reading on the topic? Unfortunately, I could not find anything myself.

  • Why not just pipe the data from a file stream, through a GZip stream, and back out through another file stream? There's no need for either file to fit in memory. – glenebob Apr 23 '18 at 21:23
  • @glenebob and Cory, thank you gentlemen, I understand! Am I right in understanding that I could have multiple streams compressing the same file, each starting from a different point? I would end up with N streams, which I then merge together? – Radoslaw Jurewicz Apr 23 '18 at 21:34
  • That could work, but the result would not be a valid GZip file. You would have to invent a file format. Each "chunk" would be a valid GZip file in and of itself. Your format would need to describe each chunk to allow later decompression of the individual chunks to occur. – glenebob Apr 23 '18 at 21:39
  • @CoryNelson gzip compression _can_ be parallelized, but I don't think that's what the question is about. It seems to just be about controlling memory usage. To be clear, any concatenation of valid gzip streams is also a valid gzip stream. There is no need for a "proprietary container". – Mark Adler Apr 23 '18 at 23:15

1 Answer


You do not need to chunk the compression just to limit memory usage. gzip is designed to be a streaming format, and requires on the order of 256KB of RAM to compress. The size of the data does not matter. The input could be one byte, 20 GB, or 100 PB -- the compression will still only need 256KB of RAM. You just read uncompressed data in, and write compressed data out until done.
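
For example, a minimal sketch of that single streaming pass might look like the following (the paths, class name, and method name are placeholders, not anything from the question):

    using System.IO;
    using System.IO.Compression;

    class StreamingGzip
    {
        // Compress one file into another in a single streaming pass.
        // Memory usage stays small regardless of file size, because the
        // data is piped through in fixed-size buffers.
        static void Compress(string inputPath, string outputPath)
        {
            using (FileStream input = File.OpenRead(inputPath))
            using (FileStream output = File.Create(outputPath))
            using (GZipStream gzip = new GZipStream(output, CompressionMode.Compress))
            {
                input.CopyTo(gzip);
            }
        }
    }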

The only reason to chunk the input as you diagram is to make use of multiple cores for compression, which is a perfectly good reason for your amount of data. Then you can do exactly what you describe: so long as you combine the output in the correct order, the decompression will reproduce the original input. You can always concatenate valid gzip streams to make a valid gzip stream. I would recommend that you make the chunks relatively large, e.g. several megabytes each, so that the compression ratio is not noticeably impacted by the chunking.
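
One way to arrange that is sketched below. It is only an illustration under assumed names and a hypothetical 16 MB chunk size: each round reads roughly one chunk per core, compresses each chunk into its own complete gzip member in parallel, and appends the members in their original order, so memory stays bounded at a few chunks rather than the whole file.

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.IO.Compression;
    using System.Threading.Tasks;

    class ParallelGzip
    {
        // Hypothetical chunk size: large chunks keep the per-chunk
        // overhead small, so the ratio barely suffers.
        const int ChunkSize = 16 * 1024 * 1024;

        static void CompressParallel(string inputPath, string outputPath)
        {
            int workers = Environment.ProcessorCount;

            using (var input = File.OpenRead(inputPath))
            using (var output = File.Create(outputPath))
            {
                while (true)
                {
                    // Read up to one chunk per worker for this round.
                    var chunks = new List<byte[]>();
                    for (int i = 0; i < workers; i++)
                    {
                        var buffer = new byte[ChunkSize];
                        int read = input.Read(buffer, 0, ChunkSize);
                        if (read == 0) break;
                        if (read < ChunkSize) Array.Resize(ref buffer, read);
                        chunks.Add(buffer);
                    }
                    if (chunks.Count == 0) break;

                    // Compress this round's chunks on multiple cores, each
                    // into its own complete gzip member.
                    var members = new byte[chunks.Count][];
                    Parallel.For(0, chunks.Count, i =>
                    {
                        using (var ms = new MemoryStream())
                        {
                            using (var gzip = new GZipStream(ms, CompressionMode.Compress))
                            {
                                gzip.Write(chunks[i], 0, chunks[i].Length);
                            }
                            members[i] = ms.ToArray();
                        }
                    });

                    // Append the members in their original order; the
                    // concatenation is itself a valid gzip stream.
                    foreach (var member in members)
                    {
                        output.Write(member, 0, member.Length);
                    }
                }
            }
        }
    }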

Decompression cannot be chunked in this way, but it is much faster, so there would be little to no benefit even if you could. Decompression is usually I/O-bound.
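
For completeness, a sketch of the streaming decompression side (again with placeholder paths and names). Note that whether GZipStream reads through all concatenated members in a single pass can depend on the runtime version, so that is worth verifying against your target framework.

    using System.IO;
    using System.IO.Compression;

    class GzipDecompress
    {
        static void Decompress(string inputPath, string outputPath)
        {
            using (var input = File.OpenRead(inputPath))
            using (var gzip = new GZipStream(input, CompressionMode.Decompress))
            using (var output = File.Create(outputPath))
            {
                // One pass: read compressed bytes in, write decompressed bytes out.
                gzip.CopyTo(output);
            }
        }
    }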

Mark Adler