
I got one paragraph from the book Hadoop: The Definitive Guide, it is as follows:

"DEFLATE stores data as a series of compressed blocks. The problem is that the start of each block is not distinguished in any way that would allow a reader positioned at an arbitrary point in the stream to advance to the beginning of the next block, thereby synchronizing itself with the stream. For this reason, gzip does not support splitting."

My question is that I cannot understand the author's explanation of why gzip does not support splitting. Can someone give a more detailed explanation?

As I understand it, if a big file is split into 16 blocks and one mapper begins to read one block, two situations may happen:

  1. The mapper cannot read the block at all,
  2. or it can read and process the block, but does not know where its result belongs in the whole stream.

Will one of the above situations happen, or neither, with some other logic instead?

Coinnigh

1 Answer


In order to split a file into pieces for processing, you need two things:

  1. The pieces need to be able to be processed independently.
  2. You need to be able to find where to split the pieces.

The deflate format in its normal usage supports neither. For 1: the deflate format is inherently serial, with every match referring to previously uncompressed data, itself potentially coming from a similar back reference, perhaps all the way to the beginning of the file.
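This serial dependency can be seen directly with Python's `zlib` module (which produces deflate data). A minimal sketch: decompressing the whole stream works, but handing the decompressor the second half of the stream, as a naive split would, fails, because the decompressor cannot synchronize with the block structure and lacks the history that back references point into.

```python
import zlib

# Repetitive data, so deflate emits many back references.
data = b"the quick brown fox jumps over the lazy dog " * 200
comp = zlib.compress(data)

# Decompressing the complete stream recovers the original data.
assert zlib.decompress(comp) == data

# Starting at an arbitrary point in the stream does not work:
# the bits there are mid-block, and any matches refer to earlier
# uncompressed data this decompressor has never seen.
try:
    zlib.decompress(comp[len(comp) // 2:])
    print("unexpectedly succeeded")
except zlib.error:
    print("cannot decompress starting from the middle of the stream")
```

The same failure is what a Hadoop mapper would hit if it were handed an arbitrary split of a gzip file.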

The description you quote doesn't mention that important point.

Though it is a moot point since you don't have 1, for 2: deflate has no apparent markers in the stream to identify block boundaries. To find block boundaries, you would have to decode all of the bits up to the boundary, which would defeat the purpose of splitting the file for independent processing.

That is the point mentioned in your quoted description.

Though this is all true for a normal deflate stream, not prepared for splitting, you can if you like prepare such a deflate stream. The history can be erased at select breakpoints using Z_FULL_FLUSH, which allows independent decompression from that point. It also inserts a visible marker 00 00 ff ff. That's not a very long marker, and could appear by accident in the compressed data. It could be followed by a second flush to insert a second marker giving nine bytes: 00 00 ff ff 00 00 00 ff ff. That is something that Hadoop could use to split the deflate stream.
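A small sketch of that preparation, using Python's `zlib` with a raw deflate stream (`wbits=-15`): a full flush between two pieces erases the history and emits the 00 00 ff ff marker, after which the second piece can be decompressed entirely on its own.

```python
import zlib

part1 = b"first independently decodable piece " * 50
part2 = b"second piece, with no references across the flush " * 50

# Raw deflate (wbits=-15), full-flushing between the two parts.
co = zlib.compressobj(wbits=-15)
chunk1 = co.compress(part1) + co.flush(zlib.Z_FULL_FLUSH)
chunk2 = co.compress(part2) + co.flush(zlib.Z_FINISH)

# The full flush ends with the empty-stored-block marker 00 00 ff ff.
assert chunk1.endswith(b"\x00\x00\xff\xff")

# Because the history was erased at the flush, a fresh decompressor
# can start right at chunk2, with no knowledge of chunk1.
d = zlib.decompressobj(wbits=-15)
assert d.decompress(chunk2) == part2
```

A splitter could scan for the marker bytes to find candidate breakpoints, then hand each piece to a separate worker.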

Mark Adler
  • I completely explained the situation, and you totally didn't get it. – Mark Adler Jul 12 '16 at 01:53
  • Hello, can you please explain your 1st point in an easier way? I cannot understand it even though I have read it many times and searched a lot on the internet: "with every match referring...of the file". – Coinnigh Jul 12 '16 at 10:20
  • You should read about [LZ77](https://en.wikipedia.org/wiki/LZ77_and_LZ78). – Mark Adler Jul 12 '16 at 13:31
  • @MarkAdler Hi Mark, could you explain why `zlib` in Python (which I understand also uses `DEFLATE`) is able to decompress chunks of a stream? Isn't that the same as decompressing splits of a file? An example of `zlib` decompressing chunks is here: https://stackoverflow.com/a/12572031/800735 – cozos Oct 26 '18 at 23:45
  • No, it's not the same. Decompressing a chunk at a time is entirely different from being able to start decompressing somewhere in the middle of the stream. – Mark Adler Oct 27 '18 at 01:15