I came across the following paragraph in the book Hadoop: The Definitive Guide:
"DEFLATE stores data as a series of compressed blocks. The problem is that the start of each block is not distinguished in any way that would allow a reader positioned at an arbitrary point in the stream to advance to the beginning of the next block, thereby synchronizing itself with the stream. For this reason, gzip does not support splitting."
My question is that I cannot understand the author's reasoning for why gzip does not support splitting. Can someone give me a more detailed explanation?
As I understand it, suppose a big file is split into 16 blocks. When one mapper begins to read one of the blocks, two situations may happen:
- The mapper cannot read (decompress) the block at all
- or it can read and process the block, but does not know where its result belongs in the whole stream
Will one of these two situations happen, or will neither happen because there is some other logic involved?
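To illustrate what I mean, here is a small Python experiment (my own sketch, not from the book) using the standard `zlib` module, which implements the same DEFLATE algorithm that gzip uses internally. A reader that starts at byte 0 can decompress everything, but a reader dropped at an arbitrary offset, like a mapper handed the middle of a split, seems unable to resynchronize with the stream:

```python
import zlib

# Compress some repetitive data; gzip wraps the same DEFLATE
# stream format that zlib.compress produces.
data = b"hello hadoop " * 10000
compressed = zlib.compress(data)

# A reader positioned at the very start of the stream can
# decompress the whole thing.
assert zlib.decompress(compressed) == data

# A reader dropped at an arbitrary byte offset (as a mapper reading
# the second half of a split would be) cannot find the start of the
# next DEFLATE block, so decompression fails instead of
# resynchronizing partway through.
try:
    zlib.decompress(compressed[len(compressed) // 2:])
    print("unexpectedly decompressed from mid-stream")
except zlib.error:
    print("cannot decompress starting from the middle of the stream")
```

Is this failure in the second half of the sketch essentially the first of my two situations above?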