Hadoop input split for a compressed block

Question

If I have a splittable compressed file of 1 GB, and by default the block size and the input split size are both 128 MB, then 8 blocks and 8 input splits are created. When a compressed block is read by MapReduce it is uncompressed; say the block becomes 200 MB after uncompression. But the input split assigned to it is 128 MB, so how are the remaining 72 MB processed?

Are they processed by the next input split?
Or is the size of the same input split increased?

score 1 · Answer 1 · answered Nov 01 '15 at 08:16

Here is my understanding:

Let's assume 1 GB of compressed data decompresses to 2 GB, so you have 16 blocks of data. Bzip2 knows the block boundaries, because a bzip2 file provides a synchronization marker between blocks. So bzip2 splits the data into 16 blocks and sends them to 16 mappers. Each mapper gets decompressed data the size of one input split, 128 MB. (Of course, if the data is not an exact multiple of 128 MB, the last mapper gets less data.)

Ravindra babu · Answer 2 · 2015-10-26T15:20:07.493

Total file size : 1 GB

Block size : 128 MB

Number of splits: 8

Creating a split for each block won’t work since it is impossible to start reading at an arbitrary point in the gzip stream and therefore impossible for a map task to read its split independently of the others. The gzip format uses DEFLATE to store the compressed data, and DEFLATE stores data as a series of compressed blocks. The problem is that the start of each block is not distinguished in any way. For this reason, gzip does not support splitting.
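As a rough illustration of that limitation, the sketch below uses Python's standard `gzip` module and tries to decompress from the middle of a gzip stream. Python standing in for a map task's reader is an assumption of this example, not how Hadoop reads files, and the record contents and the midpoint offset are made up for the demonstration.

```python
import gzip
import zlib

# Made-up input: 10,000 short text records.
data = b"".join(b"record %d\n" % i for i in range(10000))
compressed = gzip.compress(data)

def read_from(offset):
    """Try to decompress starting at a byte offset; return bytes or the error."""
    try:
        return gzip.decompress(compressed[offset:])
    except (OSError, EOFError, zlib.error) as exc:
        return exc

whole = read_from(0)                      # reading from the start of the stream
middle = read_from(len(compressed) // 2)  # reading from an arbitrary midpoint

print(whole == data)
print(type(middle).__name__)
```

The read from offset 0 returns the original records. The read from the midpoint should come back as an error object, because the slice has no gzip header and nothing in the stream marks where a DEFLATE block begins.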

MapReduce will not split the gzipped file, since it knows that the input is gzip-compressed (by looking at the filename extension) and that gzip does not support splitting. This will work, but at the expense of locality: a single map will process the 8 HDFS blocks, most of which will not be local to the map.
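The split counts above can be sketched as follows. This is a simplified model rather than Hadoop's actual code: `compute_splits` and its parameters are hypothetical names, and details of the real split calculation (such as the slack allowed on the last split) are left out. The point it shows is that splits are byte ranges of the compressed file on disk, decided before anything is decompressed.

```python
MB = 1024 * 1024

def compute_splits(file_size, split_size, splittable):
    """Return (offset, length) pairs measured in compressed bytes on disk."""
    if not splittable:
        # Non-splittable codec (e.g. gzip): one split covers the whole file.
        return [(0, file_size)]
    splits = []
    offset = 0
    while offset < file_size:
        length = min(split_size, file_size - offset)
        splits.append((offset, length))
        offset += length
    return splits

gzip_splits = compute_splits(1024 * MB, 128 * MB, splittable=False)
bzip2_splits = compute_splits(1024 * MB, 128 * MB, splittable=True)
print(len(gzip_splits), len(bzip2_splits))  # 1 8
```

For the 1 GB file in the question, the gzip case yields a single split handled by one map, while a splittable codec yields 8 splits of 128 MB of compressed bytes each.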

Have a look at this article, in the section named "Issues about compression and input split".

EDIT (for splittable decompression):

BZip2 is a compression/decompression algorithm that compresses data in blocks, and these compressed blocks can later be decompressed independently of each other. This creates an opportunity: instead of one BZip2-compressed file going to one mapper, chunks of the file can be processed in parallel. The correctness criterion for such processing is that, for a bzip2-compressed file, each compressed block is processed by exactly one mapper and, ultimately, all the blocks of the file are processed. (By processing we mean the actual use, in a mapper, of the uncompressed data coming out of the codec.)

Source: https://issues.apache.org/jira/browse/HADOOP-4012
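The correctness criterion can be sketched with a simplified ownership rule: a compressed block belongs to the split whose byte range contains the block's start marker, even when the block's data runs past the end of that split. This rule, the function name `assign_blocks`, and the offsets below are assumptions made for illustration and are not taken from the Hadoop source.

```python
def assign_blocks(block_starts, split_ranges):
    """Map each split index to the compressed blocks it decompresses.

    A block is owned by the split whose [lo, hi) byte range contains the
    block's start offset, so every block has exactly one owner.
    """
    assignment = {i: [] for i in range(len(split_ranges))}
    for start in block_starts:
        for i, (lo, hi) in enumerate(split_ranges):
            if lo <= start < hi:
                assignment[i].append(start)
                break
    return assignment

# Hypothetical numbers: two 100-byte splits and five compressed blocks.
splits = [(0, 100), (100, 200)]
blocks = [0, 40, 90, 130, 170]
owners = assign_blocks(blocks, splits)
print(owners)  # {0: [0, 40, 90], 1: [130, 170]}
```

The block starting at offset 90 straddles the boundary at 100 but is read entirely by the first split's mapper. Under this model, the split size bounds the compressed bytes a mapper starts from, not the amount of uncompressed data it ends up processing.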

score 0 · Answer 3 · answered Oct 25 '15 at 17:05

I am referring here to compressed files that are splittable, such as bzip2. If an input split is created for a 128 MB block of bzip2, and during MapReduce processing it is uncompressed to 200 MB, what happens?

ZAHEER AHMED

You will lose data locality, as the data can't all be read on the same map node. The data will be spread over other nodes too. The physical split spans multiple data nodes. – Ravindra babu Oct 25 '15 at 17:36
The input splits are decided before the data is uncompressed. So, per my question above, I will initially have 8 input splits of 128 MB for a compressed 1 GB file. But when uncompressed, a 128 MB block would become 200 MB. My input split can process only 128 MB, so after the block data is uncompressed, will the number of input splits increase from 1 to 2, with the first one processing 128 MB and the second one processing the remaining 72 MB? – ZAHEER AHMED Oct 25 '15 at 18:31
