When you submit a MapReduce job (or a Pig/Hive job), Hadoop first calculates the input splits. The size of each input split generally equals the HDFS block size; for example, a 1 GB file with a 64 MB block size yields 16 input splits. However, the split size can be configured to be smaller or larger than the HDFS block size. The input splits are calculated by FileInputFormat, and one map task is started for each input split.
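As an aside, here is a tiny standalone sketch of that arithmetic; the 1 GB / 64 MB figures are just the example above, not values read from any cluster:

    // Illustration only: how many map tasks a 1 GB file yields with 64 MB splits.
    public class SplitCountExample {
        public static void main(String[] args) {
            long fileSize  = 1024L * 1024 * 1024;   // 1 GB input file
            long splitSize = 64L * 1024 * 1024;     // 64 MB, assumed equal to the block size
            long numSplits = (fileSize + splitSize - 1) / splitSize;  // ceiling division
            System.out.println(numSplits + " input splits -> " + numSplits + " map tasks"); // 16
        }
    }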
You can, however, change the input split size by configuring the following properties (see the sketch after this list):
mapred.min.split.size: The minimum size chunk that map input should be split into.
mapred.max.split.size: The largest valid size in bytes for a file split.
dfs.block.size: The default block size for new files.
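Here is a minimal sketch of setting the split-size properties in driver code, assuming the org.apache.hadoop.mapreduce (Hadoop 2.x) API; the input path and the 128/256 MB values are made up for illustration, and note that Hadoop 2.x renames these properties to mapreduce.input.fileinputformat.split.minsize and .maxsize (FileInputFormat.setMinInputSplitSize/setMaxInputSplitSize can also be used):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizeConfigExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Classic property names from the list above; Hadoop 2.x uses
            // mapreduce.input.fileinputformat.split.minsize / .maxsize instead.
            conf.setLong("mapred.min.split.size", 128L * 1024 * 1024);  // 128 MB minimum split
            conf.setLong("mapred.max.split.size", 256L * 1024 * 1024);  // 256 MB maximum split

            Job job = Job.getInstance(conf, "split-size-demo");
            job.setJarByClass(SplitSizeConfigExample.class);

            // Hypothetical input path, for illustration only.
            FileInputFormat.addInputPath(job, new Path("/data/input"));

            // ... set mapper, reducer and output as usual, then submit the job.
        }
    }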
The split size is then computed as follows (this is essentially what FileInputFormat.computeSplitSize does), with minSplitSize and maxSplitSize taken from the two properties above and blockSize from dfs.block.size:

    Math.max(minSplitSize, Math.min(maxSplitSize, blockSize));
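To see the effect of the formula, here is a small standalone sketch that mirrors it for a few assumed configurations; the sizes are illustrative, not defaults read from a cluster:

    public class ComputeSplitSizeExample {
        // Mirrors the formula above: clamp the block size between min and max split size.
        static long computeSplitSize(long blockSize, long minSize, long maxSize) {
            return Math.max(minSize, Math.min(maxSize, blockSize));
        }

        public static void main(String[] args) {
            long blockSize = 64L * 1024 * 1024;   // assumed dfs.block.size of 64 MB

            // min <= block <= max  ->  split size equals the block size (64 MB)
            System.out.println(computeSplitSize(blockSize, 1L, Long.MAX_VALUE));

            // min raised to 128 MB  ->  split size grows to 128 MB (fewer, larger splits)
            System.out.println(computeSplitSize(blockSize, 128L * 1024 * 1024, Long.MAX_VALUE));

            // max lowered to 32 MB  ->  split size shrinks to 32 MB (more, smaller splits)
            System.out.println(computeSplitSize(blockSize, 1L, 32L * 1024 * 1024));
        }
    }

In short, raising mapred.min.split.size above the block size produces fewer, larger splits, while lowering mapred.max.split.size below the block size produces more, smaller splits.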