
I just want to confirm the following. Please verify whether this is correct:

  1. As per my understanding, when we copy a file into HDFS, that is the point at which the file (assuming its size > 64 MB = HDFS block size) is split into multiple chunks, and each chunk is stored on a different data-node.

  2. File contents are already split into chunks when the file is copied into HDFS, and that file-split does not happen at the time of running a map job. Map tasks are only scheduled in such a way that each works on a chunk of max. size 64 MB with data-locality (i.e. the map task runs on the node that contains that data/chunk).

  3. File splitting also happens if the file is compressed (gzipped), but MR ensures that each such file is processed by just one mapper, i.e. MR will collect all the chunks of the gzip file lying on other data-nodes and give them all to a single mapper.

  4. The same as above will happen if we define isSplitable() to return false, i.e. all the chunks of a file will be processed by one mapper running on one machine. MR will read all the chunks of the file from the different data-nodes and make them available to that single mapper.

sunillp

3 Answers


David's answer pretty much hits the nail on the head; I am just elaborating on it here.

There are two distinct concepts at work here, and each is handled by a different entity in the Hadoop framework.

Firstly --

1) Dividing a file into blocks -- When a file is written into HDFS, HDFS divides the file into blocks and takes care of their replication. This is done once (mostly) and is then available to all MR jobs running on the cluster. It is a cluster-wide configuration.
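
For instance, once a file is in HDFS you can ask where its blocks landed via the standard FileSystem API. A minimal sketch (the path /data/example.txt is made up):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlocks {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // Hypothetical path; substitute any file already stored in HDFS.
            FileStatus status = fs.getFileStatus(new Path("/data/example.txt"));

            // One BlockLocation per HDFS block, listing offset, length and the
            // data-nodes that hold a replica of that block.
            BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println(block);
            }
        }
    }

Note that this reflects the blocks as written, independent of any job that later reads them.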

Secondly --

2) Splitting a file into input splits -- When an input path is passed to an MR job, the job uses that path, along with the input format configured, to divide the files specified in the input path into splits; each split is processed by a map task. The calculation of input splits is done by the input format each time a job is executed.

Now, once we have this under our belt, we can understand that the isSplitable() method comes under the second category.
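
To make this concrete, here is a minimal sketch of the kind of input format the question alludes to: a hypothetical WholeFileTextInputFormat (newer org.apache.hadoop.mapreduce API) whose isSplitable() returns false, so getSplits() emits exactly one split per file:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    // Hypothetical input format: never split a file, whatever its block count.
    public class WholeFileTextInputFormat extends TextInputFormat {
        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            // FileInputFormat.getSplits() consults this; returning false makes
            // it emit a single split spanning the whole file. HDFS still stores
            // the file as multiple blocks -- concept 1 is unaffected.
            return false;
        }
    }

One mapper then reads the whole file, pulling any remote blocks over the network as needed.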

To really nail this down, have a look at the HDFS write data flow (Concept 1):

[Diagram: HDFS write data flow]

The second point in the diagram is probably where the split happens; note that this has nothing to do with the running of an MR job.

Now have a look at the execution steps of an MR job:

[Diagram: execution steps of an MR job]

Here the first step is the calculation of the input splits via the InputFormat configured for the job.
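
As a sketch of that configuration step (the job name and input path are made up), the input format and, optionally, a split-size cap are set per job:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class SplitDemo {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "split-demo");

            // The input format configured here computes the splits at submit
            // time -- HDFS block boundaries are only its default hint.
            job.setInputFormatClass(TextInputFormat.class);
            FileInputFormat.addInputPath(job, new Path("/data/input"));

            // Optional: cap each split at 32 MB, independent of block size.
            FileInputFormat.setMaxInputSplitSize(job, 32L * 1024 * 1024);
        }
    }

(A real job would also set mapper, reducer, and output classes; this shows only the split-related part.)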

A lot of your confusion stems from the fact that you are conflating these two concepts; I hope this makes it a little clearer.

Sudarshan
  • Does data replication occur during the initial data write/distribution? Or is data first distributed across the DataNodes, and then instructed to replicate from the NameNode post-data write? – sudo soul Jan 11 '19 at 18:04
  • From the official Apache guide: 1. "The replication factor can be specified at file creation time and can be changed later." 2. "The NameNode makes all decisions regarding replication of blocks." Hence why I'm confused. – sudo soul Jan 11 '19 at 18:06
  • Never mind, I understand now. Block distribution and replication *do* occur at the time of write, since before the file is stored, its block placement and replication placement are determined by the NameNode. – sudo soul Jan 11 '19 at 18:33

Your understanding is not ideal. I would point out that there are two, almost independent, processes: splitting files into HDFS blocks, and splitting files for processing by the different mappers.

  • HDFS splits files into blocks based on the defined block size.
  • Each input format has its own logic for how files can be split into parts for independent processing by different mappers. The default logic of FileInputFormat is to split a file by HDFS blocks. You can implement any other logic.
  • Compression is usually a foe of splitting, so we employ block compression techniques to enable splitting of compressed data. This means that each logical part of the file (block) is compressed independently.
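
Hadoop's stock text input format takes exactly this into account. A sketch of the splittability test, modeled on TextInputFormat's own check: a file is splittable when it is uncompressed or its codec implements SplittableCompressionCodec (bzip2 does; gzip does not, so a .gz file yields a single split):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;
    import org.apache.hadoop.io.compress.SplittableCompressionCodec;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class CodecAwareInputFormat extends TextInputFormat {
        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            CompressionCodec codec = new CompressionCodecFactory(
                    context.getConfiguration()).getCodec(file);
            if (codec == null) {
                return true; // plain file: split along block boundaries
            }
            // bzip2 implements SplittableCompressionCodec; gzip does not,
            // so a gzipped file goes to one mapper in its entirety.
            return codec instanceof SplittableCompressionCodec;
        }
    }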

David Gruzman
  • Hi David, thanks for this clarification. Suppose my file is 128 MB. I guess in this case HDFS will split it into two chunks, each of size 64 MB (assuming HDFS block size = 64 MB), and these may be stored on different machines. Now if I use my own FileInputFormat that just extends TextInputFormat and returns false in isSplitable(), what will the behaviour be: will there be just one mapper that receives both chunks of input (i.e. the entire file processed by just one mapper), or two mappers, each processing one chunk of the file? – sunillp Feb 13 '12 at 17:50
  • I am confused here. I wanted to try it out, but somehow my custom input format compiles yet does not run on the test setup. – sunillp Feb 13 '12 at 17:55
  • If isSplitable() returns false, the file will be processed by one mapper, regardless of the number of blocks. – David Gruzman Feb 13 '12 at 23:10
  • A little confused. If the input to the MapReduce job is a file located on a storage device (not on HDFS), please help clarify: 1. Will HDFS first split it into HDFS blocks before triggering the MR job, given that the input data is on the storage device and not on HDFS? 2. If not, will the RecordReader read the contents of this file from the storage device at runtime to give it to the mapper? – Raj Jun 13 '18 at 06:23

Yes, file contents are split into chunks when the file is copied into HDFS. The block size is configurable, and if it is, say, 128 MB, then the whole 128 MB would be one block, not two separate blocks of 64 MB. Also, it is not necessary that each chunk of a file is stored on a separate data-node: a data-node may hold more than one chunk of a particular file, and a particular chunk may be present on more than one data-node, depending on the replication factor.
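
As an illustration (the path and sizes are made up), both the block size and the replication factor can be chosen per file at write time through the FileSystem API:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriteWithBlockSize {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // Hypothetical file: 4 KB buffer, replication factor 3,
            // 128 MB block size, so a 128 MB file occupies a single block.
            FSDataOutputStream out = fs.create(new Path("/data/example.txt"),
                    true, 4096, (short) 3, 128L * 1024 * 1024);
            out.writeUTF("hello hdfs");
            out.close();
        }
    }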

Tariq
  • Please read the question more carefully - your answer has nothing to do with the topic. – Yauheni Sivukha May 13 '12 at 00:23
  • May I ask in what manner? The title itself says "Hadoop/HDFS file splitting". I agree that it is not as good as David's answer, but I don't find anything that is not related to the topic. Please let me know so that I don't repeat the mistake. Thank you. – Tariq Nov 22 '12 at 12:14
  • I think the answer is a good TL;DR! Got my upvote. Thanks! – Bruno Ambrozio Jul 23 '20 at 15:16