
As we know, in Hadoop's MapReduce a mapper reads from a block that is stored on a node in HDFS. But how does the mapper actually read from the block? Does the block send bytes continuously to the mapper until the mapper has reached its split size, or does it do something else?

If so, in which Java file does this happen? Also, I am using Hadoop 2.7.1, in case that matters.

IFH

2 Answers


A Hadoop MapReduce job's input handling has two main components:

InputSplit: divides the input data sources (e.g., input files) into fragments that make up the inputs to individual map tasks. These fragments are called "splits". Most files, for example, are split on the boundaries of the underlying HDFS blocks and are represented by instances of the FileSplit class. The logic for how to split the input is implemented by the InputFormat (its getSplits() method).

RecordReader: reads the data from a split and feeds it to the map task as records. TextInputFormat, for instance, divides files into splits strictly by byte offset, so a split's end can fall in the middle of a line. In that case the RecordReader has to keep reading into the next split until the end of the line is reached, and pass the whole line to the current mapper.
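As a rough illustration (not the actual Hadoop source), here is a minimal sketch of how a line-oriented reader can keep pulling bytes past the split's end offset until it hits a newline. The class and field names are made up for the example; only LineReader, FSDataInputStream and Text are real Hadoop types:

```java
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.util.LineReader;

// Illustrative sketch only; see LineRecordReader in Hadoop for the real logic.
public class SplitBoundarySketch {
    private LineReader reader; // wraps an FSDataInputStream opened on the HDFS file
    private long pos;          // current byte offset within the file
    private long end;          // end offset of this mapper's split

    public boolean nextLine(Text value) throws IOException {
        // A new record is only started while pos <= end; the final line is
        // allowed to run past 'end' into bytes that belong to the next split.
        if (pos > end) {
            return false;      // this split is exhausted
        }
        int bytesRead = reader.readLine(value);
        if (bytesRead == 0) {
            return false;      // end of file
        }
        pos += bytesRead;
        return true;
    }
}
```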

Please refer to this link for more details.

donut
  • I checked RecordReader.java and InputSplit.java and there is no piece of code that hints at some sort of while loop that reads to the end of the file. – IFH Mar 21 '16 at 22:20
  • Both RecordReader.java and InputSplit.java are interfaces, you have to check the implementation of any input format like TextInputFormat for the class that implements these interfaces. – donut Mar 22 '16 at 06:06

InputFormat describes the input-specification for a Map-Reduce job. The Map-Reduce framework relies on the InputFormat of the job to:

  1. Validate the input-specification of the job.
  2. Split-up the input file(s) into logical InputSplits, each of which is then assigned to an individual Mapper.
  3. Provide the RecordReader implementation to be used to glean input records from the logical InputSplit for processing by the Mapper.
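For context, here is a minimal driver sketch showing where the InputFormat is wired into a job (standard Hadoop 2.x mapreduce API; the class name and paths are placeholders, and mapper/reducer setup is omitted):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DriverSketch {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "wordcount-sketch");
        job.setJarByClass(DriverSketch.class);

        // TextInputFormat (a FileInputFormat) computes the splits, usually one
        // per HDFS block, and supplies the RecordReader for each map task.
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // job.setMapperClass(...); job.setReducerClass(...); omitted here.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```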

InputSplit represents the data to be processed by an individual Mapper.

Have a look at the FileInputFormat code to understand how splitting works.

API:

public List<InputSplit> getSplits(JobContext job) throws IOException
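Paraphrasing the sizing logic inside getSplits() as a simplified, self-contained sketch (the real method also consults block locations and skips splitting for unsplittable inputs such as gzip-compressed files):

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of how FileInputFormat sizes splits in Hadoop 2.x.
public class SplitSizingSketch {

    // Mirrors FileInputFormat.computeSplitSize(blockSize, minSize, maxSize).
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Returns (offset, length) pairs that a file of 'fileLength' bytes
    // would be cut into; each pair later becomes a FileSplit.
    static List<long[]> sliceFile(long fileLength, long splitSize) {
        final double SPLIT_SLOP = 1.1;  // last split may be up to 10% larger
        List<long[]> splits = new ArrayList<>();
        long bytesRemaining = fileLength;
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            splits.add(new long[] { fileLength - bytesRemaining, splitSize });
            bytesRemaining -= splitSize;
        }
        if (bytesRemaining != 0) {
            splits.add(new long[] { fileLength - bytesRemaining, bytesRemaining });
        }
        return splits;
    }
}
```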

The RecordReader breaks the data into key/value pairs for input to the Mapper.

There are multiple RecordReader types.

CombineFileRecordReader, CombineFileRecordReaderWrapper, ComposableRecordReader, 
DBRecordReader, KeyValueLineRecordReader, SequenceFileAsTextRecordReader, 
SequenceFileRecordReader

Most frequently used in practice: LineRecordReader (behind the default TextInputFormat) and KeyValueLineRecordReader (behind KeyValueTextInputFormat).
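This also answers the original question about how bytes reach the mapper: the map task pulls records on demand through the RecordReader, rather than the block pushing bytes continuously. The driving loop looks roughly like Mapper.run() in the Hadoop 2.x mapreduce API, paraphrased here as an override so the loop is visible:

```java
import java.io.IOException;

import org.apache.hadoop.mapreduce.Mapper;

// Paraphrase of Mapper.run(); the default implementation does essentially this.
public class PullLoopMapper<KI, VI, KO, VO> extends Mapper<KI, VI, KO, VO> {
    @Override
    public void run(Context context) throws IOException, InterruptedException {
        setup(context);
        try {
            // Each nextKeyValue() asks the RecordReader for the next record
            // from this task's split; data is read lazily, record by record.
            while (context.nextKeyValue()) {
                map(context.getCurrentKey(), context.getCurrentValue(), context);
            }
        } finally {
            cleanup(context);
        }
    }
}
```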

Have a look at this related SE question for a better understanding of the read internals: How does Hadoop process records split across block boundaries?

Ravindra babu
  • I transferred a compressed text file to HDFS; which RecordReader does it use when I run a WordCount job on it? – IFH Mar 21 '16 at 22:15
  • CustomFileInputFormat & CustomLineReader. Have a look at these two articles: https://hadoopi.wordpress.com/2013/05/27/understand-recordreader-inputsplit/ and http://cutler.io/2012/07/hadoop-processing-zip-files-in-mapreduce/ – Ravindra babu Mar 22 '16 at 14:23
  • I can't seem to find CustomFileInputFormat and CustomLineReader (Hadoop 2.7.1). I understand that this is what one of the links uses, but which one does Hadoop use? – IFH Mar 23 '16 at 17:00
  • CustomFileInputFormat and CustomLineReader are not existing Hadoop classes; they are new classes written by the authors and available in the links above. – Ravindra babu Mar 23 '16 at 17:09