InputFormat describes the input-specification for a Map-Reduce job.
The Map-Reduce framework relies on the InputFormat of the job to:
- Validate the input-specification of the job.
- Split up the input file(s) into logical InputSplits, each of which is then assigned to an individual Mapper.
- Provide the RecordReader implementation to be used to glean input records from the logical InputSplit for processing by the Mapper.
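Abridged from the Hadoop source, that contract boils down to two abstract methods in the new org.apache.hadoop.mapreduce API (the comments below are mine, mapping each method back to the bullets above; input validation happens inside concrete getSplits implementations such as FileInputFormat's):

```java
import java.io.IOException;
import java.util.List;

// Collaborating types (InputSplit, JobContext, RecordReader,
// TaskAttemptContext) live in the same org.apache.hadoop.mapreduce package.
public abstract class InputFormat<K, V> {

    // Carves the input into logical InputSplits, one per Mapper;
    // concrete subclasses also validate the job's input specification here.
    public abstract List<InputSplit> getSplits(JobContext context)
            throws IOException, InterruptedException;

    // Supplies the RecordReader that gleans key/value records
    // from a single InputSplit for the Mapper.
    public abstract RecordReader<K, V> createRecordReader(
            InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException;
}
```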
InputSplit represents the data to be processed by an individual Mapper.
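For example, the most common concrete split, FileSplit, is essentially a (path, start, length, hosts) tuple. A minimal illustration (the file path, sizes, and host name below are made up):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class FileSplitExample {
    public static void main(String[] args) {
        FileSplit split = new FileSplit(
                new Path("hdfs:///data/input.txt"), // file this split belongs to
                0L,                                 // byte offset where the split starts
                128L * 1024 * 1024,                 // split length in bytes
                new String[] { "datanode1" });      // hosts storing the data, used for locality
        System.out.println(split.getPath() + " @" + split.getStart()
                + " len=" + split.getLength());
    }
}
```

Note that a split is only a logical reference to the data; no bytes are copied when it is created.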
Have a look at the FileInputFormat code to understand how splitting works. Its central API is:

```java
public List<InputSplit> getSplits(JobContext job) throws IOException { ... }
```
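A simplified sketch of what that method computes for a plain, splittable file, paraphrasing the Hadoop source rather than reproducing it (the 300 MB file length and 128 MB block size are made-up example values):

```java
public class SplitSizeSketch {
    public static void main(String[] args) {
        long minSize = 1L;                    // mapreduce.input.fileinputformat.split.minsize
        long maxSize = Long.MAX_VALUE;        // mapreduce.input.fileinputformat.split.maxsize
        long blockSize = 128L * 1024 * 1024;  // HDFS block size of the file
        long fileLength = 300L * 1024 * 1024; // example: a 300 MB input file

        // Core formula: clamp the block size between the configured bounds.
        // splitSize = max(minSize, min(maxSize, blockSize))
        long splitSize = Math.max(minSize, Math.min(maxSize, blockSize));

        // Carve the file into splits of splitSize; the final chunk may be
        // up to SPLIT_SLOP (10%) larger to avoid a tiny trailing split.
        final double SPLIT_SLOP = 1.1;
        long bytesRemaining = fileLength;
        while ((double) bytesRemaining / splitSize > SPLIT_SLOP) {
            System.out.printf("split @%d len=%d%n",
                    fileLength - bytesRemaining, splitSize);
            bytesRemaining -= splitSize;
        }
        if (bytesRemaining != 0) {
            System.out.printf("split @%d len=%d%n",
                    fileLength - bytesRemaining, bytesRemaining);
        }
    }
}
```

So with the defaults, split size tracks the HDFS block size, and you can force larger or smaller splits via the min/max properties named in the comments.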
The RecordReader breaks the data into key/value pairs for input to the Mapper.
There are multiple RecordReader implementations: CombineFileRecordReader, CombineFileRecordReaderWrapper, ComposableRecordReader, DBRecordReader, KeyValueLineRecordReader, SequenceFileAsTextRecordReader, SequenceFileRecordReader, and others. One of the most frequently used is KeyValueLineRecordReader.
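A minimal sketch of how KeyValueLineRecordReader is typically wired in, via KeyValueTextInputFormat (the job name and comma separator are illustrative choices; the default separator is a tab):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class KeyValueJobSetup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Each line is split into (key, value) at the FIRST occurrence
        // of the separator; everything after it becomes the value.
        conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");

        Job job = Job.getInstance(conf, "kv-example");
        // KeyValueTextInputFormat hands each Mapper a KeyValueLineRecordReader,
        // producing Text keys and Text values.
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        // ... set mapper, paths, output types, etc., then job.waitForCompletion(true)
    }
}
```

With a comma separator, the line a,b,c is read as key a and value b,c.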
For a better understanding of the internals of a read, have a look at the related SE question: How does Hadoop process records split across block boundaries?