MapReduce is a framework for writing applications that process big data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner. MapReduce runs on top of HDFS (Hadoop Distributed File System) and executes in two phases, the map phase and the reduce phase.
Answer to your question "Is my mapper created once per s3 input file?":
The number of Mappers equals the number of input splits, and by default one split is created per block.
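As a rough sketch (assuming a standard FileInputFormat-based job; the 64 MB / 256 MB values are purely illustrative), you can bound the split size, and therefore the number of Mapper tasks, like this:

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitTuning {
    // By default one split is created per HDFS block of each input file,
    // so one Mapper task runs per block. These bounds override that default;
    // the sizes below are only examples.
    static void boundSplitSize(Job job) {
        FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);  // at least 64 MB per split
        FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024); // at most 256 MB per split
    }
}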
A high-level overview looks like this:
input file -> InputFormat -> Splits -> RecordReader -> Mapper -> Partitioner -> Shuffle & Sort -> Reducer -> final output
Example:
- Your input files: server1.log, server2.log, server3.log
- InputFormat creates a number of splits, by default based on the block size.
- A Mapper is allocated to work on each split.
- A RecordReader sits between the split and the Mapper to deliver one record (line) at a time from the split.
- Then the Partitioner runs on the Mapper output.
- After the Partitioner, the Shuffle & Sort phase starts.
- The Reducer processes the grouped, sorted data.
- The final output is written. (A minimal driver wiring these stages together is sketched after this list.)
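For context, here is a minimal driver sketch that wires these stages together. The class names LogFilterDriver and LogReducer, and the use of command-line paths, are only illustrative; LogMapper refers to the Mapper shown below, and the default HashPartitioner plus Shuffle & Sort run automatically between the map and reduce phases.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogFilterDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "log filter");
        job.setJarByClass(LogFilterDriver.class);

        // InputFormat turns the input files into splits; one Mapper task runs per split
        job.setInputFormatClass(TextInputFormat.class);
        job.setMapperClass(LogMapper.class);   // Mapper shown below
        job.setReducerClass(LogReducer.class); // hypothetical Reducer, not shown here
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Placeholder paths; point the input at server1.log, server2.log, server3.log
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Partitioner, Shuffle & Sort and the reduce phase run after the map phase
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}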
Answer to your 2nd question:
Below are the three standard life-cycle methods of a Mapper.
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Example Mapper class wrapping the three life-cycle methods
public class LogMapper extends Mapper<Object, Text, Text, IntWritable> {

    @Override
    protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        // Called once for every record the RecordReader delivers from the split
        // Filter your data here
    }

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Called only once per Mapper task, before the first map() call
        System.out.println("calls only once at startup");
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Called only once per Mapper task, after the last map() call
        System.out.println("calls only once at end");
    }
}
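The framework drives these three methods from the Mapper's run() loop, once per split. A minimal sketch of that call order (a paraphrase of the default behaviour, not the verbatim Hadoop source):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LifecycleMapper extends Mapper<Object, Text, Text, IntWritable> {
    @Override
    public void run(Context context) throws IOException, InterruptedException {
        setup(context);                      // once, before the first record of the split
        try {
            while (context.nextKeyValue()) { // once per record from the RecordReader
                map(context.getCurrentKey(), context.getCurrentValue(), context);
            }
        } finally {
            cleanup(context);                // once, after the last record of the split
        }
    }
}

So setup() is the usual place to acquire per-task resources and cleanup() the place to release them.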