
I have a small SQLite database (postcode -> US city name) and a big S3 file of users. I would like to map every user to the city name associated with their postcode.

I followed the famous WordCount.java example, but I'm not sure how MapReduce works internally:

  • Is my mapper created once per S3 input file?
  • Should I connect to the SQLite database when the mapper is created? Should I do so in the constructor of the mapper?
Thomas

2 Answers


1) A mapper is created once per split, which is usually 128 MB or 256 MB. You can configure the split size with these parameters: mapreduce.input.fileinputformat.split.minsize and mapreduce.input.fileinputformat.split.maxsize. If the input file is smaller than the split size, it all goes into one map task.
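For example, a minimal sketch of setting those two properties where you build the job configuration (the byte values and job name are just illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SplitSizeExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Illustrative values: constrain splits to roughly 128 MB .. 256 MB.
    conf.setLong("mapreduce.input.fileinputformat.split.minsize", 128L * 1024 * 1024);
    conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 256L * 1024 * 1024);
    Job job = Job.getInstance(conf, "postcode-to-city");
    // ... configure mapper/reducer/input/output as usual, then submit the job.
  }
}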

2) You can use the setup and cleanup methods to configure resources for the task. setup is called once at task start and cleanup is called once at the end. So you can open your database connection in setup (probably not just connect, but load all the cities into memory for performance) and close the connection (if you decided not to load the data, but just connect) in cleanup.
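A rough sketch of that approach, assuming the SQLite file has been shipped to every node (e.g. via the distributed cache) and is readable at a local path, and assuming the sqlite-jdbc driver is on the classpath; the file name, table/column names, and input line format are all placeholders:

import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CityLookupMapper extends Mapper<LongWritable, Text, Text, Text> {

  private final Map<String, String> postcodeToCity = new HashMap<>();

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    // Load the whole postcode -> city table into memory once per task.
    // "cities.db" is assumed to be available locally (e.g. shipped with -files).
    try (Connection conn = DriverManager.getConnection("jdbc:sqlite:cities.db");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery("SELECT postcode, city FROM cities")) {
      while (rs.next()) {
        postcodeToCity.put(rs.getString("postcode"), rs.getString("city"));
      }
    } catch (Exception e) {
      throw new IOException("Failed to load postcode table", e);
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Assumes each input line looks like "userId,postcode" -- adjust to your format.
    String[] fields = value.toString().split(",");
    String city = postcodeToCity.getOrDefault(fields[1], "UNKNOWN");
    context.write(new Text(fields[0]), new Text(city));
  }
}

Because everything is loaded in setup, there is nothing to close per record and no cleanup is needed; if you kept an open connection instead, you would close it in cleanup.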

fi11er

MapReduce is a framework for writing applications that process big data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner. MapReduce executes on top of HDFS (Hadoop Distributed File System) in two phases, the map phase and the reduce phase.

Answer to your question "Is my mapper created once per S3 input file?":

The number of mappers created equals the number of splits, and by default one split is created per block.

A high-level overview looks like this:

input file -> InputFormat -> Splits -> RecordReader -> Mapper -> Partitioner -> Shuffle & Sort -> Reducer -> final output

For example (a driver sketch wiring these pieces together follows the list):

  1. Your input files: server1.log, server2.log, server3.log.
  2. InputFormat creates a number of splits based on the block size (by default).
  3. For each split, a Mapper is allocated to work on that split.
  4. A RecordReader sits between the split and the Mapper to deliver one record (line) at a time.
  5. Then the Partitioner runs on the map output.
  6. After the Partitioner, the Shuffle & Sort phase starts.
  7. The Reducer runs.
  8. The final output is written.
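As a hedged illustration of where each of those pieces is plugged in, here is a minimal driver sketch; it uses the identity Mapper/Reducer and the default HashPartitioner just so it compiles and runs as-is, and you would substitute your own classes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class PipelineDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "pipeline-example");
    job.setJarByClass(PipelineDriver.class);

    job.setInputFormatClass(TextInputFormat.class); // InputFormat -> splits + RecordReader
    job.setMapperClass(Mapper.class);               // identity mapper; one mapper runs per split
    job.setPartitionerClass(HashPartitioner.class); // Partitioner (this is the default anyway)
    job.setReducerClass(Reducer.class);             // identity reducer, runs after shuffle & sort
    job.setOutputFormatClass(TextOutputFormat.class);

    job.setOutputKeyClass(LongWritable.class);      // types produced by the identity mapper/reducer
    job.setOutputValueClass(Text.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. server1.log, server2.log, server3.log
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // final output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}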

Answer to your 2nd question: below are the three standard life-cycle methods of Mapper, shown here inside an illustrative Mapper subclass.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative class name; extend Mapper with your own key/value types as needed.
public class MyMapper extends Mapper<Object, Text, Text, IntWritable> {

  @Override
  protected void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    // Called once per input record: filter/transform your data here.
  }

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    // Called only once, at task startup, before the first map() call.
    System.out.println("called only once at startup");
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    // Called only once, at the end, after the last map() call.
    System.out.println("called only once at the end");
  }
}
subodh
  • For more on input splits you may follow this SO question: http://stackoverflow.com/questions/14291170/how-does-hadoop-process-records-split-across-block-boundaries – subodh Apr 24 '17 at 10:26