
I am trying to understand how many map and reduce tasks get started for a MapReduce job, and how to control that number.

Say I have a 1 TB file in HDFS and my block size is 128 MB. For an MR job on this 1 TB file, if I specify the input split size as 256 MB, how many map and reduce tasks get started? From my understanding this depends on the split size, i.e. number of map tasks = total file size / split size, which in this case works out to 1024 * 1024 MB / 256 MB = 4096. So Hadoop would start 4096 map tasks.
1) Am I right?

2) If I think this is an inappropriate number, can I tell Hadoop to start fewer tasks, or even more? If yes, how?

And how about the number of reduce tasks spawned? I think this is totally controlled by the user.
3) But how and where should I specify the number of reduce tasks required?

samshers
  • Possible duplicate of [Setting the number of map tasks and reduce tasks](https://stackoverflow.com/questions/6885441/setting-the-number-of-map-tasks-and-reduce-tasks) – Rahul Sharma Jul 26 '17 at 19:14

1 Answer


1. Yes, you're right. Number of mappers = (size of data) / (input split size), so in your case it works out to 4096.

2. As per my understanding, before Hadoop 2.7 you could only *hint* at the mapper count with conf.setNumMapTasks(int num); the actual number of map tasks is still derived from the input splits. From Hadoop 2.7 onward you can cap the number of concurrently running map tasks with mapreduce.job.running.map.limit. See this JIRA ticket. (Both knobs are shown in the first sketch after this list.)

3. By default the number of reducers is 1. You can change it with job.setNumReduceTasks(int num);

You can also provide this parameter from the CLI: -D mapreduce.job.reduces=<num reduce tasks> (mapred.reduce.tasks is the older, deprecated property name). See the second sketch below.
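
To make points 1 and 2 concrete, here is a minimal driver sketch. It is only an illustration, not part of the original answer: the class name, job name, and the limit of 100 are made-up values, and it assumes the new org.apache.hadoop.mapreduce API. It pins the split size at 256 MB (so a 1 TB input yields ~4096 map tasks) and caps how many of them may run at once:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Cap how many map tasks may run at the same time (Hadoop 2.7+).
        // This throttles concurrency; it does NOT change the total number
        // of map tasks, which is still driven by the input splits.
        conf.setInt("mapreduce.job.running.map.limit", 100); // 100 is illustrative

        Job job = Job.getInstance(conf, "split size demo");
        job.setJarByClass(SplitSizeExample.class);

        // Force 256 MB splits: each mapper then reads two 128 MB blocks,
        // so 1 TB of input -> roughly 1024 * 1024 MB / 256 MB = 4096 mappers.
        FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);
        FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        // ... set mapper/reducer classes, output path, and submit the job ...
    }
}
```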
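
And for point 3, a sketch of a driver that sets the reducer count explicitly and, because it goes through ToolRunner, also accepts the -D override from the command line. Again, the class name and the count of 10 are illustrative, not prescribed by the answer:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ReducerCountExample extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "reducer count demo");
        job.setJarByClass(ReducerCountExample.class);

        // Explicitly request 10 reduce tasks (the default is 1).
        job.setNumReduceTasks(10);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // ... set mapper/reducer classes here ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner parses generic options such as -D mapreduce.job.reduces=N,
        // so the reducer count set above can be overridden at launch time:
        //   hadoop jar myjob.jar ReducerCountExample -D mapreduce.job.reduces=20 /in /out
        System.exit(ToolRunner.run(new Configuration(), new ReducerCountExample(), args));
    }
}
```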

Anurag Yadav