I am trying to understand how many map and reduce tasks get started for a MapReduce job, and how to control that number.
Say I have a 1TB file in HDFS and my block size is 128MB.
For a MapReduce job on this 1TB file, if I specify the input split size as 256MB, how many map and reduce tasks get started? From my understanding, the number of map tasks depends on the split size, i.e. number of map tasks = total file size / split size, which in this case works out to 1024 * 1024 MB / 256 MB = 4096. So Hadoop would start 4096 map tasks.
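For context, my arithmetic is based on my reading of the new-API FileInputFormat source, where the effective split size seems to be max(minSize, min(maxSize, blockSize)). Here is a small standalone sketch of that calculation (the property names in the comments are what I believe the defaults map to; please correct me if I have this wrong):

```java
// Sketch of how FileInputFormat appears to pick the split size
// (my reading of the source, not authoritative).
public class SplitCount {
    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;   // dfs.blocksize = 128 MB
        long minSize   = 256L * 1024 * 1024;   // mapreduce.input.fileinputformat.split.minsize (what I set)
        long maxSize   = Long.MAX_VALUE;       // mapreduce.input.fileinputformat.split.maxsize (default)

        // splitSize = max(minSize, min(maxSize, blockSize))
        long splitSize = Math.max(minSize, Math.min(maxSize, blockSize));

        long fileSize  = 1024L * 1024 * 1024 * 1024;             // 1 TB
        long numSplits = (fileSize + splitSize - 1) / splitSize; // ceiling division
        System.out.println(splitSize / (1024 * 1024) + " MB splits -> " + numSplits + " map tasks");
    }
}
```

This prints 256 MB splits and 4096 map tasks, which matches my calculation above.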
1) Am I right?
2) If I think this is an inappropriate number, can I tell Hadoop to start fewer (or more) map tasks? If yes, how?
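This is what I am guessing the knobs are, using the new-API FileInputFormat helpers to change the split size (and hence the number of map tasks). Is this the right approach, or is there a more direct way?

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class MapTaskCountSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "split-size-sketch");

        // Fewer map tasks: force larger splits by raising the minimum split size.
        FileInputFormat.setMinInputSplitSize(job, 512L * 1024 * 1024);   // 512 MB splits

        // More map tasks: force smaller splits by lowering the maximum split size.
        // FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024); // 64 MB splits
    }
}
```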
And what about the number of reduce tasks spawned? I think this is entirely controlled by the user.
3) But how and where should I specify the number of reduce tasks required?
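From what I have read, I am guessing it is set on the Job like this (or with -D mapreduce.job.reduces=N on the command line), but I would like confirmation:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerCountSketch {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "reducer-count-sketch");
        job.setNumReduceTasks(32);   // ask for 32 reduce tasks
    }
}
```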