1st Query
I'm not sure about the command-line syntax, but you can use the Java API itself after job completion, e.g.:
job.waitForCompletion(false);
if (job.isSuccessful()) {
    System.out.println("completionTime : "
            + (job.getFinishTime() - job.getStartTime()) / 1000 + "s");
}
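In case it helps, here is the same idea embedded in a minimal, self-contained driver sketch. It assumes the Hadoop 2 mapreduce API; the class name and paths are hypothetical and the identity Mapper is just a placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TimedJobDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "timed-job");
        job.setJarByClass(TimedJobDriver.class);
        job.setMapperClass(Mapper.class);        // identity mapper, placeholder only
        job.setNumReduceTasks(0);                // map-only, keeps the sketch short
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(false);            // block until the job finishes
        if (job.isSuccessful()) {
            System.out.println("completionTime : "
                    + (job.getFinishTime() - job.getStartTime()) / 1000 + "s");
        }
    }
}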
2nd Query
It will affect job performance, because the job won't be able to take advantage of data locality as much as it could with a replication factor of 3. Data has to be transferred to TaskTrackers where slots are available, which ends up in more network I/O and degraded performance.
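If you want to inspect or restore the replication factor of the input programmatically, here is a small sketch; the path is hypothetical, and HDFS re-replicates in the background after setReplication().

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationCheck {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path input = new Path("/data/input/part-00000");   // hypothetical input file

        // How many replicas of each block the scheduler can pick a local copy from.
        short current = fs.getFileStatus(input).getReplication();
        System.out.println("current replication: " + current);

        // Going back to 3 replicas gives more nodes a local copy of each block,
        // i.e. better chances of data-local map tasks.
        fs.setReplication(input, (short) 3);
    }
}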
3rd Query
The number of mappers is always equal to the number of input splits. The orthodox way is to write a custom InputFormat that splits the data file based on your criteria. Say you have a 1 GB file and you want 5 mappers: just let the InputFormat produce 200 MB splits (each of which spans more than 3 blocks at the default 64 MB block size), as in the sketch below.
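As a sketch of what such an InputFormat could look like: one convenient hook in the new-API FileInputFormat is computeSplitSize(); the class name and the 200 MB figure are just for illustration.

import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Forces ~200 MB splits regardless of the HDFS block size, so a 1 GB text file
// yields roughly 5 map tasks.
public class FixedSizeInputFormat extends TextInputFormat {
    private static final long SPLIT_SIZE = 200L * 1024 * 1024;

    @Override
    protected long computeSplitSize(long blockSize, long minSize, long maxSize) {
        // The default is max(minSize, min(maxSize, blockSize)); we simply pin it.
        return SPLIT_SIZE;
    }
}

Then register it on the job with job.setInputFormatClass(FixedSizeInputFormat.class).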
On the other hand, you can use the default InputFormat and split the file manually into as many pieces as the number of mappers you want before submitting the job. The constraint here is that each sub-file must be no larger than the block size, so for 5 mappers you can handle at most 5 * 64 MB = 320 MB of data this way. A rough sketch of such a splitter follows.
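A rough sketch of that manual split, assuming a plain-text input on HDFS and the default 64 MB block size; all paths are hypothetical, and cutting on line boundaries keeps each record intact inside exactly one sub-file.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ManualSplitter {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        long partSize = 64L * 1024 * 1024;              // keep each sub-file <= one block

        int part = 0;
        long written = 0;
        FSDataOutputStream out = fs.create(new Path("/data/split/part-" + part));
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                fs.open(new Path("/data/big.input")), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                byte[] bytes = (line + "\n").getBytes(StandardCharsets.UTF_8);
                if (written + bytes.length > partSize && written > 0) {
                    out.close();                        // roll over to the next sub-file
                    part++;
                    written = 0;
                    out = fs.create(new Path("/data/split/part-" + part));
                }
                out.write(bytes);
                written += bytes.length;
            }
        } finally {
            out.close();
        }
    }
}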
A third way is to change the block size itself, which would solve the issue without these troubles, but it is not advisable at all because it requires a cluster restart each time.
UPDATE
The easiest, and most probably the best, solution for the 3rd query is to use the mapred.max.split.size configuration on a per-job basis. To run 5 maps over a 1 GB file, do something like this before job submission:
conf.set("mapred.max.split.size", "209715200"); // 200*1024^2 bytes
Pretty simple, huh? And there is another property, mapred.min.split.size, though I'm still a bit confused about its use. This SE post may help you in that regard.
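For what it's worth, the new-API FileInputFormat computes the split size as max(minSize, min(maxSize, blockSize)), so depending on the block size the max alone may not be enough to grow splits beyond a block; pinning both properties to the same value is a safe way to get exactly the split size you want. A sketch, reusing the numbers from above:

// Pinning both bounds makes computeSplitSize() return exactly 200 MB,
// i.e. ~5 splits (and ~5 map tasks) for a 1 GB file.
conf.set("mapred.max.split.size", "209715200");
conf.set("mapred.min.split.size", "209715200");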
Instead, you can also take advantage of the -D option when running the job, e.g.:
hadoop jar job.jar com.test.Main -Dmapred.max.split.size=209715200
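Note that -D is handled by GenericOptionsParser, so it only takes effect if the driver runs through ToolRunner (or parses the generic options itself). A minimal sketch of such a driver; the job setup mirrors the earlier placeholder example:

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Main extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // getConf() already contains whatever was passed with -D on the command line.
        Job job = Job.getInstance(getConf(), "split-size-demo");
        job.setJarByClass(Main.class);
        job.setMapperClass(Mapper.class);   // identity mapper, placeholder only
        job.setNumReduceTasks(0);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Main(), args));
    }
}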
NB: These properties are deprecated in Hadoop 2.5.0 (in favour of mapreduce.input.fileinputformat.split.maxsize and mapreduce.input.fileinputformat.split.minsize). Have a look if you are using that version.