1st Query
I'm not sure about the command-line syntax, but you can use the Java API itself after job completion, e.g.:
job.waitForCompletion(false);
if (job.isSuccessful()) {
    System.out.println("completionTime : "
            + (job.getFinishTime() - job.getStartTime()) / 1000 + "s");
}
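In case it helps, here is the same idea embedded in a minimal, self-contained driver sketch. It assumes the Hadoop 2 mapreduce API; the class name and paths are hypothetical and the identity Mapper is just a placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TimedJobDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "timed-job");
        job.setJarByClass(TimedJobDriver.class);
        job.setMapperClass(Mapper.class);        // identity mapper, placeholder only
        job.setNumReduceTasks(0);                // map-only, keeps the sketch short
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(false);            // block until the job finishes
        if (job.isSuccessful()) {
            System.out.println("completionTime : "
                    + (job.getFinishTime() - job.getStartTime()) / 1000 + "s");
        }
    }
}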
2nd Query
It will affect job performance, because the job won't be able to take advantage of data locality as much as it could with a replication factor of 3. Data has to be transferred to TaskTrackers where slots are available, which ends up in more network I/O and degraded performance.
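If you want to inspect or restore the replication factor of the input programmatically, here is a small sketch; the path is hypothetical, and HDFS re-replicates in the background after setReplication().

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationCheck {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path input = new Path("/data/input/part-00000");   // hypothetical input file

        // How many replicas of each block the scheduler can pick a local copy from.
        short current = fs.getFileStatus(input).getReplication();
        System.out.println("current replication: " + current);

        // Going back to 3 replicas gives more nodes a local copy of each block,
        // i.e. better chances of data-local map tasks.
        fs.setReplication(input, (short) 3);
    }
}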
3rd Query
The number of mappers is always equal to the number of input splits. The orthodox way is to write a custom InputFormat that splits the data file based on your criteria. Say you have a 1 GB file and you want 5 mappers: just let the InputFormat produce 200 MB splits (each of which spans more than 3 blocks at the default 64 MB block size), as in the sketch below.
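As a sketch of what such an InputFormat could look like: one convenient hook in the new-API FileInputFormat is computeSplitSize(); the class name and the 200 MB figure are just for illustration.

import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Forces ~200 MB splits regardless of the HDFS block size, so a 1 GB text file
// yields roughly 5 map tasks.
public class FixedSizeInputFormat extends TextInputFormat {
    private static final long SPLIT_SIZE = 200L * 1024 * 1024;

    @Override
    protected long computeSplitSize(long blockSize, long minSize, long maxSize) {
        // The default is max(minSize, min(maxSize, blockSize)); we simply pin it.
        return SPLIT_SIZE;
    }
}

Then register it on the job with job.setInputFormatClass(FixedSizeInputFormat.class).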
On the other hand, you can use the default InputFormat and split the file manually into as many pieces as the number of mappers you want before submitting the job. The constraint here is that each sub-file must be no larger than the block size, so for 5 mappers you can handle at most 5 * 64 MB = 320 MB of data this way. A rough sketch of such a splitter follows.
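A rough sketch of that manual split, assuming a plain-text input on HDFS and the default 64 MB block size; all paths are hypothetical, and cutting on line boundaries keeps each record intact inside exactly one sub-file.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ManualSplitter {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        long partSize = 64L * 1024 * 1024;              // keep each sub-file <= one block

        int part = 0;
        long written = 0;
        FSDataOutputStream out = fs.create(new Path("/data/split/part-" + part));
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                fs.open(new Path("/data/big.input")), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                byte[] bytes = (line + "\n").getBytes(StandardCharsets.UTF_8);
                if (written + bytes.length > partSize && written > 0) {
                    out.close();                        // roll over to the next sub-file
                    part++;
                    written = 0;
                    out = fs.create(new Path("/data/split/part-" + part));
                }
                out.write(bytes);
                written += bytes.length;
            }
        } finally {
            out.close();
        }
    }
}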
A third way is to change the block size itself, which would solve the issue without these troubles, but it is not advisable at all because it requires a cluster restart each time.
UPDATE
The easiest, and most probably the best, solution for the 3rd query is to use the mapred.max.split.size configuration on a per-job basis. To run 5 maps over a 1 GB file, do something like this before job submission:
conf.set("mapred.max.split.size", "209715200"); // 200*1024^2 bytes
Pretty simple, huh? And there is another property, mapred.min.split.size, though I'm still a bit confused about its use. This SE post may help you in that regard.
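For what it's worth, the new-API FileInputFormat computes the split size as max(minSize, min(maxSize, blockSize)), so depending on the block size the max alone may not be enough to grow splits beyond a block; pinning both properties to the same value is a safe way to get exactly the split size you want. A sketch, reusing the numbers from above:

// Pinning both bounds makes computeSplitSize() return exactly 200 MB,
// i.e. ~5 splits (and ~5 map tasks) for a 1 GB file.
conf.set("mapred.max.split.size", "209715200");
conf.set("mapred.min.split.size", "209715200");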
Instead, you can also take advantage of the -D option when running the job, e.g.:
hadoop jar job.jar com.test.Main -Dmapred.max.split.size=209715200
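Note that -D is handled by GenericOptionsParser, so it only takes effect if the driver runs through ToolRunner (or parses the generic options itself). A minimal sketch of such a driver; the job setup mirrors the earlier placeholder example:

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Main extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // getConf() already contains whatever was passed with -D on the command line.
        Job job = Job.getInstance(getConf(), "split-size-demo");
        job.setJarByClass(Main.class);
        job.setMapperClass(Mapper.class);   // identity mapper, placeholder only
        job.setNumReduceTasks(0);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Main(), args));
    }
}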
NB: These properties are deprecated in Hadoop 2.5.0 (in favour of mapreduce.input.fileinputformat.split.maxsize and mapreduce.input.fileinputformat.split.minsize). Have a look if you are using that version.