
I am afraid I do not understand the timing results of a MapReduce job. For example, a job I am running gives me the following results from the JobTracker.

Finished in: 1mins, 39sec

CPU time spent (ms): 150,460 (Map), 152,030 (Reduce), 302,490 (Total)

How is "CPU time spent" being measured, and what does it signify? Is it the total cumulative time spent across all of the mappers and reducers assigned to the job? Is it possible to measure other times from the framework, such as the time for shuffle, sort, partition, etc.? If so, how?

A second question also bothers me. I have seen some posts here (Link1, Link2) which suggest using getTime() in the driver class:

long start = new Date().getTime();            // java.util.Date
boolean status = job.waitForCompletion(true); // blocks until the job finishes
long end = new Date().getTime();
System.out.println("Job took " + (end - start) + " milliseconds");

Is this not measuring the same thing that the "Finished in" entry of the JobTracker output provides anyway? Is it necessary? And what is the best way to time a Hadoop job, especially when I want to measure I/O time and compute time per node / per stage?

– Alps

1 Answer


The map phase consists of: record reader, map, combiner, and partitioner.

The reduce phase consists of: shuffle, sort, reduce, and output.

The CPU time you are seeing there is for the entire map phase and the entire reduce phase... not just the functions themselves. The terminology is confusing because the map function and reduce function are only a portion of the map phase and reduce phase, respectively. It is also the total CPU time summed across all of the nodes in the cluster.
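
If you want that same number programmatically rather than from the JobTracker page, the aggregated counters are available to the driver once the job finishes. A minimal sketch, assuming the newer org.apache.hadoop.mapreduce API (Job, Counters, TaskCounter); as far as I know, TaskCounter.CPU_MILLISECONDS is the counter behind the "CPU time spent (ms)" column:

// In the driver, after job.waitForCompletion(true) has returned.
Counters counters = job.getCounters();
// CPU time summed over every task attempt on every node; this is the
// number shown in the "CPU time spent (ms)" column.
long cpuMs = counters.findCounter(TaskCounter.CPU_MILLISECONDS).getValue();
// GC time is another built-in task counter worth watching.
long gcMs = counters.findCounter(TaskCounter.GC_TIME_MILLIS).getValue();
System.out.println("Total CPU time (ms): " + cpuMs);
System.out.println("Total GC time (ms): " + gcMs);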

CPU time is hugely different from real time. CPU time is how much time something spent on the CPUs, while real time is what you and I experience as humans. Think about this: assume you have the same job running over the same data, first on a 20 node cluster, then on a 200 node cluster. Overall, roughly the same amount of CPU time will be used on both clusters, but the 200 node cluster will run about 10x faster in real time. CPU time is a useful metric when you have a shared system with lots of jobs running on it at the same time.
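
To put rough numbers on that with the figures from the question: 302,490 ms of total CPU time could come from 20 nodes each burning about 15 seconds of CPU, or from 200 nodes each burning about 1.5 seconds; the counter reads the same either way, while the wall clock differs by roughly 10x.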

I don't know how you would dive deeper to get CPU time in each phase. Using a date timer is probably not what you are looking for though.
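
That said, the date timer and the CPU counter are more informative together than either one alone: dividing total CPU milliseconds by elapsed wall-clock milliseconds gives a rough average of how many cores the job kept busy. A sketch of that in the driver (System.currentTimeMillis() returns the same value as new Date().getTime(); TaskCounter is from org.apache.hadoop.mapreduce):

long start = System.currentTimeMillis();
job.waitForCompletion(true);
long elapsedMs = System.currentTimeMillis() - start;
long cpuMs = job.getCounters().findCounter(TaskCounter.CPU_MILLISECONDS).getValue();
// CPU ms divided by wall-clock ms approximates how many cores were busy at once.
System.out.printf("Wall clock: %d ms, CPU: %d ms, average parallelism: %.1f%n",
        elapsedMs, cpuMs, (double) cpuMs / elapsedMs);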

– Donald Miner
  • @donaldminer The CPU time is indeed not the same as the time we experience, especially if you take the number of nodes into account. But the CPU time gives you more insight into the energy consumption. As far as I understand, the CPU time includes ALL parts of both the mappers and the reducers. – Jonny5 Jul 10 '15 at 11:01
  • In my distributed mode I have seen MapReduce Total cumulative CPU time: 20 minutes 38 seconds 680 msec and Total MapReduce CPU Time Spent: 20 minutes 25 seconds 980 msec, but the actual elapsed time is just Time taken: 47.769 seconds. Does this mean that running the same job on a single node cluster would take about 20 minutes? Please let me know if that is right! – Indrajeet Gour Dec 16 '15 at 07:10
  • @Donald Miner Can you explain how the same job with the same data will take the same CPU time on a 20 node cluster and on a 200 node cluster? – Dhruv Kapatel Mar 05 '16 at 05:48