One answer states, about the "same job running over the same data but on one 20 node cluster, then a 200 node cluster", that "Overall, the same amount of CPU time will be used on both clusters". Can someone explain this?

I've used the time command to measure real time. Sometimes I get more CPU time (Hadoop counter) than real time, and sometimes less. I know that real time measures the actual clock time elapsed and that it can be greater or less than user+sys.

I still don't understand what the total CPU time counter measures in Hadoop. Regarding the time command, this answer says it is best to go with user+sys for benchmarks.

  1. Since the total CPU time taken by a process is user+sys, it should be the same as the total CPU time in the Hadoop job counter. But I'm getting different results.
  2. Which time should I use when benchmarking tasks in Hadoop: user+sys, or the total CPU time spent (Hadoop counter)?

Note: the Apache Hive benchmark uses real time, but real time can also be affected by other processes, so I cannot rely on it.

Dhruv Kapatel

1 Answer

"same job running over the same data but on one 20 node cluster, then a 200 node cluster. Overall, the same amount of CPU time will be used on both clusters"

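This means that if a job takes N hours on a 20-node cluster and M hours on a 200-node cluster, then 20 * N should be roughly equal to 200 * M: the wall-clock time shrinks on the larger cluster, but the total CPU time summed over all nodes stays about the same.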
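As a rough worked example of that relation, the sketch below computes the wall-clock time you would expect on the larger cluster if total node-hours were perfectly conserved. The function name and the 10-hour figure are made up for illustration, and the calculation assumes identical nodes and ignores scheduling and shuffle overhead, so treat the result as an idealised lower bound rather than a prediction.

```python
def expected_wall_hours(hours_on_small, small_nodes, large_nodes):
    """Ideal wall-clock hours on the larger cluster.

    Assumes small_nodes * hours_on_small == large_nodes * result,
    i.e. the total node-hours (a stand-in for total CPU time) are conserved.
    """
    total_node_hours = hours_on_small * small_nodes
    return total_node_hours / large_nodes


if __name__ == "__main__":
    # Hypothetical job: N = 10 hours on 20 nodes, so 200 node-hours in total.
    n_hours = 10
    m_hours = expected_wall_hours(n_hours, small_nodes=20, large_nodes=200)
    print(f"N = {n_hours} h on 20 nodes, ideal M = {m_hours} h on 200 nodes")
    print("20 * N =", 20 * n_hours, "and 200 * M =", 200 * m_hours)
```

In practice M is usually somewhat larger than this ideal value, because a bigger cluster adds coordination overhead.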

Real time should be your choice, but as you said above, this value can vary between runs, so you should run the job at least 3 times and take the average as the final result.
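A minimal sketch of that "run it several times and average" advice follows. `average_wall_time` and `run_job` are hypothetical names; `run_job` stands for whatever callable launches your job and blocks until it finishes (for example a wrapper around `subprocess.run` with your own job command).

```python
import statistics
import time


def average_wall_time(run_job, repeats=3):
    """Run `run_job` `repeats` times; return (mean seconds, individual samples)."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()   # wall-clock ("real") time
        run_job()                     # must block until the job is done
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), samples


if __name__ == "__main__":
    # Placeholder workload; replace with a function that submits your job.
    mean_s, runs = average_wall_time(lambda: time.sleep(0.1), repeats=3)
    print("runs:", [round(r, 3) for r in runs], "mean:", round(mean_s, 3))
```

Printing the individual samples alongside the mean makes it easier to spot a run that was disturbed by other load on the cluster.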

Jiacai Liu
  • Okay. Do you have any idea about the 1st point: why is the total CPU time in the job counter different from user+sys? – Dhruv Kapatel Mar 06 '16 at 13:05
  • I think user+sys = MapReduce CPU time + CPU time for other user and sys work, like allocating memory or accessing hardware. Am I right? – Dhruv Kapatel Mar 06 '16 at 13:22
  • @Dhruv Absolutely NOT. `usr + sys` is the running time of the client process, not of the MapReduce framework. – Jiacai Liu Mar 06 '16 at 14:29
  • So if I want to measure CPU time for a MapReduce job, I should only consider the CPU time reported in the job counter, not what I get from the time command, right? – Dhruv Kapatel Mar 06 '16 at 18:00
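To illustrate the point made in the comments, here is a small sketch of what the time command sees: the real time and user+sys of the one local process it launched. The sleeping child is only a stand-in (an assumption, not Hadoop itself) for a client that submits a job and then mostly waits while the work is done on other machines; `measure_child` and `WAITING_CLIENT` are hypothetical names. The child CPU fields of `os.times()` are only meaningful on Unix-like systems.

```python
import os
import subprocess
import sys
import time


def measure_child(cmd):
    """Return (real, user_plus_sys) in seconds for one child process."""
    before = os.times()
    start = time.perf_counter()
    subprocess.run(cmd, check=True)          # blocks until the child exits
    real = time.perf_counter() - start
    after = os.times()
    # CPU time charged to waited-for children of this process only.
    cpu = ((after.children_user - before.children_user)
           + (after.children_system - before.children_system))
    return real, cpu


# Stand-in for a client that waits on remote work: it sleeps, burning little CPU.
WAITING_CLIENT = [sys.executable, "-c", "import time; time.sleep(0.5)"]

if __name__ == "__main__":
    real, cpu = measure_child(WAITING_CLIENT)
    print(f"real = {real:.2f} s, user+sys = {cpu:.2f} s")
```

The local user+sys stays small even though the real time is long, because CPU spent elsewhere is never charged to the local process. That is consistent with the job counter and the time command reporting different numbers.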