
This is a very basic question, but I was not able to find an answer. What is the actual time taken by a MapReduce program?
Is it the "Finished in" time shown in the first snapshot below? And what is the "CPU time spent" shown in the second snapshot? As you can see, the CPU time spent is much less than the "Finished in" time, so which one should I take as the total running time of the code? Is there any relationship between the CPU time spent, the "Finished in" time, and the actual time taken by the MapReduce program?

First Snapshot
Second Snapshot

Ronak Patel
user3464093

1 Answer


The finish time is the time taken by the program from when you start the process until it finally returns. During this time, the process is not necessarily consuming any CPU cycles: the process scheduler may switch it out to execute something else while your program sits idle (waiting on some signal/flag, or simply because it has used up its allotted slice of CPU time).

So, CPU time + idle time = finish time (pretty much).
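To make the distinction concrete, here is a minimal, self-contained Java sketch (plain Java, nothing Hadoop-specific; the class name and workload sizes are just illustrative) that measures wall-clock time and CPU time around the same piece of work:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class WallVsCpuTime {
    public static void main(String[] args) throws InterruptedException {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();

        long wallStart = System.nanoTime();
        long cpuStart = bean.getCurrentThreadCpuTime();

        // Burn some CPU cycles, then sit idle: the sleep adds to the
        // wall-clock ("finish") time but contributes almost no CPU time.
        long sum = 0;
        for (int i = 0; i < 50_000_000; i++) sum += i;
        Thread.sleep(2000);

        long wallMs = (System.nanoTime() - wallStart) / 1_000_000;
        long cpuMs = (bean.getCurrentThreadCpuTime() - cpuStart) / 1_000_000;

        System.out.printf("wall = %d ms, cpu = %d ms, idle = %d ms (sum=%d)%n",
                wallMs, cpuMs, wallMs - cpuMs, sum);
    }
}
```

Run it and the wall-clock time exceeds the CPU time by roughly the sleep duration; the same gap, at a larger scale, is what separates a job's finish time from its "CPU time spent".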

In the driver class, besides running the MapReduce job, you run a lot of other code. What you should actually be looking at is how much time the MapReduce job takes to process the data you set it to run on. So, if you wish to estimate the time taken for a certain amount of data (which depends on processing power, i.e. the virtual cores available to the ResourceManager, and on memory, also managed by the ResourceManager), you should look at the YARN statistics given in the second image. Again, for a fixed data size, your MapReduce job will not always have the same finish time (as in image 1); it will vary with the availability of resources. Fewer resources mean more idle time. But the stats in image 2 (CPU time) for a given fixed amount of data should remain the same.
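If you want those statistics programmatically rather than from the web UI, a sketch like the following can sit at the end of a driver class (assuming the new `org.apache.hadoop.mapreduce` API; the job setup is elided and the class name is made up):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.TaskCounter;

public class TimingDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "timing-demo");
        // ... set mapper, reducer, input and output paths here as usual ...

        long wallStart = System.currentTimeMillis();
        boolean ok = job.waitForCompletion(true);
        long wallMs = System.currentTimeMillis() - wallStart;

        // CPU_MILLISECONDS backs the "CPU time spent (ms)" line
        // among the job counters shown in the second image.
        Counters counters = job.getCounters();
        long cpuMs = counters.findCounter(TaskCounter.CPU_MILLISECONDS).getValue();

        System.out.printf("finish (wall) = %d ms, cpu = %d ms%n", wallMs, cpuMs);
        System.exit(ok ? 0 : 1);
    }
}
```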

The sum of the mapper time and the reducer time is the total CPU time. CPU time is the time taken by the MapReduce application itself to run. The MapReduce application consists of a Mapper and a Reducer. In turn, the Mapper has tasks such as reading the input files (consisting of records) and looping each record through the map function; after that come the combiner and the partitioner. This data then enters the reduce phase, where each partition (partitioned according to the map output key values) is looped through the reduce function, and the reduce function returns the final output. Before that, the reduce side also does the shuffle and sort. So, the CPU time you see covers this whole process.
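For reference, this is the shape of the code whose execution that counter is charging for; a bare-bones word-count-style skeleton (illustrative only, not the OP's job):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Every input record passes through map(); after the shuffle and sort,
// every key group passes through reduce(). "CPU time spent" accumulates
// the cycles burned in all of these calls across all tasks.
public class WordCountSketch {

    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                word.set(token);
                ctx.write(word, ONE);   // charged to mapper CPU time
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum)); // charged to reducer CPU time
        }
    }
}
```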

ViKiG
  • Thanks for answering. I have to find the running time of my MapReduce code, so which time should I report as the total running time? – user3464093 Jun 29 '16 at 10:16
  • Total code running time from where to where? From running the main function (driver class) until it returns, or inside YARN where the application actually runs? – ViKiG Jun 29 '16 at 10:18
  • I run the MapReduce code by building its jar file and then typing the hadoop jar command. After some time it gives me output in HDFS. So I want to find the time that elapses between running the jar file and getting the output. – user3464093 Jun 29 '16 at 10:31
  • Usually the statistical information shown in the images is for the YARN job, which runs as its own process (not the driver-class jar process). So if you want to know the time taken by the jar executable to return, you could simply use GNU time, `/usr/bin/time -v HADOOP_COMMAND`, on Linux. – ViKiG Jun 29 '16 at 10:41
  • I am running my code on single-node Cloudera. I picked up the ISO provided by them and just run it in my virtual machine. When I run my jar file from the command line, the same process starts in YARN with the same job id. When my job ends, the YARN statistics and the output I get in bash are the same. As you can see, the second snapshot is the YARN output; I get the same output in my bash terminal after running my MapReduce code. – user3464093 Jun 29 '16 at 16:07
  • What is the question? Be precise. – ViKiG Jun 30 '16 at 08:03
  • The question is simple. I want to find out the timing of MapReduce code when it is run on single-node Cloudera. – user3464093 Jun 30 '16 at 09:08
  • These two snapshots are the output of my JobTracker. From these two outputs, can I find the timing of the MapReduce code that I run as a jar file? – user3464093 Jun 30 '16 at 09:10
  • So, can I take the CPU time as the running time of my MapReduce code? – user3464093 Jul 01 '16 at 07:44
  • See this post for more information on CPU time: http://stackoverflow.com/questions/11726388/what-does-cpu-time-for-a-hadoop-job-signify – user3464093 Jul 01 '16 at 07:45
  • Yes, the second image has the CPU time; that is what you should care about. The top answer in the other post also says what I said here. When you start scaling up your cluster size, then you can start worrying about the finish time. – ViKiG Jul 01 '16 at 09:49
  • Can I take the sum of the total mapper time and the total reducer time as my MapReduce running time, or do I have to use only the CPU time? – user3464093 Jul 01 '16 at 13:03
  • Also, there is some confusion about CPU time. Can you elaborate? – user3464093 Jul 01 '16 at 13:04
  • I explained them in my answer. It is very simple: `total_mapper_time + total_reducer_time = cpu_time`. – ViKiG Jul 04 '16 at 06:35
  • That is not the case, as you can see in the second snapshot: the total map and reduce time in slots is approximately 1393 sec, but the CPU time spent is just 1154 sec. – user3464093 Jul 04 '16 at 06:40
  • The slots occupied by the mappers and reducers are the mapper and reducer containers used to run the job. Containers are, again, nothing but resources (CPU cores and RAM). So all mappers and reducers held the slots (not the resources) for 1393 secs, which is the CPU time (1154 secs) plus the idle time. That idle time occurs because even though a mapper/reducer is given a CPU, there will be context switches (the OS scheduler does this). The ResourceManager can distribute resources between MapReduce jobs but cannot lock the CPU away from the actual OS. (Both numbers can be read from the job counters; see the sketch after this thread.) – ViKiG Jul 04 '16 at 07:01
  • Thanks for replying. One suggestion: whenever I asked you something, you just edited the answer, and I never came to know that you had already answered there. So if you add something to the answer, please also post a comment saying you have edited it, so the OP does not ask the same question again. – user3464093 Jul 04 '16 at 07:16
  • Can you advise me? I have written Java code to generate large 1000-bit primes, both on my local machine and on the Hadoop framework. When I run the local code, its running time is 703 seconds. – user3464093 Jul 05 '16 at 12:25
  • When I run my MapReduce code, its CPU time is 653 seconds, while the total map and reduce time in slots is 770 seconds. – user3464093 Jul 05 '16 at 12:26
  • Now, idle time is included in both the local version and the Hadoop version. So for comparing performance, should I take the total map and reduce slot time, or the CPU time? – user3464093 Jul 05 '16 at 12:28
  • The local code took 703 seconds, and that includes the idle time too, so I would say the local code performed well, since the local code's CPU time might be less than or equal to 653 secs. Now, you are distributing computation over a Hadoop cluster, but what is your data? Is it feasible to actually use Hadoop for your problem, or could you simply compute it using plain parallelism? These are the questions you need to ask yourself. First decide what resources your problem requires (e.g. CPU and data storage) and how large they are. Generally, Hadoop performs badly on small data sizes. – ViKiG Jul 05 '16 at 13:28
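As referenced in the thread above, the slot-time vs. CPU-time comparison (1393 s held in slots vs. 1154 s of CPU) can be read straight from the job counters. A sketch, assuming your Hadoop version still exposes `JobCounter.SLOTS_MILLIS_MAPS`/`SLOTS_MILLIS_REDUCES` (they are deprecated in newer releases):

```java
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobCounter;
import org.apache.hadoop.mapreduce.TaskCounter;

public class SlotVsCpu {
    // Call after job.waitForCompletion(true) has returned.
    static void printSlotVsCpu(Job job) throws Exception {
        Counters c = job.getCounters();
        long slotMs = c.findCounter(JobCounter.SLOTS_MILLIS_MAPS).getValue()
                    + c.findCounter(JobCounter.SLOTS_MILLIS_REDUCES).getValue();
        long cpuMs = c.findCounter(TaskCounter.CPU_MILLISECONDS).getValue();

        // In the thread above: slots ~ 1,393,000 ms, cpu ~ 1,154,000 ms,
        // so idle ~ 239,000 ms was spent holding slots without burning cycles.
        System.out.printf("slots = %d ms, cpu = %d ms, idle = %d ms%n",
                slotMs, cpuMs, slotMs - cpuMs);
    }
}
```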