3

I'm fairly new to Hadoop Map/Reduce. I'm trying to write a Map/Reduce job to find the average time taken by n processes, given an input text file like the one below:

ProcessName Time
process1    10
process2    20
processn    30

I went through a few tutorials, but I'm still not able to get a thorough understanding. What should my mapper and reducer classes do for this problem? Will my output always be a text file, or is it possible to directly store the average in some sort of a variable?

Thanks.

highlycaffeinated
Anon
    Possible duplicate of [Find the average of numbers using map-reduce](http://stackoverflow.com/questions/10667575/find-the-average-of-numbers-using-map-reduce) – kellyfj Mar 18 '17 at 14:09

2 Answers

3

Your mapper maps your inputs to the value that you want to take the average of. So let's say that your input is a text file formatted like

ProcessName Time
process1    10
process2    20
.
.
.

Then you would need to take each line in your file, split it, grab the second column, and output the value of that column as an IntWritable (or some other Writable numeric type). Since you want to take the average of all times, not grouped by process name or anything, you will have a single fixed key. Thus, your mapper would look something like

// Inside a Mapper<LongWritable, Text, IntWritable, IntWritable>
private final IntWritable one = new IntWritable(1);
private final IntWritable output = new IntWritable();

@Override
protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    // Split the tab-separated line and emit the time column under a single fixed key
    String[] fields = value.toString().split("\t");
    output.set(Integer.parseInt(fields[1]));
    context.write(one, output);
}

Your reducer takes these values, and simply computes the average. This would look something like

// Inside a Reducer<IntWritable, IntWritable, IntWritable, DoubleWritable>
private final DoubleWritable average = new DoubleWritable();

@Override
protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    // Sum all times and divide by their count to get the overall average
    int sum = 0;
    int count = 0;
    for (IntWritable value : values) {
        sum += value.get();
        count++;
    }
    average.set(sum / (double) count);
    context.write(key, average);
}

I'm making a lot of assumptions here, about your input format and whatnot, but they are reasonable assumptions and you should be able to adapt this to suit your exact needs.

> Will my output always be a text file or is it possible to directly store the average in some sort of a variable?

You have a couple of options here. You can post-process the output of the job (written to a single file), or, since you're computing a single value, you can store the result in a counter, for example.
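For the counter route, a rough sketch of what that could look like (the counter group and names here are made up for illustration; Hadoop counters hold long values, so the final division happens back in the driver rather than in the job output):

// In the reducer, alongside (or instead of) writing the average:
context.getCounter("AverageJob", "TIME_SUM").increment(sum);
context.getCounter("AverageJob", "RECORD_COUNT").increment(count);

// In the driver, after the job has finished:
job.waitForCompletion(true);
long timeSum = job.getCounters().findCounter("AverageJob", "TIME_SUM").getValue();
long recordCount = job.getCounters().findCounter("AverageJob", "RECORD_COUNT").getValue();
double average = timeSum / (double) recordCount;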

jason
  • Thanks Jason. Just one more thing. I've set up a Hadoop cluster with 1 job tracker and 3 task trackers. Do I need to have the input file on all the task trackers, or is it enough to have it on the job tracker alone? And is there any way of finding out whether the job is distributed evenly to all the slave nodes? – Anon Aug 05 '13 at 16:41
  • HDFS and Hadoop will handle that for you. You can monitor a job by opening port 50030 (that's the *default* port at least) on your Hadoop job tracker node, and from there you can access your task trackers. Note that since you have exactly one output key from your map job, your reduce task will only run on one node. – jason Aug 05 '13 at 17:32
3

Your Mappers read the text file and apply the following map function to every line:

map: (key, value)
  time = value[2]
  emit("1", time)

All map calls emit the key "1", which will be processed by a single reduce function:

reduce: (key, values)
  result = sum(values) / n
  emit("1", result)

Since you're using Hadoop, you have probably seen StringTokenizer used in map functions; you can use it to pull just the time out of each line (a small parsing sketch follows below). You also need some way to compute n (the number of processes); for example, you could use a Counter in another job that just counts lines.
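A minimal parsing sketch inside the map function, assuming each line is whitespace-separated as "processName time" (the counter group and name are only illustrative):

// Parse one input line with StringTokenizer
StringTokenizer tokenizer = new StringTokenizer(value.toString());
String processName = tokenizer.nextToken();         // first column: process name
long time = Long.parseLong(tokenizer.nextToken());  // second column: time

// One way to count lines for n: bump a counter and read it from the driver afterwards
context.getCounter("AverageJob", "LINE_COUNT").increment(1);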

Update
If you were to execute this job as described, a tuple would have to be sent to the reducer for each line, potentially clogging the network if you run a Hadoop cluster on multiple machines. A smarter approach computes the sum of the times closer to the inputs, e.g. by specifying a combiner:

combine: (key, values)
  emit(key, sum(values))

This combiner is then executed on the results of all map functions on the same machine, i.e., without any networking in between. The reducer would then only get roughly as many tuples as there are machines in the cluster, rather than as many as there are lines in your log files.
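In Hadoop's Java API, a combiner is just a Reducer registered with job.setCombinerClass(...). A minimal sketch, assuming the mapper emits the times as LongWritable values under a single Text key (the class and key types here are illustrative, not taken from the answer above):

public static class SumCombiner
        extends Reducer<Text, LongWritable, Text, LongWritable> {

    private final LongWritable partialSum = new LongWritable();

    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        // Pre-sum the times emitted by the map tasks on this node
        long sum = 0;
        for (LongWritable value : values) {
            sum += value.get();
        }
        partialSum.set(sum);
        context.write(key, partialSum);
    }
}

The driver would wire it up with job.setCombinerClass(SumCombiner.class). Note that this only works because the reducer sums its inputs and divides by an externally computed n, so partial sums can safely stand in for the individual values.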

contradictioned
  • Thanks. Just one more thing. Is there any way of finding out whether the job is distributed evenly to all the slave nodes? I have a cluster with 1 master and 3 slave nodes. – Anon Aug 05 '13 at 16:43
  • The file you put into HDFS is split into several blocks, and these blocks are replicated across your cluster (see: http://hadoop.apache.org/docs/stable/hdfs_design.html ). Mapper instances are then started across the cluster, one per block. By chance two mappers can work on the same block on different nodes; then one "wins", the other is aborted, and its intermediate results are thrown away. If you want to analyze your jobs, you have to look at their logs. The jobtracker's web interface provides some statistics. – contradictioned Aug 05 '13 at 16:47
  • This is an example of how useless a badly designed Hadoop algorithm can be. – logi-kal Sep 25 '18 at 10:47
  • @horcrux Please elaborate. – contradictioned Sep 25 '18 at 10:50
  • @contradictioned You are summing all the values in the reducer, without pre-summing in the mappers or, better, in a combiner. So the whole job is done by one single worker; you need the same computational power as a plain avg() that could run on a single machine, and you are just adding networking overhead on top. – logi-kal Sep 25 '18 at 10:57
  • While you are right that specifying a combiner helps with efficient execution of such a job, I think the first step should be to get an understanding of the basic model of a map-reduce job. Only after that can one look at the system at hand and start optimizing.