MapReduce output key in ascending order

Question

I have written a MapReduce code for which both keys and values are integers. I am using a single Reducer. The output is like this:

Key    Value
1      78
128    12
174    26
2      44
2957   123
975    91

Is there a way that the output will be sorted by key in ascending order? such that the output looks like this:

Do I need to use conf.setComparator ? If yes, how can I do that?

WestCoastProjects · Answer 1 · 2015-06-16T17:09:41.660

This requires a

TotalOrderPartitioner

https://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/mapreduce/lib/partition/TotalOrderPartitioner.html

which enforces an additional stage in the M/R pipeline to partition the elements into sorted buckets.

The TreeMap solution will not work globally but only within each Reducer.

Here is a gist (not mine) showing how to use TotalOrderPartioner: https://gist.github.com/asimjalis/e5627dc2ff2b23dac70b

The key takeaways from the gist are:

a) you need to invoke reducer.setPartitionerClass to TotalOrderPartitioner:

  // Use Total Order Partitioner.
  reduceJob.setPartitionerClass(TotalOrderPartitioner.class);

b) You need to generate a set of splits to be used as the "buckets" for the TOP

  // Generate partition file from map-only job's output.
  TotalOrderPartitioner.setPartitionFile(
      reduceJob.getConfiguration(), partitionPath);
  InputSampler.writePartitionFile(reduceJob, new InputSampler.RandomSampler(
      1, 10000));

Thanks but I'm at a beginner level, and that code is a little difficult for me to understand. — MChirukuri, May 29 '15 at 16:23

score 4 · Accepted Answer · edited May 23 '17 at 12:29

4

I see three options here:

(and preferred) use the answer of javadba (+1 from me). This is more generic, but requires more effort.
If you can, just use a single reducer. This requires that all the data can fit in the memory of a single machine. Then, the input of the single reducer will be sorted in ascending order of key (what you want).
After the job has finished, you can use the getmerge command of hdfs and then sort the merged file manually, e.g., using the sort command of Linux (or even merge-sort the multiple files, without the getmerge command). After all, you don't have to use MapReduce for everything! Be careful to sort based on the key only! For example, you can run:
```
sort -n -k1,1 filename
```
but there are plenty of more sorting options...

As a final note (for completion) all the above assume that you do not use a Map-only job, in which the output is not sorted. If that's the case, I can only see option 3 working.

UPDATE: For future reference and based on the comments, it seems that the output keys were not of type IntWritable, so they were not sorted as integers.

edited May 23 '17 at 12:29

Community

1
1

answered May 29 '15 at 15:48

vefthym

7,422
6
32
58

I used a single reducer. And the sample output I kept in my question is the result of that. – MChirukuri May 29 '15 at 16:22
@Mounika_22 can you also post your reducer code then? Are your keys IntWritable? – vefthym May 29 '15 at 21:06
Previously I have taken the key as text in reducer code but when I changed it to IntWritable, the output was in ascending order. Thanks for mentioning about the type of keys.. it helped. – MChirukuri Jun 01 '15 at 18:37
@Mounika_22 I am glad that it helped. In general, the input of each reducer is sorted by key, based on the comparison that is defined for this specific type of input. So Text sorts lexicographically, IntWritable sorts numerically. – vefthym Jun 02 '15 at 06:54

score 0 · Answer 3 · answered May 29 '15 at 11:28

0

Use a TreeMap. It's created for this:

A Red-Black tree based NavigableMap implementation. The map is sorted according to the natural ordering of its keys, or by a Comparator provided at map creation time, depending on which constructor is used.

answered May 29 '15 at 11:28

Jordi Castilla

26,609
8
70
109

@Mounika_22 You want to refer to the Java documentation for java.util.collections. – Prahalad Deshpande May 29 '15 at 11:33
2

This does not answer the question... The problem here is to sort the output of a MapReduce job, a Java collection won't help. Why does this have upvotes? – Balduz May 29 '15 at 11:35
1

@Balduz yes it exactly answers as per OP – SpringLearner May 29 '15 at 11:48
@SpringLearner so you are saying that the best way to sort a MapRed output job, is to read the bunch of files created in the HDFS into a TreeMap, and then write it again to the HDFS? Very efficient, yes. – Balduz May 29 '15 at 11:49
1

@Balduz As per the question posted here,this answer exactly explains that – SpringLearner May 29 '15 at 11:50
@SpringLearner The question posted here, asks how to sort by key the output of a job. If you know anything about Hadoop (which obviously it's not the case), you will know that it's the framework the one writing the output, not you. Using a TreeMap does not solve the question. You write the output to a `Context`, and Hadoop writes it. – Balduz May 29 '15 at 11:53
1

@Balduz why are you arguing and not answering if you are so sure? – Jordi Castilla May 29 '15 at 11:54
@SpringLearner [this is an answer](http://stackoverflow.com/questions/14322381/mapreduce-job-output-sort-order). What has been posted here, is not. – Balduz May 29 '15 at 11:57
did you notticed in [your question](http://stackoverflow.com/questions/14322381/mapreduce-job-output-sort-order) [java] is not present? – Jordi Castilla May 29 '15 at 11:59
1

@Balduz you should **read carefully and understand** my previous comment – SpringLearner May 29 '15 at 11:59
1

@SpringLearner "I have a written a MapReduce code ... Is there a way that the output will be sorted by key in ascending order?" It's not thinking beyond the question, it's what has been asked. Having the output of a MapReduce job sorted. And Jordi, did you notice that Hadoop is written in Java? – Balduz May 29 '15 at 12:01
2

The TreeMap only works within one Reducer. It does not enforce a global ordering on the output data. – WestCoastProjects May 29 '15 at 13:00
1

@Balduz is right. This answer is slightly irrelevant with the Hadoop framework and the java tag is there, because hadoop also supports python (but it could be omitted in this case). – vefthym May 29 '15 at 15:57

MapReduce output key in ascending order

3 Answers3

TotalOrderPartitioner