
I'm new to MapReduce and I've completed the Hadoop word-count example.

In that example it produces an unsorted file (of key-value pairs) of word counts. Is it possible to sort the output by the number of word occurrences by chaining another MapReduce task onto the first one?

Mechanical snail
Aina Ari
  • This question is quite old, so I'll just comment: it can be done very easily in Pig: `a = load '/out/wordcount' as (word:chararray, num:int); b = order a by num; store b into '/out/wordcount-sorted';` – wlk Jun 21 '10 at 13:09

4 Answers


In the simple word-count MapReduce program, the output we get is sorted by words. Sample output:

Apple 1
Boy 30
Cat 2
Frog 20
Zebra 1

If you want the output sorted by the number of occurrences of each word, i.e. in the format below:

1 Apple
1 Zebra
2 Cat
20 Frog
30 Boy

you can create another MR program with the mapper and reducer below, where the input is the output of the simple word-count program.

class Map1 extends MapReduceBase implements Mapper<Object, Text, IntWritable, Text>
{
    public void map(Object key, Text value, OutputCollector<IntWritable, Text> collector, Reporter reporter) throws IOException
    {
        // Each input line looks like "<word> <count>", e.g. "Apple 1".
        String line = value.toString();
        StringTokenizer stringTokenizer = new StringTokenizer(line);

        // Defaults in case a line is malformed.
        int number = 999;
        String word = "empty";

        if (stringTokenizer.hasMoreTokens())
        {
            word = stringTokenizer.nextToken().trim();
        }

        if (stringTokenizer.hasMoreTokens())
        {
            number = Integer.parseInt(stringTokenizer.nextToken().trim());
        }

        // Emit (count, word) so the framework sorts by count during the shuffle.
        collector.collect(new IntWritable(number), new Text(word));
    }
}


class Reduce1 extends MapReduceBase implements Reducer<IntWritable, Text, IntWritable, Text>
{
    public void reduce(IntWritable key, Iterator<Text> values, OutputCollector<IntWritable, Text> collector, Reporter reporter) throws IOException
    {
        // Keys arrive sorted by count; just re-emit every (count, word) pair.
        while (values.hasNext())
        {
            collector.collect(key, values.next());
        }
    }
}
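For completeness, a minimal driver for this second job might look like the following. This is a hedged configuration sketch using the same old `org.apache.hadoop.mapred` API as the classes above; the job name and the input/output paths are placeholders, not part of the original answer.

```java
// Configuration sketch for the sorting job (old mapred API).
// Paths and job name are placeholders for illustration.
JobConf conf = new JobConf(Map1.class);
conf.setJobName("sort-by-count");

conf.setOutputKeyClass(IntWritable.class);
conf.setOutputValueClass(Text.class);

conf.setMapperClass(Map1.class);
conf.setReducerClass(Reduce1.class);

// A single reducer yields one globally sorted output file.
conf.setNumReduceTasks(1);

FileInputFormat.setInputPaths(conf, new Path("/out/wordcount"));
FileOutputFormat.setOutputPath(conf, new Path("/out/wordcount-sorted"));

JobClient.runJob(conf);
```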
Neo
  • so ... are we supposed to add this below the original code ? can you please tell me more specific ? I'm new to java and new to hadoop ... here's my code : http://stackoverflow.com/questions/28785337/how-to-re-arrange-wordcount-hadoop-output-result-and-sort-them-by-value – JPC Feb 28 '15 at 19:30
  • I also get this error " The type Mapper cannot be a superinterface of Map1; a superinterface must be an interface " – JPC Feb 28 '15 at 20:06
  • "In simple word count map reduce program the output we get is sorted by words". That is not true – vefthym Aug 19 '16 at 14:29

As you have said, one possibility is to write two jobs.

First job: the simple wordcount example.

Second job: does the sorting part.

The pseudocode could be:

Note: the output file generated by the first job will be the input for the second job.

    Mapper2(String _key, IntWritable _value) {
        // Just reverse the positions of _value and _key. This is useful because
        // the reducer will then receive the output sorted and shuffled by count.
        emit(_value, _key);
    }

    Reduce2(IntWritable valueOfMapper2, Iterable<String> keysOfMapper2) {
        // On the reducer side, all the keys (words) that have the same count
        // are merged together.
        for each K in keysOfMapper2 {
            emit(K, valueOfMapper2); // This will sort in ascending order.
        }
    }

You can also sort in descending order, for which it is feasible to write a separate comparator class that does the trick. Include the comparator in the job as:

job.setSortComparatorClass(MyComparator.class);

(or conf.setOutputKeyComparatorClass(...) with the old mapred API). This comparator will sort the keys in descending order before they reach the reducer, so on the reducer side you just emit the values.
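The heart of such a comparator is just an inverted comparison. Here is a minimal sketch of the ordering logic in plain Java (the Hadoop version would subclass `WritableComparator` and compare deserialized `IntWritable` keys, but the inversion is the same; the `DescendingDemo` class name is invented for illustration):

```java
import java.util.Arrays;
import java.util.Comparator;

public class DescendingDemo {
    // Swapping the arguments of compare() flips the sort order.
    static final Comparator<Integer> DESCENDING = (a, b) -> Integer.compare(b, a);

    public static void main(String[] args) {
        Integer[] counts = {1, 30, 2, 20, 1};
        Arrays.sort(counts, DESCENDING);
        System.out.println(Arrays.toString(counts)); // prints [30, 20, 2, 1, 1]
    }
}
```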

Marco167

The output from the Hadoop MapReduce wordcount example is sorted by the key, so the output should be in alphabetical order.

With Hadoop you can create your own key objects that implement the WritableComparable interface, allowing you to override the compareTo method. This lets you control the sort order.

To create output that is sorted by the number of occurrences, you would probably have to add another MapReduce job to process the output of the first, as you have said. This second job would be very simple, maybe not even requiring a reduce phase. You would just need to implement your own Writable key object to wrap the word and its frequency. A custom Writable looks something like this:

 public class MyWritableComparable implements WritableComparable<MyWritableComparable> {
       // Some data
       private int counter;
       private long timestamp;

       public void write(DataOutput out) throws IOException {
         out.writeInt(counter);
         out.writeLong(timestamp);
       }

       public void readFields(DataInput in) throws IOException {
         counter = in.readInt();
         timestamp = in.readLong();
       }

       public int compareTo(MyWritableComparable w) {
         int thisValue = this.counter;
         int thatValue = w.counter;
         return (thisValue < thatValue ? -1 : (thisValue == thatValue ? 0 : 1));
       }
     }

I grabbed this example from here.

You should probably override hashCode, equals and toString as well.
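A hedged sketch of what those overrides might look like for a word-plus-frequency key (the `WordCount` class name and its fields are invented for illustration, and the Hadoop `write`/`readFields` methods are omitted so the sketch stays plain Java):

```java
import java.util.Objects;

public class WordCount implements Comparable<WordCount> {
    private final String word;
    private final int count;

    public WordCount(String word, int count) {
        this.word = word;
        this.count = count;
    }

    // Sort primarily by count, then by word, so the ordering is deterministic.
    @Override
    public int compareTo(WordCount other) {
        int byCount = Integer.compare(this.count, other.count);
        return byCount != 0 ? byCount : this.word.compareTo(other.word);
    }

    // equals and hashCode must agree: equal objects get equal hash codes.
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof WordCount)) return false;
        WordCount that = (WordCount) o;
        return count == that.count && word.equals(that.word);
    }

    @Override
    public int hashCode() {
        return Objects.hash(word, count);
    }

    // Matches the "count word" output format used in the answers above.
    @Override
    public String toString() {
        return count + " " + word;
    }
}
```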

Binary Nerd

In Hadoop, sorting is done between the Map and Reduce phases. One approach to sorting by word occurrence is to use a custom group comparator that doesn't group anything; every call to reduce then receives just the key and a single value.

public class Program {
   public static void main( String[] args) throws IOException {
      JobConf conf = new JobConf( Program.class);

      conf.setOutputKeyClass( IntWritable.class);
      conf.setOutputValueClass( Text.class);
      conf.setMapperClass( Map.class);
      conf.setReducerClass( IdentityReducer.class);
      conf.setOutputValueGroupingComparator( GroupComparator.class);
      conf.setNumReduceTasks( 1);
      JobClient.runJob( conf);
   }
}

public class Map extends MapReduceBase implements Mapper<Text,IntWritable,IntWritable,Text> {

   public void map( Text key, IntWritable value, OutputCollector<IntWritable,Text> output, Reporter reporter) throws IOException {
       // Swap key and value so the framework sorts by count.
       output.collect( value, key);
   }
}

public class GroupComparator extends WritableComparator {
    protected GroupComparator() {
        super( IntWritable.class, true);
    }

    // Never report two keys as equal, so each reduce call gets exactly one value.
    public int compare( WritableComparable w1, WritableComparable w2) {
        return -1;
    }
}
Jon Snyder
  • @minghan nope, [`Comparator`](http://docs.oracle.com/javase/8/docs/api/java/util/Comparator.html) requires `compare` – asgs Jun 07 '17 at 18:06