I need to modify the Hadoop WordCount example so that it counts only the words that start with the prefix "cons", and then sort the results in descending order of frequency. Can anybody tell me how to write the mapper and reducer code for this?

Code:

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> 
{ 
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException 
    { 
        //Replacing all digits and punctuation with an empty string 
        String line =  value.toString().replaceAll("\\p{Punct}|\\d", "").toLowerCase();
        //Extracting the words 
        StringTokenizer record = new StringTokenizer(line); 
        //Emitting each word as a key and one as its value 
        while (record.hasMoreTokens()) 
            context.write(new Text(record.nextToken()), new IntWritable(1)); 
    } 
}
jordanhill123
  • In this code, I need to modify the mapper so that it counts only the words which start with "cons". – blackbookstar Oct 02 '14 at 23:18
  • Here is the link to the Hadoop WordCount code: http://wiki.apache.org/hadoop/WordCount – blackbookstar Oct 02 '14 at 23:23
  • I think the mapper will be the same one as in the link above, and only the reducer will change. Can anybody tell me how to write the reducer code? It needs some modifications. – blackbookstar Oct 02 '14 at 23:25

1 Answer

To count the number of words that start with "cons", you can simply discard all other words when emitting from the mapper.

public void map(Object key, Text value, Context context) throws IOException,
        InterruptedException {
    IntWritable one = new IntWritable(1);
    String[] words = value.toString().split(" ");
    for (String word : words) {
        if (word.startsWith("cons"))
              context.write(new Text("cons_count"), one);
    }
}

The reducer will now receive only one key, cons_count, and you can sum up its values to get the count.
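
The filter-and-count idea can be tried outside Hadoop first. Below is a minimal plain-Java sketch, not Hadoop code: it assumes you want the same normalization as the question's mapper (strip digits and punctuation, lowercase) before the prefix test, which the split(" ") version above does not do. The class and method names (ConsFilter, countPrefixed) are made up for illustration.

```java
import java.util.StringTokenizer;

// Hypothetical helper, not part of Hadoop: counts prefixed words in one input line.
class ConsFilter {

    // Normalizes the line the way the question's mapper does, then counts
    // the tokens that start with the given prefix.
    static int countPrefixed(String line, String prefix) {
        String cleaned = line.replaceAll("\\p{Punct}|\\d", "").toLowerCase();
        StringTokenizer tokens = new StringTokenizer(cleaned);
        int count = 0;
        while (tokens.hasMoreTokens()) {
            if (tokens.nextToken().startsWith(prefix)) {
                count++; // in the mapper this is where ("cons_count", 1) is emitted
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // "icons" contains "cons" but does not start with it, so it is skipped.
        System.out.println(countPrefixed("Consider the constant, CONSOLE and icons.", "cons")); // prints 3
    }
}
```

In the real job the mapper would emit a 1 per match and the reducer would add those 1s up; the per-line total here is just the same sum done locally.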

To sort the words starting with "cons" by frequency, all of those words should go to the same reducer, which then sums the counts and sorts them. To do that, emit every matching word under a single key:

public class MyMapper extends Mapper<Object, Text, Text, Text> {


@Override
public void map(Object key, Text value, Context context) throws IOException,
        InterruptedException {
      String[] words = value.toString().split(" ");
      for (String word : words) {
        if (word.startsWith("cons"))
              context.write(new Text("cons"), new Text(word));
    }
 }
}

Reducer :

public class MyReducer extends Reducer<Text, Text, Text, IntWritable> {

@Override
public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    // Count how often each distinct word occurs among the values
    Map<String,Integer> wordCountMap = new HashMap<String,Integer>();
    for(Text value: values){
        String word = value.toString();
        if (wordCountMap.containsKey(word)) {
           Integer count = wordCountMap.get(word);
           count++;
           wordCountMap.put(word,count);
        }else {
         wordCountMap.put(word,Integer.valueOf(1));
        }
    }

    //use some sorting mechanism to sort the map based on values.
    // ...

    // Emit each word with its total count
    for (Map.Entry<String, Integer> entry : wordCountMap.entrySet()) {
        context.write(new Text(entry.getKey()),new IntWritable(entry.getValue()));
    } 
}

}
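
For the sorting step left open in the reducer, here is one possible sketch in plain Java, assuming Java 8 or later. It is not the only way to do it and it is not Hadoop-specific: it counts the words, copies the map entries into a list, and sorts that list by count in descending order, breaking ties alphabetically. The names DescendingCounts and countAndSort are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Hadoop: counts words and orders them by frequency.
class DescendingCounts {

    // Returns (word, count) entries, highest count first; ties are ordered by word.
    static List<Map.Entry<String, Integer>> countAndSort(Iterable<String> words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : words) {
            counts.merge(word, 1, Integer::sum); // add 1, or start at 1 if absent
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed()
                .thenComparing(Map.Entry.<String, Integer>comparingByKey()));
        return entries;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList(
                "console", "constant", "console", "consider", "console", "constant");
        for (Map.Entry<String, Integer> entry : countAndSort(words)) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```

Inside the reducer, the same idea would mean iterating over the sorted list instead of wordCountMap.entrySet() when calling context.write. Note that this keeps all distinct words in memory in a single reducer, which is fine for a small prefix-filtered vocabulary but is a design trade-off for larger inputs.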

vishnu viswanath
  • The second mapper is the one we need: it eliminates all words except those that start with "cons". Hadoop sorts the intermediate key-value pairs by key, so the output is sorted in ascending order. Here we have to write our own custom sorting comparator to get the words that start with "cons" in descending order. – blackbookstar Oct 03 '14 at 21:42
  • @blackbookstar By "entire code", do you mean the sorting? Check this link on how to do that: http://stackoverflow.com/questions/109383/how-to-sort-a-mapkey-value-on-the-values-in-java – vishnu viswanath Oct 05 '14 at 09:45