Find top 10 most frequent words excluding “the”, “am”, “is”, and “are” in Hadoop MapReduce?

Question

i am working on WordsCount problem with MapReduce. I have used txt file of Lewis Carroll’s famous Through the Looking-Glass. Its pretty big file. I ran my MapReduce code and its working fine. Now i need i find out top 10 most frequent words excluding “the”, “am”, “is”, and “are”. I have no idea how to handle this issue.

Here is my code

public class WordCount {

public static class TokenizerMapper
        extends Mapper<Object, Text, Text, IntWritable>{

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
    ) throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString().replaceAll("[^a-zA-Z0-9]", " ").trim().toLowerCase());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}

public static class IntSumReducer
        extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
    ) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }

        result.set(sum);
        context.write(key, new IntWritable(sum));

    }
}


public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
   /* job.setSortComparatorClass(Text.Comparator.class);*/
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}

You couldn't find anything here? https://www.google.com/search?safe=strict&q=mapreduce+top+frequent+words+site%3Astackoverflow.com — OneCricketeer, Sep 09 '18 at 14:47

OneCricketeer · Answer 1 · 2018-09-09T14:52:00.283

^{I'm personally not going to write code until I see an attempt on your end that requires more effort than Wordcount.}

You need a second mapper and reducer to perform a Top N operation. If you used a higher level language such as Pig, Hive, Spark, etc. that's what it would do.

For starters, you can at least filter out the words from itr.nextToken() to prevent the first mapper from ever seeing them.

Then, in the reducer, your output will be unsorted, but you are already getting the sum for all words to some output directory, which is a necessary first step in getting the top words.

The solution to the problem requires you to create a new Job object to read that first output directory, write to a new output directory, and for each line of text in the mapper generate null, line as the output (use NullWritable and Text).

With this, in the reducer, all lines of text will be sent into one reducer iterator, so in order to get the Top N items, you can create a TreeMap<Integer, String> to sort words by the count (refer Sorting Descending order: Java Map). While inserting elements, larger values will automatically get pushed to the top of the tree. You can optionally optimize this by tracking the smallest element in the tree as well, and only inserting items larger than it, and/or track the tree size and only insert items larger than the N'th item (this helps if you potentially have hundreds of thousands of words).

After the loop that adds all elements to the tree, take all the top N of the string values and their counts (the tree is already sorted for you), and write them out from the reducer. With that, you should end up with the Top N items.

This is similar to the solution here https://github.com/adamjshook/mapreducepatterns/blob/master/MRDP/src/main/java/mrdp/ch3/TopTenDriver.java — OneCricketeer, Sep 09 '18 at 14:45

Find top 10 most frequent words excluding “the”, “am”, “is”, and “are” in Hadoop MapReduce?

1 Answers1