Write a MapReduce program in Hadoop that counts the number of times every unique 5-word sequence occurs in the provided Sample.txt file. The final output of your program should list each 5-word sequence and its count on a separate line.
Example:
Sam is a good boy and he always stands in top five rankings in his school.
This has to be processed as:
- Sam is a good boy : 1
- is a good boy and : 1
- a good boy and he : 1
- good boy and he always : 1
- boy and he always stands : 1
... and so on for the rest of the sentence. If a 5-word sequence occurs more than once, its count must reflect that (for example, 2 for a sequence that appears twice).
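For reference, the sliding-window counting described above can be sketched in plain Java, without Hadoop. This is only an illustration of the expected result, not the required MapReduce solution: the class name `FiveGramCount` and the method `countFiveGrams` are hypothetical, and it assumes words are separated by whitespace and punctuation stays attached to its word (so "school." is one token).

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: counts every 5-word window in a piece of text.
class FiveGramCount {
    static final int N = 5; // window size

    static Map<String, Integer> countFiveGrams(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return counts; // no words, no sequences
        }
        // Assumption: tokens are separated by runs of whitespace.
        String[] tokens = trimmed.split("\\s+");
        // The last valid window starts at tokens.length - N.
        for (int i = 0; i + N <= tokens.length; i++) {
            String gram = String.join(" ", Arrays.copyOfRange(tokens, i, i + N));
            counts.merge(gram, 1, Integer::sum); // add 1, or start at 1
        }
        return counts;
    }

    public static void main(String[] args) {
        String line = "Sam is a good boy and he always stands in top five rankings in his school.";
        countFiveGrams(line).forEach((gram, count) -> System.out.println(gram + " : " + count));
    }
}
```

A line with W words yields W - 4 windows when W is at least 5, and none otherwise. In the MapReduce version, the mapper would emit each window with a count of 1 and the reducer would do the summing that `merge` does here.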
MY CODE:
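// Assumes the enclosing Mapper class declares these reusable fields:
//   private final static IntWritable one = new IntWritable(1);
//   private final Text word = new Text();
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    StringBuilder sb = new StringBuilder();
    StringTokenizer itr = new StringTokenizer(value.toString());
    // Size the array to the actual token count; multiplying by 5 made
    // nextToken() throw NoSuchElementException once the tokens ran out.
    String[] tokens = new String[itr.countTokens()];
    for (int l = 0; l < tokens.length; l++) {
        tokens[l] = itr.nextToken();
    }
    // Stop at the last index that still has a full 5-word window;
    // looping to tokens.length read past the end of the array.
    for (int i = 0; i + 5 <= tokens.length; i++) {
        sb.append(tokens[i]);
        for (int j = i + 1; j < i + 5; j++) {
            sb.append(" ");
            sb.append(tokens[j]);
        }
        word.set(sb.toString());
        context.write(word, one);            // emit (sequence, 1)
        System.out.println(sb.toString());   // debug output only
        sb.setLength(0);                     // reset for the next window
    }
}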