I'm new to Hadoop, and I have a problem to solve with MapReduce Java code. I have a file to read of this kind, where each row contains a date and some words (A, B, C, ...):
- 2016-05-10, A, B, C, A, R, E, F, E
- 2016-05-18, A, B, F, E, E
- 2016-06-01, A, B, K, T, T, E, G, E, A, N
- 2016-06-03, A, B, K, T, T, E, F, E, L, T
I have to implement a MapReduce algorithm where, for each month, I find the total occurrences of each word, and then report the two words with the most occurrences. I'm expecting a result of this kind:
- 2016-05, A:3, E:4
- 2016-06, T:5, E:4
I've tried two different approaches. The first one:
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class GiulioTest {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line, ",");
            // The first token is the full date; keep only "yyyy-MM"
            // (take the cut point from the token itself, not from the whole line)
            String date = tokenizer.nextToken().trim();
            String dataAttuale = date.substring(0, date.lastIndexOf("-"));
            while (tokenizer.hasMoreTokens()) {
                String prod = tokenizer.nextToken().trim();
                word.set(dataAttuale + ":" + prod);
                context.write(word, one);
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        // The new-API signature takes Iterable, not Iterator; with Iterator the
        // method never overrides reduce() and the identity reducer runs instead
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "wordcount");
        job.setJarByClass(GiulioTest.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(Map.class);
        job.setCombinerClass(Reduce.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}
I was expecting this code to give me a result of this kind:
- 2016-05: A 3
- 2016-05: E 4
- 2016-05: ...
- 2016-05: other letters
- 2016-06: T 5
- 2016-06: ...
and then find a way to get the two letters with the most occurrences. Actually, I don't know if there is a way at this point to re-process the keys to extract the maximum values. Does anyone have a suggestion?
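One idea could be to do the top-2 selection in a follow-up step (or a second job) keyed on the month alone, so that all of one month's letter counts arrive together and the two maxima can be picked there. The selection itself is plain Java; a minimal self-contained sketch of that step (class and method names are my own, not part of any Hadoop API):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class TopTwo {

    // Given the letter counts for one month, keep only the two letters
    // with the highest counts, formatted like "T:5, E:4".
    static String topTwo(Map<String, Integer> counts) {
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(2)
                .map(e -> e.getKey() + ":" + e.getValue())
                .collect(Collectors.joining(", "));
    }

    public static void main(String[] args) {
        // The 2016-06 counts from the sample input above
        Map<String, Integer> june = new HashMap<>();
        june.put("A", 3); june.put("B", 2); june.put("K", 2);
        june.put("T", 5); june.put("E", 4);
        System.out.println(topTwo(june)); // prints: T:5, E:4
    }
}
```

With a month-keyed reducer this logic could run once per month and emit exactly the expected output line.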
The other solution I'm thinking about is only pseudocode, and I don't know if it is possible with the MapReduce framework: define Text key, define List listValues, define finalMap // a Map whose values are another Map of String and Integer
mapper(key, value, context) {
    month = // retrieve using StringTokenizer, splitting on ','
    tmpKey = month;
    while (itr.hasMoreTokens()) {
        listValues.add(itr.nextToken());
    }
    key.set(tmpKey);
    context.write(key, listValues); // And here is my first doubt: is it possible to write something like context.write(Text, List<String>)?
}
reduce(Text key, Iterable<List<String>> values, Context context) {
    Map<String, Integer> letterVal = new ...;
    for (List<String> listLetter : values) {
        for (String letter : listLetter) {
            if (letterVal.containsKey(letter)) {
                int tmpVal = letterVal.get(letter);
                letterVal.put(letter, tmpVal + 1);
            } else {
                letterVal.put(letter, 1);
            }
        }
    }
    finalMap.put(key, letterVal);
    context.write(key, letterVal.toString());
}