
I'm new to Hadoop, and I have a problem to solve with MapReduce Java code. I have to read a file of this kind, where each row contains a date and some words (A, B, C, ...):

  • 2016-05-10, A, B, C, A, R, E, F, E
  • 2016-05-18, A, B, F, E, E
  • 2016-06-01, A, B, K, T, T, E, G, E, A, N
  • 2016-06-03, A, B, K, T, T, E, F, E, L, T

I have to implement a MapReduce algorithm where, for each month, I find the total occurrences of each word, and then report the 2 words with the most occurrences. I'm expecting a result of this kind:

  • 2016-05, A:3, E:4
  • 2016-06, T:5, E:4
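To make the expected result concrete, here is a plain-Java sketch of the aggregation (no Hadoop involved; the class and method names are just illustrative): it groups by the `yyyy-MM` prefix of the date, counts each word, and keeps the two most frequent per month.

```java
import java.util.*;
import java.util.stream.*;

public class TopTwoSketch {
    // Count each word per month ("yyyy-MM") and keep the two most frequent.
    public static Map<String, List<String>> topTwoPerMonth(List<String> lines) {
        Map<String, Map<String, Integer>> counts = new TreeMap<>();
        for (String line : lines) {
            String[] parts = line.split(",");
            String month = parts[0].trim().substring(0, 7); // "2016-05-10" -> "2016-05"
            for (int i = 1; i < parts.length; i++) {
                counts.computeIfAbsent(month, m -> new HashMap<>())
                      .merge(parts[i].trim(), 1, Integer::sum);
            }
        }
        Map<String, List<String>> result = new TreeMap<>();
        for (Map.Entry<String, Map<String, Integer>> e : counts.entrySet()) {
            List<String> topTwo = e.getValue().entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder()))
                .limit(2)
                .map(en -> en.getKey() + ":" + en.getValue())
                .collect(Collectors.toList());
            result.put(e.getKey(), topTwo);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
            "2016-05-10, A, B, C, A, R, E, F, E",
            "2016-05-18, A, B, F, E, E",
            "2016-06-01, A, B, K, T, T, E, G, E, A, N",
            "2016-06-03, A, B, K, T, T, E, F, E, L, T");
        System.out.println(topTwoPerMonth(input));
        // prints {2016-05=[E:4, A:3], 2016-06=[T:5, E:4]}
    }
}
```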

I've tried two different solutions. The first one:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class GiulioTest {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line, ",");
            // Truncate the date token at its own last '-': "2016-05-10" -> "2016-05"
            String date = tokenizer.nextToken().trim();
            String dataAttuale = date.substring(0, date.lastIndexOf("-"));

            while (tokenizer.hasMoreTokens()) {
                String prod = tokenizer.nextToken().trim();
                word.set(dataAttuale + ":" + prod);
                context.write(word, one);
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        // Note: the parameter must be Iterable<IntWritable>, not Iterator<IntWritable>,
        // otherwise this method does not override reduce() and is never called.
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "wordcount");
        job.setJarByClass(GiulioTest.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setCombinerClass(Reduce.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}

I was expecting this code to give me a result of this kind:

  • 2016-05: A 3
  • 2016-05: E 4
  • 2016-05: ...
  • 2016-05: other letters
  • 2016-06: T 5
  • 2016-06: ...

and then find a way to pick the two letters with the max occurrences. Actually, I don't know if there is a way, at this point, to re-elaborate the key to extract the maximum values. Does anyone have something to suggest?
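One common pattern (just a sketch, not confirmed by any answer here) is to chain a second MapReduce job: its mapper re-parses each `month:word count` output line and emits `(month, "word:count")`, so the second reducer sees all counts for one month together and can keep the two largest. The selection logic of that second reducer, written as plain Java so it can be tested without a cluster:

```java
import java.util.*;

public class TopTwoReducerLogic {
    // Given all "word:count" values for one month, keep the two largest counts.
    // (In a real second job, the reducer would iterate an Iterable<Text>.)
    public static List<String> topTwo(Iterable<String> wordCounts) {
        List<String> all = new ArrayList<>();
        for (String wc : wordCounts) all.add(wc);
        // Sort descending by count; break ties alphabetically for a stable result.
        all.sort((a, b) -> {
            int ca = Integer.parseInt(a.split(":")[1]);
            int cb = Integer.parseInt(b.split(":")[1]);
            return ca != cb ? Integer.compare(cb, ca) : a.compareTo(b);
        });
        return all.subList(0, Math.min(2, all.size()));
    }

    public static void main(String[] args) {
        // Counts for 2016-05 from the sample data:
        System.out.println(topTwo(Arrays.asList("A:3", "B:2", "C:1", "E:4", "F:2", "R:1")));
        // prints [E:4, A:3]
    }
}
```

Because all values for one month reach a single reduce call, no global state is needed; the same idea also works inside a single job if the first job's reducer keys on the month alone and values carry the word.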

Another solution I'm thinking of, but only in pseudocode, and I don't know if it's possible with the MapReduce framework: define a Text key, a List listValues, and a finalMap (a Map whose values are themselves Maps from String to Integer):

mapper(key, value, context) {
   month = // retrieve with a StringTokenizer, splitting on ','
   while (itr.hasMoreTokens()) {
        listValues.add(itr.nextToken())
   }
   key.set(month)
   context.write(key, listValues) // And here is my first doubt: is it possible to emit something like context.write(Text, List<String>)?
}

reduce(Text key, Iterable<List<String>> values, Context context) {
   Map<String, Int> letterVal = new ...
   for (List<String> listLetter : values) {
      for (String letter : listLetter) {
         if (letterVal.contains(letter)) {
             letterVal.put(letter, letterVal.get(letter) + 1)
         } else {
             letterVal.put(letter, 1)
         }
      }
   }
   finalMap.put(key, letterVal)
   context.write(key, finalMap.get(key).toString())
}
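The counting part of that pseudo-reducer can at least be checked with plain Java over ordinary collections (the Hadoop plumbing for shipping a List of strings as a value, e.g. via an ArrayWritable subclass, is a separate question):

```java
import java.util.*;

public class LetterCounter {
    // Merge several lists of letters (one per input row) into a frequency map,
    // mirroring the nested loops of the pseudo-reduce.
    public static Map<String, Integer> countLetters(Iterable<List<String>> values) {
        Map<String, Integer> letterVal = new HashMap<>();
        for (List<String> listLetter : values) {
            for (String letter : listLetter) {
                // merge() covers both branches: insert 1, or add 1 to the old value
                letterVal.merge(letter, 1, Integer::sum);
            }
        }
        return letterVal;
    }

    public static void main(String[] args) {
        List<List<String>> may = Arrays.asList(
            Arrays.asList("A", "B", "C", "A", "R", "E", "F", "E"),
            Arrays.asList("A", "B", "F", "E", "E"));
        System.out.println(countLetters(may));
        // A=3, B=2, C=1, E=4, F=2, R=1 (HashMap order is unspecified)
    }
}
```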
GIULIO

  • Have you tried something? Hints: have a look at the following notions: composite key / custom partitioner / chaining MapReduce jobs / custom Writable. Those are all you need (or actually more than what you need). – vefthym Jun 02 '17 at 10:40
  • You should post the code that you've written so we can understand how to help you. Simply posting the problem you're trying to solve could be interpreted as asking the community to write code for you, and that's not what SO is about. – Graham Jun 02 '17 at 10:42
  • I've added my code @Graham – GIULIO Jun 02 '17 at 11:32
  • Your first approach looks workable; perhaps you can add some logs and the command you run? For the second approach you can use `ArrayWritable` http://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/io/ArrayWritable.html – fi11er Jun 02 '17 at 14:01
  • @fi11er I've fixed the first bug in my code, but now I have no idea how to elaborate the reducer to find the 2 letters with the most occurrences – GIULIO Jun 03 '17 at 15:03

0 Answers