Sorted Hadoop WordCount Java

Question

I am running Hadoop's WordCount program in Java, and my first job (getting all the words and their counts) works fine. However, I run into a problem with the second job, which should sort the words by their number of occurrences. I have already read this question (Hadoop WordCount sorted by word occurrences) to understand how to write a second job, but I don't have the same problem.

My code:

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;


public class simpleWordExample {

public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
    } 


    public class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {

            int sum = 0;
            for (IntWritable value:values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));

        }

    } 


class Map1 extends Mapper<LongWritable, Text, Text, IntWritable> {

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer stringTokenizer = new StringTokenizer(line);
        while (stringTokenizer.hasMoreTokens()){
            int number = 999;
            String word = "empty";

            if (stringTokenizer.hasMoreTokens()) {
                String str0 = stringTokenizer.nextToken();
                word = str0.trim();
            }

            if (stringTokenizer.hasMoreElements()) {
                String str1 = stringTokenizer.nextToken();
                number = Integer.parseInt(str1.trim());
            }
            context.write(new Text(word), new IntWritable(number));
        }

    }

}

class Reduce1 extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        for (IntWritable value:values) {
            context.write(key, new IntWritable(value.get()));
        }
      }
}



public static void main(String[] args) throws Exception {

    Job job1 = new Job();
    Job job2 = new Job();

   job1.setJobName("wordCount");

   job1.setJarByClass(simpleWordExample.class);

   job1.setOutputKeyClass(Text.class);
   job1.setOutputValueClass(IntWritable.class);

   job1.setMapperClass(Map.class);
   job1.setCombinerClass(Reduce.class);
   job1.setReducerClass(Reduce.class);

   job1.setInputFormatClass(TextInputFormat.class);
   job1.setOutputFormatClass(TextOutputFormat.class);

   FileInputFormat.setInputPaths(job1, new Path("file:///home/cloudera/data.txt"));
   FileOutputFormat.setOutputPath(job1, new Path("file:///home/cloudera/output"));


   job2.setJobName("WordCount1");

   job2.setJarByClass(simpleWordExample.class);

   job2.setOutputKeyClass(Text.class);
   job2.setOutputValueClass(IntWritable.class);

   job2.setMapperClass(Map1.class);
   job2.setCombinerClass(Reduce1.class);
   job2.setReducerClass(Reduce1.class);

   job2.setInputFormatClass(TextInputFormat.class);
   job2.setOutputFormatClass(TextOutputFormat.class);

   FileInputFormat.setInputPaths(job2, new Path("file:///home/cloudera/output/part-00000"));
   FileOutputFormat.setOutputPath(job2, new Path("file:///home/cloudera/outputFinal"));


   job1.submit();
   if (job1.waitForCompletion(true)) {
       job2.submit();
       job2.waitForCompletion(true);
   }
}

}

And the error I get in the console:

15/05/02 09:56:34 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/05/02 09:56:37 WARN conf.Configuration: session.id is deprecated. Instead, use dfs.metrics.session-id
15/05/02 09:56:37 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
15/05/02 09:56:39 WARN mapreduce.JobSubmitter: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
15/05/02 09:56:39 WARN mapreduce.JobSubmitter: No job jar file set.  User classes may not be found. See Job or Job#setJar(String).
15/05/02 09:56:39 INFO input.FileInputFormat: Total input paths to process : 1
15/05/02 09:56:41 INFO mapreduce.JobSubmitter: number of splits:1
15/05/02 09:56:41 WARN conf.Configuration: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
15/05/02 09:56:41 WARN conf.Configuration: mapreduce.combine.class is deprecated. Instead, use mapreduce.job.combine.class
15/05/02 09:56:41 WARN conf.Configuration: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class
15/05/02 09:56:41 WARN conf.Configuration: mapred.job.name is deprecated. Instead, use mapreduce.job.name
15/05/02 09:56:41 WARN conf.Configuration: mapreduce.reduce.class is deprecated. Instead, use mapreduce.job.reduce.class
15/05/02 09:56:41 WARN conf.Configuration: mapreduce.inputformat.class is deprecated. Instead, use mapreduce.job.inputformat.class
15/05/02 09:56:41 WARN conf.Configuration: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
15/05/02 09:56:41 WARN conf.Configuration: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
15/05/02 09:56:41 WARN conf.Configuration: mapreduce.outputformat.class is deprecated. Instead, use mapreduce.job.outputformat.class
15/05/02 09:56:41 WARN conf.Configuration: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
15/05/02 09:56:41 WARN conf.Configuration: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
15/05/02 09:56:41 WARN conf.Configuration: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
15/05/02 09:56:45 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1998350370_0001
15/05/02 09:56:48 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
15/05/02 09:56:48 INFO mapreduce.Job: Running job: job_local1998350370_0001
15/05/02 09:56:48 INFO mapred.LocalJobRunner: OutputCommitter set in config null
15/05/02 09:56:48 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
15/05/02 09:56:48 INFO mapred.LocalJobRunner: Waiting for map tasks
15/05/02 09:56:48 INFO mapred.LocalJobRunner: Starting task: attempt_local1998350370_0001_m_000000_0
15/05/02 09:56:48 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
15/05/02 09:56:48 INFO mapred.MapTask: Processing split: file:/home/cloudera/data.txt:0+1528889
15/05/02 09:56:48 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
15/05/02 09:56:52 INFO mapreduce.Job: Job job_local1998350370_0001 running in uber mode : false
15/05/02 09:56:52 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
15/05/02 09:56:52 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
15/05/02 09:56:52 INFO mapred.MapTask: soft limit at 83886080
15/05/02 09:56:52 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
15/05/02 09:56:52 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
15/05/02 09:56:52 INFO mapreduce.Job:  map 0% reduce 0%
15/05/02 09:56:57 INFO mapred.LocalJobRunner: 
15/05/02 09:56:57 INFO mapred.MapTask: Starting flush of map output
15/05/02 09:56:57 INFO mapred.MapTask: Spilling map output
15/05/02 09:56:57 INFO mapred.MapTask: bufstart = 0; bufend = 2109573; bufvoid = 104857600
15/05/02 09:56:57 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 25406616(101626464); length = 807781/6553600
15/05/02 09:56:58 INFO mapred.LocalJobRunner: map > sort
15/05/02 09:56:58 INFO mapreduce.Job:  map 67% reduce 0%
15/05/02 09:56:59 INFO mapred.LocalJobRunner: Map task executor complete.
15/05/02 09:56:59 WARN mapred.LocalJobRunner: job_local1998350370_0001
java.lang.Exception: java.lang.RuntimeException: java.lang.NoSuchMethodException: simpleWordExample$Reduce.<init>()
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:401)
Caused by: java.lang.RuntimeException: java.lang.NoSuchMethodException: simpleWordExample$Reduce.<init>()
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:131)
    at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1619)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1603)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1452)
    at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:693)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:761)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:338)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:233)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.NoSuchMethodException: simpleWordExample$Reduce.<init>()
    at java.lang.Class.getConstructor0(Class.java:2706)
    at java.lang.Class.getDeclaredConstructor(Class.java:1985)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:125)
    ... 13 more
15/05/02 09:57:00 INFO mapreduce.Job: Job job_local1998350370_0001 failed with state FAILED due to: NA
15/05/02 09:57:00 INFO mapreduce.Job: Counters: 21
    File System Counters
        FILE: Number of bytes read=1529039
        FILE: Number of bytes written=174506
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
    Map-Reduce Framework
        Map input records=30292
        Map output records=201946
        Map output bytes=2109573
        Map output materialized bytes=0
        Input split bytes=93
        Combine input records=0
        Combine output records=0
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=122
        CPU time spent (ms)=0
        Physical memory (bytes) snapshot=0
        Virtual memory (bytes) snapshot=0
        Total committed heap usage (bytes)=165613568
    File Input Format Counters 
        Bytes Read=1528889

Thank you for your time and help!

Edit (global): the code now uses the new API.

score 1 · Answer 1 · answered May 01 '15 at 15:47

I have never used Hadoop myself, but it looks like Hadoop is trying to instantiate a "Map" instance using the default no-args constructor. It throws NoSuchMethodException because it can't find a no-args constructor.

lance-java


yeah but the first Mapper I use doesn't have a constructor either and it works :/ – Melanie Journe May 01 '15 at 23:34

score 0 · Answer 2 · answered May 01 '15 at 19:53

Based on the following lines from your code:

-- Map1:
class Map1 extends MapReduceBase implements Mapper<Object, Text, IntWritable, Text>

-- From the driver:
conf2.setInputFormat(TextInputFormat.class);

When the input format is TextInputFormat, the map input key is always a LongWritable (the byte offset of the line in the file) and the value is a Text (the line itself). Map1 should therefore declare LongWritable, Text as its input types, as your Map class already does.
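
As a rough illustration of what those two types carry, the sketch below imitates in plain Java (no Hadoop classes involved) how a text file is cut into (offset, line) records. The names LineRecordSketch, Record and toRecords are made up for this example, and it assumes UTF-8 text with "\n" line endings; the real record reader handles more cases than this.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: mimics the (offset, line) pairs a mapper receives.
class LineRecordSketch {

    // One input record: the byte offset where the line starts, and the line text.
    static final class Record {
        final long offset;
        final String line;

        Record(long offset, String line) {
            this.offset = offset;
            this.line = line;
        }
    }

    // Splits the content on '\n' and tracks the starting byte offset of each line.
    static List<Record> toRecords(String content) {
        List<Record> records = new ArrayList<>();
        long offset = 0;
        for (String line : content.split("\n")) {
            records.add(new Record(offset, line));
            // +1 accounts for the newline byte that split() removed.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        for (Record r : toRecords("hello world\nfoo bar\n")) {
            System.out.println(r.offset + " -> " + r.line);
        }
    }
}
```

The key here plays the role of the LongWritable and the line plays the role of the Text, which is why a word-counting mapper usually ignores the key and only tokenizes the value.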

Venkat


This seems very logical so I did it (as you can see on the edit), Map1 has now the same structure as Map but I still have the same console log :/ – Melanie Journe May 01 '15 at 23:33
Another issue is that the output of Map1 and the input of Reduce1 do not match. Map1 emits Text, IntWritable while Reduce1 takes IntWritable, Text, in that order. – Venkat May 02 '15 at 01:28
Done & post edit. No change in the log :/ Could the error be when I set everything in the main because it says "configuration error" ? – Melanie Journe May 02 '15 at 08:59

salmanbw · Answer 3 · 2015-05-02T16:09:49.887

This could be caused by mixing the two APIs. Hadoop has two APIs: the older one, mapred, and the newer one, mapreduce. Your code imports both of them. Try commenting out

import org.apache.hadoop.mapred.*;

and importing all the classes you need from the new API. Hope this works for you.

After commenting that out, try writing your code according to the new API:

public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Emit (word, 1) for every whitespace-separated token in the line.
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}


// Must be static so that Hadoop can create it through a no-argument constructor.
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum all the counts emitted for this word.
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

I have written the first mapper and reducer for you; you can do the same for the second mapper and reducer.

In the reduce method I got an error at "IntWritable value:values"; it says "Type mismatch: cannot convert from element type Object to IntWritable" – Melanie Journe, May 02 '15 at 13:04
I used the docs to bring everything up to date, even the main method, but when I launch the program the first mapping isn't done; you can see the new log in my post. I looked here: https://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/mapreduce/Job.html and http://stackoverflow.com/questions/8603788/hadoop-jobconf-class-is-deprecated-need-updated-example – Melanie Journe, May 02 '15 at 17:01

score 0 · Answer 4 · answered May 08 '15 at 16:16

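Classes Map1, Reduce1 and Reduce have to be static. Hadoop creates mappers, combiners and reducers by reflection through their no-argument constructor, and a non-static nested class has none, because its constructor implicitly takes an instance of the enclosing class. That is what NoSuchMethodException: simpleWordExample$Reduce.<init>() is reporting.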
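
The effect of the static keyword can be reproduced without Hadoop. The following is a minimal sketch in plain Java; the names NestedCtorDemo, Inner, Nested and hasNoArgConstructor are invented for illustration. It uses the same getDeclaredConstructor() lookup that appears in the stack trace from the question, but it is not Hadoop's actual code.

```java
import java.lang.reflect.Constructor;

// Hypothetical demo class; none of these names come from the question or from Hadoop.
class NestedCtorDemo {

    // Inner (non-static) class: its only constructor implicitly takes a NestedCtorDemo.
    class Inner {
    }

    // Static nested class: has a genuine no-argument constructor.
    static class Nested {
    }

    // Tries to create an instance the way a reflection-based framework would:
    // look up the declared no-arg constructor, then invoke it.
    static boolean hasNoArgConstructor(Class<?> type) {
        try {
            Constructor<?> ctor = type.getDeclaredConstructor();
            ctor.setAccessible(true);
            ctor.newInstance();
            return true;
        } catch (NoSuchMethodException e) {
            // Same exception type as in the question's log.
            return false;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Inner:  " + hasNoArgConstructor(Inner.class));
        System.out.println("Nested: " + hasNoArgConstructor(Nested.class));
    }
}
```

Running main should report false for Inner and true for Nested. This would also explain why Map worked in the question while Reduce did not: Map is already declared static.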

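There is also an error in the job2 configuration. To sort by occurrence the count has to be the key, so Map1 and Reduce1 should emit (IntWritable, Text) pairs and the job has to declare the same types. Instead of:

job2.setOutputKeyClass(Text.class);
job2.setOutputValueClass(IntWritable.class);

you should have:

job2.setOutputKeyClass(IntWritable.class);
job2.setOutputValueClass(Text.class);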
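
MapReduce sorts records by key between the map and reduce phases, which is why swapping key and value is enough to get the output ordered by count. The sketch below imitates that second job in plain Java, with a TreeMap standing in for the framework's sort. The class and method names are hypothetical, and it assumes job1 wrote lines of the form word, tab, count (the default key/value separator of TextOutputFormat).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for job2; it only imitates the map, sort and reduce steps.
class SortByCountSketch {

    static List<String> sortByCount(List<String> wordCountLines) {
        // TreeMap keeps its keys in ascending order, like the sort on IntWritable keys.
        Map<Integer, List<String>> byCount = new TreeMap<>();

        // "Map" step: parse each line and emit (count, word) instead of (word, count).
        for (String line : wordCountLines) {
            String[] parts = line.split("\t");
            if (parts.length != 2) {
                continue; // skip malformed lines rather than inventing default values
            }
            int count = Integer.parseInt(parts[1].trim());
            byCount.computeIfAbsent(count, k -> new ArrayList<>()).add(parts[0]);
        }

        // "Reduce" step: write every word that shares a count, in key order.
        List<String> output = new ArrayList<>();
        for (Map.Entry<Integer, List<String>> entry : byCount.entrySet()) {
            for (String word : entry.getValue()) {
                output.add(entry.getKey() + "\t" + word);
            }
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> job1Output = Arrays.asList("the\t3", "a\t1", "cat\t2");
        sortByCount(job1Output).forEach(System.out::println);
    }
}
```

Note that the order is ascending, so the most frequent words come last; getting a descending order in a real job would need a custom sort comparator on the key.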

I hope this helps.
