
I'm implementing a word-count-style program on the Google Books Ngrams dataset. My input is a binary file: https://aws.amazon.com/datasets/google-books-ngrams/ and I was told to use SequenceFileInputFormat in order to read it.

I'm using Hadoop 2.6.5.

    // Imports used in the driver (shown for completeness):
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "PartA");
    job.setJarByClass(MyDriver.class);
    job.setMapperClass(MyMapperA.class);
    job.setReducerClass(MyReducerA.class);
    job.setCombinerClass(MyReducerA.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setInputFormatClass(SequenceFileInputFormat.class); // the newly added line

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
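
For reference, here is a minimal sketch of a word-count mapper over these records, assuming (I haven't verified this) that the ngram sequence files store LongWritable keys and Text values with tab-separated fields, the ngram being the first field:

    // Sketch of a mapper for this job, shown for context. It assumes the
    // sequence-file records arrive as LongWritable keys and Text values,
    // where the value is a tab-separated ngram line (an assumption about the dataset).
    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class MyMapperA extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // First tab-separated field is the ngram itself (assumption).
            String ngram = value.toString().split("\t", 2)[0];
            for (String token : ngram.split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE); // emit (word, 1) for the combiner/reducer
                }
            }
        }
    }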

Sadly, I'm running into problems after adding this line:

            job.setInputFormatClass(SequenceFileInputFormat.class);

The error received:

java.lang.Exception: java.lang.IllegalArgumentException: Unknown codec: com.hadoop.compression.lzo.LzoCodec
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)

I've tried adding several Maven dependencies, but without success.
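
One thing I can check locally is whether the codec class is even visible on the application classpath; a throwaway diagnostic sketch (the class name is the one from the stack trace):

    // Throwaway check: if this throws ClassNotFoundException, the hadoop-lzo
    // jar is not on the application classpath at all; if the class loads but
    // the job still fails, the problem is codec registration or native libraries.
    try {
        Class.forName("com.hadoop.compression.lzo.LzoCodec");
        System.out.println("LzoCodec is on the classpath");
    } catch (ClassNotFoundException e) {
        System.out.println("LzoCodec is NOT on the classpath: " + e);
    }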

  • Hadoop doesn't come with LZO out of the box. You need to actually add it to the classpath of all the NodeManagers and Hadoop clients, probably not just your application – OneCricketeer Dec 09 '17 at 00:36
  • Similar issue https://stackoverflow.com/questions/23441142/class-com-hadoop-compression-lzo-lzocodec-not-found-for-spark-on-cdh-5 – OneCricketeer Dec 09 '17 at 00:39
  • I've seen this post; now I'm searching for a way to install it on Windows. Part of the post refers only to Mac/Linux/Debian – jonb Dec 09 '17 at 07:45
  • Tried adding the dependency com.hadoop.gplcompression:hadoop-lzo:cdh4-0.4.15-gplextras, but receiving: Missing artifact com.hadoop.gplcompression:hadoop-lzo:jar:cdh4-0.4.15-gplextras – jonb Dec 09 '17 at 07:57
  • You still need to actually build LZO from source on your Windows boxes. Alternatively, you can use a Linux VM, or an EMR instance for doing your work. https://github.com/twitter/hadoop-lzo/blob/master/README.md – OneCricketeer Dec 09 '17 at 14:46
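
Based on the comments above, it sounds like hadoop-lzo has to be on the classpath of the client (and the NodeManagers), and the codec has to be registered with the configuration. A minimal sketch of the registration part, assuming the hadoop-lzo jar and its native libraries are already available (I haven't been able to test this on Windows yet):

    // Sketch only: register the LZO codecs so that the SequenceFile reader
    // can resolve "com.hadoop.compression.lzo.LzoCodec". This assumes the
    // hadoop-lzo jar is on the classpath and its native libraries (built from
    // the linked README) are loadable via java.library.path.
    Configuration conf = new Configuration();
    conf.set("io.compression.codecs",
            "org.apache.hadoop.io.compress.DefaultCodec,"
          + "org.apache.hadoop.io.compress.GzipCodec,"
          + "org.apache.hadoop.io.compress.BZip2Codec,"
          + "com.hadoop.compression.lzo.LzoCodec,"
          + "com.hadoop.compression.lzo.LzopCodec");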

0 Answers