
I'm implementing a word-count-style program on the Google Books Ngrams dataset. My input is a binary file: https://aws.amazon.com/datasets/google-books-ngrams/ and I was told to use SequenceFileInputFormat in order to read it.

I'm using Hadoop 2.6.5.

    // Imports used in the driver (shown for completeness):
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "PartA");
    job.setJarByClass(MyDriver.class);
    job.setMapperClass(MyMapperA.class);
    job.setReducerClass(MyReducerA.class);
    job.setCombinerClass(MyReducerA.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setInputFormatClass(SequenceFileInputFormat.class); // the newly added line

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
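
For reference, here is a minimal sketch of a word-count mapper over these records, assuming (I haven't verified this) that the ngram sequence files store LongWritable keys and Text values with tab-separated fields, the ngram being the first field:

    // Sketch of a mapper for this job, shown for context. It assumes the
    // sequence-file records arrive as LongWritable keys and Text values,
    // where the value is a tab-separated ngram line (an assumption about the dataset).
    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class MyMapperA extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // First tab-separated field is the ngram itself (assumption).
            String ngram = value.toString().split("\t", 2)[0];
            for (String token : ngram.split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE); // emit (word, 1) for the combiner/reducer
                }
            }
        }
    }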

Sadly, I'm running into problems after adding this line:

            job.setInputFormatClass(SequenceFileInputFormat.class);

The error received:

java.lang.Exception: java.lang.IllegalArgumentException: Unknown codec: com.hadoop.compression.lzo.LzoCodec
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)

I've tried adding several Maven dependencies, but without success.
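
One thing I can check locally is whether the codec class is even visible on the application classpath; a throwaway diagnostic sketch (the class name is the one from the stack trace):

    // Throwaway check: if this throws ClassNotFoundException, the hadoop-lzo
    // jar is not on the application classpath at all; if the class loads but
    // the job still fails, the problem is codec registration or native libraries.
    try {
        Class.forName("com.hadoop.compression.lzo.LzoCodec");
        System.out.println("LzoCodec is on the classpath");
    } catch (ClassNotFoundException e) {
        System.out.println("LzoCodec is NOT on the classpath: " + e);
    }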

  • Hadoop doesn't come with LZO out of the box. You need to actually add it to the classpath of all the NodeManagers and Hadoop clients, probably not just your application – OneCricketeer Dec 09 '17 at 00:36
  • Similar issue https://stackoverflow.com/questions/23441142/class-com-hadoop-compression-lzo-lzocodec-not-found-for-spark-on-cdh-5 – OneCricketeer Dec 09 '17 at 00:39
  • I've seen this post; now I'm searching for a way to install it on Windows. Part of the post refers only to Mac/Linux/Debian – jonb Dec 09 '17 at 07:45
  • Tried adding the dependency com.hadoop.gplcompression:hadoop-lzo:cdh4-0.4.15-gplextras, but receiving: Missing artifact com.hadoop.gplcompression:hadoop-lzo:jar:cdh4-0.4.15-gplextras – jonb Dec 09 '17 at 07:57
  • You still need to actually build LZO from source on your Windows boxes. Alternatively, you can use a Linux VM, or an EMR instance for doing your work. https://github.com/twitter/hadoop-lzo/blob/master/README.md – OneCricketeer Dec 09 '17 at 14:46
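
Based on the comments above, it sounds like hadoop-lzo has to be on the classpath of the client (and the NodeManagers), and the codec has to be registered with the configuration. A minimal sketch of the registration part, assuming the hadoop-lzo jar and its native libraries are already available (I haven't been able to test this on Windows yet):

    // Sketch only: register the LZO codecs so that the SequenceFile reader
    // can resolve "com.hadoop.compression.lzo.LzoCodec". This assumes the
    // hadoop-lzo jar is on the classpath and its native libraries (built from
    // the linked README) are loadable via java.library.path.
    Configuration conf = new Configuration();
    conf.set("io.compression.codecs",
            "org.apache.hadoop.io.compress.DefaultCodec,"
          + "org.apache.hadoop.io.compress.GzipCodec,"
          + "org.apache.hadoop.io.compress.BZip2Codec,"
          + "com.hadoop.compression.lzo.LzoCodec,"
          + "com.hadoop.compression.lzo.LzopCodec");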

0 Answers