How to use Hadoop Streaming with LZO-compressed Sequence Files?

Question

I'm trying to play around with the Google ngrams dataset using Amazon's Elastic Map Reduce. There's a public dataset at http://aws.amazon.com/datasets/8172056142375670, and I want to use Hadoop streaming.

For the input files, it says "We store the datasets in a single object in Amazon S3. The file is in sequence file format with block level LZO compression. The sequence file key is the row number of the dataset stored as a LongWritable and the value is the raw data stored as TextWritable."

What do I need to do in order to process these input files with Hadoop Streaming?

I tried adding an extra "-inputformat SequenceFileAsTextInputFormat" to my arguments, but this doesn't seem to work -- my jobs keep failing for some unspecified reason. Are there other arguments I'm missing?

I've tried using a very simple identity script as both my mapper and reducer:

#!/usr/bin/env ruby

STDIN.each do |line|
  puts line
end

but this doesn't work.

mat kelcey · Accepted Answer · 2011-12-28T06:59:32.087

6

LZO is packaged as part of Elastic MapReduce, so there's no need to install anything.

I just tried this and it works:

 hadoop jar ~hadoop/contrib/streaming/hadoop-streaming.jar \
  -D mapred.reduce.tasks=0 \
  -input s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-all/1gram/ \
  -inputformat SequenceFileAsTextInputFormat \
  -output test_output \
  -mapper org.apache.hadoop.mapred.lib.IdentityMapper
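If you want a script mapper instead of IdentityMapper, here is a minimal sketch in Ruby. It assumes that with SequenceFileAsTextInputFormat, streaming hands each record to the mapper as one line of the form "row number, tab, raw value"; the helper names `split_record` and `run_mapper` are made up for illustration. It drops the row-number key and prints only the value.

```ruby
#!/usr/bin/env ruby
# Sketch of a streaming mapper for input read via SequenceFileAsTextInputFormat.
# Assumption: each stdin line is "<row number>\t<raw value>".

# Split one input line into [key, value] at the first tab only,
# since the raw value may itself contain tabs.
def split_record(line)
  key, value = line.chomp.split("\t", 2)
  [key, value.to_s]
end

# Read records from `input` and write only the raw values to `output`.
def run_mapper(input, output)
  input.each_line do |line|
    _key, value = split_record(line)
    output.puts(value) unless value.empty?
  end
end

run_mapper($stdin, $stdout) if __FILE__ == $PROGRAM_NAME
```

Saved as, say, mapper.rb (a hypothetical file name), it could be shipped with the job by replacing the -mapper argument above with "-file mapper.rb -mapper mapper.rb".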

edited Dec 28 '11 at 06:59

answered Jun 15 '11 at 21:35

mat kelcey


Do you know if the LZO files need to be indexed (e.g. via the Kevin Weil hadoop-lzo indexer) before the files will be splittable by Hadoop on EMR, or does Hadoop splitting of large files just work as if they were text files? – Dolan Antenucci Oct 22 '12 at 01:40
EMR doesn't index or split LZO files by default; you have to create the index file first (using the hadoop-lzo indexer you mentioned) – Dan Osipov Aug 26 '13 at 14:24

score 3 · Answer 2 · answered Feb 24 '11 at 03:46

LZO compression was removed from Hadoop from 0.20.x onwards due to licensing issues. If you want to process LZO-compressed sequence files, the LZO native libraries have to be installed and configured on the Hadoop cluster.

Kevin's Hadoop-lzo project is the current working solution I am aware of. I have tried it. It works.

Install the lzo-devel packages on the OS, if you have not already done so. These packages enable LZO compression at the OS level; without them, Hadoop LZO compression won't work.

Follow the instructions in the hadoop-lzo README and compile it. The build produces the hadoop-lzo-lib jar and the hadoop-lzo native libraries. Make sure you compile on the machine where your cluster is configured (or on a machine of the same architecture).

The standard Hadoop native libraries are also required. These ship with the distribution by default for Linux. If you are using Solaris, you also need to build Hadoop from source in order to get the standard Hadoop native libraries.

Restart the cluster once all changes are made.

score 1 · Answer 3 · answered Feb 21 '11 at 20:47

You may want to look at this https://github.com/kevinweil/hadoop-lzo

chiku


score 0 · Answer 4 · answered Dec 04 '12 at 09:31

I got weird results using LZO, and my problem was resolved by switching to another codec:

-D mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec
-D mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec

Then things just work. You don't need to (and maybe shouldn't) change the -inputformat.

Version: 0.20.2-cdh3u4, 214dd731e3bdb687cb55988d3f47dd9e248c5690
