Save and read complicated Writable value in Hadoop job

Question

I need to pass a complex value (one that implements Writable) from the output of a first MapReduce job to the input of a second MapReduce job. The results of the first job are saved to a file. With the default output and input formats, the file can store Text or BytesWritable data, so I need a simple way to convert my Writable to Text or BytesWritable and back. Does one exist? Is there an alternative way to do this? Thanks a lot.

I haven't tried it myself but you might be able to write your output to a sequencefile, then you don't need any conversions. Someone else will probably be able to elaborate on this. It might get you started. http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/SequenceFileOutputFormat.html — DDW, Oct 15 '13 at 11:54

Alex A. · Accepted Answer · 2013-10-16T05:05:58.257

User irW is correct: use SequenceFileOutputFormat. SequenceFile solves this exact problem, with no conversion to Text or BytesWritable needed. When setting up your job, use job.setOutputKeyClass and job.setOutputValueClass to set the Writable subclasses you are using:

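// New-API class, for use with Job:
// import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
job.setOutputKeyClass(MyWritable1.class);     // key type written by the first job
job.setOutputValueClass(MyWritable2.class);   // value type written by the first job
job.setOutputFormatClass(SequenceFileOutputFormat.class);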

This will use the Hadoop SequenceFile format to store your Writables. Then in your next job, use SequenceFileInputFormat:

// import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
job.setInputFormatClass(SequenceFileInputFormat.class);

Then the input key and value for the mapper in this job will be the two Writable classes you originally specified as output in the previous job.

Note that it is crucial that your complex Writable subclass is implemented correctly. Besides providing a no-argument constructor, its write and readFields methods must write and read every field in the same order, including any Writable fields nested in the class, which must write and read their own information as well.
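To illustrate that last point, here is a minimal sketch of a nested Writable and a round trip through a byte buffer. The class names Point, Track and WritableRoundTrip are hypothetical, and the local Writable interface is only a stand-in with the same two methods as org.apache.hadoop.io.Writable so the example compiles without Hadoop on the classpath; in a real job you would import Hadoop's interface instead.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Stand-in for org.apache.hadoop.io.Writable (same two method signatures).
interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical nested value: a simple Writable with two fields.
class Point implements Writable {
    double x, y;

    Point() {}  // no-argument constructor, needed so instances can be created before readFields
    Point(double x, double y) { this.x = x; this.y = y; }

    public void write(DataOutput out) throws IOException {
        out.writeDouble(x);
        out.writeDouble(y);
    }

    public void readFields(DataInput in) throws IOException {
        x = in.readDouble();  // read in exactly the order written
        y = in.readDouble();
    }
}

// Hypothetical complex value: a name plus a variable-length list of nested Writables.
class Track implements Writable {
    String name = "";
    List<Point> points = new ArrayList<>();

    Track() {}

    public void write(DataOutput out) throws IOException {
        out.writeUTF(name);
        out.writeInt(points.size());  // write the length so readFields knows how many to read
        for (Point p : points) {
            p.write(out);             // delegate to the nested Writable
        }
    }

    public void readFields(DataInput in) throws IOException {
        name = in.readUTF();
        int n = in.readInt();
        points.clear();               // instances may be reused, so reset old state first
        for (int i = 0; i < n; i++) {
            Point p = new Point();
            p.readFields(in);
            points.add(p);
        }
    }
}

class WritableRoundTrip {
    // Serializes t into a byte buffer and reads it back into the given target instance.
    static Track roundTripInto(Track t, Track target) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            t.write(new DataOutputStream(buf));
            target.readFields(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
            return target;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static Track roundTrip(Track t) {
        return roundTripInto(t, new Track());
    }

    public static void main(String[] args) {
        Track t = new Track();
        t.name = "demo";
        t.points.add(new Point(1.0, 2.0));
        Track copy = roundTrip(t);
        System.out.println(copy.name + " has " + copy.points.size() + " point(s)");
    }
}
```

A round trip like this is a cheap way to check a Writable before running it through two jobs: if the copy differs from the original, write and readFields disagree about the field order or about how many nested items there are.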

Since I'm a fan of keeping things simple, I will add this side note in the comments. If at any point you might want to read your files with anything other than the Java API, you're going to want to use Avro instead, which is a language-independent serialization format. It would allow you to easily process your data with useful tools like Pig or any program compatible with MapReduce streaming. I've been through the pain of using SequenceFiles and regretting it; see this question: http://stackoverflow.com/questions/18884666/handling-writables-fully-qualified-name-changes-in-hadoop-sequencefile — Alex A., Oct 16 '13 at 05:02
