
I am new to Hadoop and MapReduce and have been trying to write output to multiple files based on keys. Could anyone please provide a clear idea or a Java code snippet showing how to do this? My mapper works exactly fine, and after the shuffle the keys and their corresponding values are obtained as expected. Thanks!

What I am trying to do is output only a few records from the input file to a new file, so that the new output file contains only the required records and ignores the rest of the irrelevant records. This would work fine even if I don't use MultipleTextOutputFormat. The logic I implemented in the mapper is as follows:

    // These imports go at the top of the enclosing driver class's file.
    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public static class MapClass extends
            Mapper<LongWritable, Text, Text, Text> {

        Text kword = new Text();
        Text vword = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Split the line on spaces; use the fifth field as the key
            // and the whole line as the value.
            String line = value.toString();
            String[] parts = line.split(" ");

            kword.set(parts[4]);
            vword.set(line);
            context.write(kword, vword);
        }
    }

Input to reduce is like this:
[key1]--> [value1, value2, ...]
[key2]--> [value1, value2, ...]
[key3]--> [value1, value2, ...] & so on
My interest is in [key2]--> [value1, value2, ...], ignoring the other keys and their corresponding values. Please help me out with the reducer; a rough sketch of what I have in mind is below.
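
To be concrete, this is the kind of reducer I am picturing (just a sketch; the literal "key2" stands in for the actual key I care about):

    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public static class FilterReduce extends Reducer<Text, Text, Text, Text> {

        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Keep only the key of interest ("key2" is a placeholder); drop everything else.
            if (!key.toString().equals("key2")) {
                return;
            }
            for (Text value : values) {
                context.write(key, value);
            }
        }
    }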

Vaibhav

1 Answer


Using MultipleOutputs lets you emit records to multiple files, but only to a pre-defined number/type of files, not to an arbitrary number of files, and not with an on-the-fly decision on the file name based on the key/value.
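
To illustrate what "pre-defined" means: with org.apache.hadoop.mapreduce.lib.output.MultipleOutputs you declare each named output up front in the driver and then write to it by name from the reducer. A rough sketch (the named output "filtered" and the reducer class are placeholders, not your code):

    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

    public static class FilterReducer extends Reducer<Text, Text, Text, Text> {

        private MultipleOutputs<Text, Text> mos;

        @Override
        protected void setup(Context context) {
            mos = new MultipleOutputs<Text, Text>(context);
        }

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text value : values) {
                // Writes to the named output declared in the driver via
                // MultipleOutputs.addNamedOutput(job, "filtered", TextOutputFormat.class, Text.class, Text.class);
                mos.write("filtered", key, value);
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            mos.close();
        }
    }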

You may create your own OutputFormat by extending org.apache.hadoop.mapred.lib.MultipleTextOutputFormat. Your OutputFormat class should let the output file name, as well as the folder, be decided from the key/value emitted by the reducer. This can be achieved as follows:

    package oddjob.hadoop;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

    public class MultipleTextOutputFormatByKey extends MultipleTextOutputFormat<Text, Text> {

        /**
         * Use the key as part of the path for the final output file.
         */
        @Override
        protected String generateFileNameForKeyValue(Text key, Text value, String leaf) {
            return new Path(key.toString(), leaf).toString();
        }

        /**
         * When actually writing the data, discard the key since it is already in
         * the file path.
         */
        @Override
        protected Text generateActualKey(Text key, Text value) {
            return null;
        }
    }

For more info read here.

PS: You will need to use the old mapred API to achieve this, as the newer API does not yet support MultipleTextOutputFormat. Refer to this.
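
For completeness, wiring the custom format into a job with the old API would look roughly like this (the driver class name, job name and path arguments below are placeholders):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    import oddjob.hadoop.MultipleTextOutputFormatByKey;

    public class MultipleOutputDriver {

        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(MultipleOutputDriver.class);
            conf.setJobName("output-by-key");

            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(Text.class);

            // Plug in the custom output format shown above; note that the mapper
            // and reducer set on this JobConf must also use the old mapred API.
            conf.setOutputFormat(MultipleTextOutputFormatByKey.class);

            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            JobClient.runJob(conf);
        }
    }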

Amar
  • Hi Amar, thanks much!! I am looking forward to implementing the same in the newer API. So, is there any other alternative that can be used in the newer API? I have been trying to use something like this [link](http://hadoop.apache.org/docs/r1.1.1/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html). Could you please suggest anything? Thanks. – Vaibhav Mar 12 '13 at 05:45
  • Why can't you use the older API? I guess for this specific case you should use the old API. And as you may see in the other question's link I have put in my answer, it is not yet implemented for the newer API. – Amar Mar 12 '13 at 06:23
  • Check out this question: http://stackoverflow.com/questions/15100621/multipletextoutputformat-alternative-in-new-api – Amar Mar 13 '13 at 08:58
  • Use the old API. It shouldn't be difficult to convert your code to use it, and there is no harm in doing so. – Amar Mar 13 '13 at 11:53