MultipleTextOutputFormat alternative in new API

Question

As it stands, MultipleTextOutputFormat has not been migrated to the new API. So if we need to choose an output directory and output filename on the fly, based on the key-value pair being written, what alternative do we have with the new mapreduce API?

score 4 · Accepted Answer · answered Oct 11 '13 at 17:32

I'm using AWS EMR Hadoop 1.0.3, and it is possible to specify different directories and files based on k/v pairs. Use either of the following functions from the MultipleOutputs class:

public void write(KEYOUT key, VALUEOUT value, String baseOutputPath)

or

public <K,V> void write(String namedOutput, K key, V value,
                        String baseOutputPath)

The former write method requires the key to be the same type as the map output key (in case you are using this in the mapper) or the same type as the reduce output key (in case you are using this in the reducer). The value must also be typed in similar fashion.

The latter write method requires the key/value types to match the types specified when you set up the MultipleOutputs static properties using the addNamedOutput function:

public static void addNamedOutput(Job job,
                              String namedOutput,
                              Class<? extends OutputFormat> outputFormatClass,
                              Class<?> keyClass,
                              Class<?> valueClass)

So if you need different output types than the Context is using, you must use the latter write method.

The trick to getting different output directories is to pass a baseOutputPath that contains a directory separator, like this:

multipleOutputs.write("output1", key, value, "dir1/part");

In my case, this created files named "dir1/part-r-00000".

I was not successful in using a baseOutputPath that contains the .. directory, so all baseOutputPaths are strictly contained in the path passed to the -output parameter.
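As a rough sketch of how a baseOutputPath could be derived from the key at runtime, the helper below builds a "dir/part"-style string and replaces any character that could escape the job output directory (including `.` and `/`, so `..` can never appear). The class name `BaseOutputPaths`, the method `forKey`, and the allowed character set are all hypothetical choices for illustration; only the "dir1/part" convention comes from the answer above. It uses plain Java so it can be tried without a cluster.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: builds a baseOutputPath for MultipleOutputs.write from record data. */
final class BaseOutputPaths {
    // Assumption: only these characters are kept in a directory name; all others become '_'.
    private static final Pattern UNSAFE = Pattern.compile("[^A-Za-z0-9_-]");

    private BaseOutputPaths() {}

    /**
     * Returns "<dir>/<filePrefix>", where dir is the sanitized key.
     * Because '.' and '/' are replaced, the result always stays inside the job output path.
     */
    static String forKey(String key, String filePrefix) {
        String dir = UNSAFE.matcher(key).replaceAll("_");
        if (dir.isEmpty()) {
            throw new IllegalArgumentException("key yields an empty directory name");
        }
        return dir + "/" + filePrefix;
    }

    public static void main(String[] args) {
        System.out.println(forKey("dir1", "part"));   // dir1/part
        System.out.println(forKey("../etc", "part")); // ___etc/part
        // Inside a reducer, the result would be passed as the last argument, e.g.:
        // multipleOutputs.write("output1", key, value, forKey(key.toString(), "part"));
    }
}
```

With a path like "dir1/part", the framework still appends its own suffix, which is how the "dir1/part-r-00000" file name above comes about.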

For more details on how to set up and properly use MultipleOutputs, see this code (not mine, but I found it very helpful; it does not use different output directories): https://github.com/rystsov/learning-hadoop/blob/master/src/main/java/com/twitter/rystsov/mr/MultipulOutputExample.java

I forgot to mention that I tested varying the `baseOutputPath` based on key/value data, and it successfully output to different files. — Eddified, Oct 11 '13 at 17:34
I am glad you mentioned it, I had found it out eventually though :) — Amar, Oct 15 '13 at 18:52
Thanks a lot for the "dir1/part" part, wouldn't have thought of that! — ssgao, May 01 '14 at 02:21

score 0 · Answer 2 · edited May 23 '17 at 12:32

Similar to: Hadoop Reducer: How can I output to multiple directories using speculative execution?

Basically, you can write to HDFS directly from your reducer. You'll need to be wary of speculative execution and name your files uniquely, and then implement your own OutputCommitter to clean up the aborted attempts. This is the most difficult part if you have truly dynamic output folders: you'll need to step through each folder and delete the attempts associated with aborted or failed tasks. A simple solution to this is to turn off speculative execution.
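To make the naming and cleanup idea a little more concrete, here is a minimal plain-Java sketch of the bookkeeping only. It assumes each file name embeds the task attempt ID and that attempt IDs start with `attempt_` (the string form Hadoop uses for task attempts); the class `AttemptFiles` and both method names are hypothetical. It does not touch HDFS or implement an OutputCommitter; a real committer would list the files in each output folder and delete the ones this logic selects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: unique per-attempt file names and selection of stale files. */
final class AttemptFiles {
    private AttemptFiles() {}

    /** Embeds the attempt ID so two attempts of the same task never write the same file. */
    static String fileNameFor(String baseName, String attemptId) {
        return baseName + "-" + attemptId;
    }

    /** Returns the files whose embedded attempt ID is not in the committed set. */
    static List<String> filesToDelete(List<String> fileNames, Set<String> committedAttemptIds) {
        List<String> stale = new ArrayList<>();
        for (String name : fileNames) {
            int idx = name.lastIndexOf("-attempt_");
            if (idx < 0) {
                continue; // not produced by fileNameFor; leave it alone
            }
            String attemptId = name.substring(idx + 1);
            if (!committedAttemptIds.contains(attemptId)) {
                stale.add(name);
            }
        }
        return stale;
    }

    public static void main(String[] args) {
        String first = fileNameFor("data", "attempt_201302270112_0001_r_000003_0");
        String second = fileNameFor("data", "attempt_201302270112_0001_r_000003_1");
        Set<String> committed =
            new HashSet<>(Arrays.asList("attempt_201302270112_0001_r_000003_1"));
        // Only the file from the attempt that was not committed is selected for deletion.
        System.out.println(filesToDelete(Arrays.asList(first, second), committed));
    }
}
```

In a reducer, the attempt ID would presumably come from the task context (for example `context.getTaskAttemptID().toString()`), and the same walk would have to be repeated for every dynamically created folder.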

answered Feb 27 '13 at 01:12

Chris White

This doesn't sound simple :P Any workaround for MultipleTextOutputFormat? Or can we implement something like MultipleTextOutputFormat using the new API? – Amar Feb 27 '13 at 09:56
As stated in the Javadoc of MultipleOutputs, I added the code below to my job and reducer, and it works fine. In the job: `MultipleOutputs.addNamedOutput(job, namedoutputstring, outputformatclass, keyclass, valueclass);` In the reducer: `mos = new MultipleOutputs(context); ... /* calculated at runtime */ baseOutput = "abc/xyz/filename"; mos.write(key, value, baseOutput);` – techuser soma Mar 06 '13 at 19:34
Don't forget `mos.close()` in `cleanup()`. – Judge Mental Oct 11 '13 at 20:44

Eswara Reddy Adapa · Answer 3 · 2013-02-27T05:27:10.597

-1

For the best answer, turn to Hadoop: The Definitive Guide, 3rd Ed. (starting on p. 253).

An excerpt from the book:

"In the old MapReduce API, there are two classes for producing multiple outputs: MultipleOutputFormat and MultipleOutputs. In a nutshell, MultipleOutputs is more fully featured, but MultipleOutputFormat has more control over the output directory structure and file naming. MultipleOutputs in the new API combines the best features of the two multiple output classes in the old API."

It has an example of how you can control the directory structure, file naming, and output format using the MultipleOutputs API.

HTH.

edited Feb 27 '13 at 05:27

answered Feb 27 '13 at 05:19

Does MultipleOutputs allow deciding the output folder name or file name on the fly, based on key-value pairs? I don't think so; if there is a way, kindly let me know. – Amar Feb 27 '13 at 09:30
Yeah dude, and I couldn't find it! Using MultipleOutputs you can only write to a *set* of *pre-defined* filepaths, and you do this by using `MultipleOutputs.addNamedOutput()` in your `run()`. It's possible that I am missing something here, but rather than making such statements, if it is *easily found elsewhere*, you could have at least posted a link to it. – Amar Feb 27 '13 at 16:24
I also doubt that you have used either MultipleTextOutputFormat or MultipleOutputs! Reading the book, it clearly states, just before the example: *In comparison to MultipleTextOutputFormat there is less control over the naming of the outputs while using MultipleOutputs*. – Amar Feb 27 '13 at 16:30
