As it stands, MultipleTextOutputFormat has not been migrated to the new API. So if we need to choose an output directory and output filename on the fly, based on the key-value pair being written, what alternative do we have with the new mapreduce API?
3 Answers
I'm using AWS EMR Hadoop 1.0.3, and it is possible to specify different directories and files based on k/v pairs. Use either of the following functions from the `MultipleOutputs` class:
public void write(KEYOUT key, VALUEOUT value, String baseOutputPath)
or
public <K,V> void write(String namedOutput, K key, V value, String baseOutputPath)
The former `write` method requires the key to be the same type as the map output key (if you are using it in the mapper) or the same type as the reduce output key (if you are using it in the reducer). The value must be typed in the same fashion.
The latter `write` method requires the key/value types to match the types specified when you set up the `MultipleOutputs` static properties using the `addNamedOutput` function:
public static void addNamedOutput(Job job,
String namedOutput,
Class<? extends OutputFormat> outputFormatClass,
Class<?> keyClass,
Class<?> valueClass)
So if you need different output types than the `Context` is using, you must use the latter `write` method.
The trick to getting different output directories is to pass a `baseOutputPath` that contains a directory separator, like this:
multipleOutputs.write("output1", key, value, "dir1/part");
In my case, this created files named "dir1/part-r-00000".
I was not successful in using a `baseOutputPath` that contains the `..` directory, so all `baseOutputPath`s are strictly contained in the path passed to the `-output` parameter.
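To make the "vary the path per record" idea concrete, here is a minimal sketch of deriving a `baseOutputPath` from a key. `BaseOutputPaths` and `forKey` are hypothetical names, not part of Hadoop, and the sanitizing rule (replace anything other than letters, digits, `_` and `-`) is an assumption chosen so that a key can never introduce `/` or `..` into the path. It only builds the string; the actual write is still done by `MultipleOutputs`.

```java
import java.util.regex.Pattern;

/**
 * Hypothetical helper (not part of Hadoop): derives a relative baseOutputPath
 * from a record key, intended as the last argument of MultipleOutputs.write(...).
 */
final class BaseOutputPaths {
    // Assumption: any character outside this set is replaced with "_",
    // so a key cannot inject a directory separator or a ".." component.
    private static final Pattern UNSAFE = Pattern.compile("[^A-Za-z0-9_-]");

    private BaseOutputPaths() {}

    /** Returns "<sanitized key>/part", e.g. "dir1/part" for the key "dir1". */
    static String forKey(String key) {
        if (key == null || key.isEmpty()) {
            throw new IllegalArgumentException("key must be non-empty");
        }
        String dir = UNSAFE.matcher(key).replaceAll("_");
        return dir + "/part";
    }

    public static void main(String[] args) {
        // In a reducer this would be used roughly as:
        //   multipleOutputs.write(key, value, BaseOutputPaths.forKey(key.toString()));
        System.out.println(forKey("dir1"));       // dir1/part
        System.out.println(forKey("../escape"));  // ___escape/part
    }
}
```

With a path such as `dir1/part`, the task suffix is appended by the framework, as in the `dir1/part-r-00000` file name mentioned above.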
For more details on how to set up and properly use MultipleOutputs, see this code I found (not mine, but I found it very helpful; it does not use different output directories): https://github.com/rystsov/learning-hadoop/blob/master/src/main/java/com/twitter/rystsov/mr/MultipulOutputExample.java

-
I forgot to mention that I tested varying the `baseOutputPath` based on key/value data, and it successfully output to different files. – Eddified Oct 11 '13 at 17:34
-
I am glad you mentioned it, I had found it out eventually though :) – Amar Oct 15 '13 at 18:52
-
Thanks a lot for the "dir1/part" part, wouldn't have thought of that! – ssgao May 01 '14 at 02:21
Similar to: Hadoop Reducer: How can I output to multiple directories using speculative execution?
Basically you can write to HDFS directly from your reducer. You'll just need to be wary of speculative execution and name your files uniquely, and then you'll need to implement your own OutputCommitter to clean up the aborted attempts (this is the most difficult part if you have truly dynamic output folders: you'll need to step through each folder and delete the attempts associated with aborted or failed tasks). A simple solution to this is to turn off speculative execution.

-
This doesn't sound simple :P Any workaround for MultipleTextOutputFormat? Or can we implement something like MultipleTextOutputFormat using the new API? – Amar Feb 27 '13 at 09:56
-
As stated in the javadoc of MultipleOutputs, I added the code below to my job and reducer, and it works fine. In the job: `MultipleOutputs.addNamedOutput(job, namedoutputstring, outputformatclass, keyclass, valueclass);` In the reducer: `mos = new MultipleOutputs(context); ... /* calculated at runtime */ baseOutput = "abc/xyz/filename"; mos.write(key, value, baseOutput);` – techuser soma Mar 06 '13 at 19:34
For the best answer, turn to Hadoop: The Definitive Guide, 3rd Ed. (starting on p. 253).
An excerpt from the book:
"In the old MapReduce API, there are two classes for producing multiple outputs: MultipleOutputFormat and MultipleOutputs. In a nutshell, MultipleOutputs is more fully featured, but MultipleOutputFormat has more control over the output directory structure and file naming. MultipleOutputs in the new API combines the best features of the two multiple output classes in the old API."
It has an example of how you can control the directory structure, file naming, and output format using the MultipleOutputs API.
HTH.

-
Does MultipleOutputs allow you to decide the output folder name or file name on the fly, based on key-value pairs? I don't think so; if there is a way, kindly let me know. – Amar Feb 27 '13 at 09:30
-
Yeah dude, and I couldn't find it! Using MultipleOutputs you can only write to a *set* of *pre-defined* filepaths, and you do this by using `MultipleOutputs.addNamedOutput()` in your `run()`. It's possible that I am missing something here, but rather than making such statements, if it is *easily found elsewhere*, you could have at least posted a link to it. – Amar Feb 27 '13 at 16:24
-
I also doubt that you have used either MultipleTextOutputFormat or MultipleOutputs! The book clearly states, just before the example: *In comparison to MultipleTextOutputFormat there is less control over the naming of the outputs while using MultipleOutputs*. – Amar Feb 27 '13 at 16:30