2

I have a Pig script that generates some output to an HDFS directory. The Pig script also generates a SUCCESS file in the same HDFS directory. The output of the Pig script is split into multiple parts, since the number of reducers to use is defined in the script via 'SET default_parallel n;'

I would now like to use Java to concatenate/merge all the file parts into a single file. I obviously want to ignore the SUCCESS file while concatenating. How can I do this in Java?

Thanks in advance.

activelearner
  • You may be interested in this other question: http://stackoverflow.com/questions/12911798/hadoop-how-can-i-merge-reducer-outputs-to-a-single-file – frb Apr 22 '15 at 19:54

2 Answers

3

You can use getmerge through a shell command to merge multiple files into a single file on the local file system.

Usage: hdfs dfs -getmerge <srcdir> <local destination file>

Example: hdfs dfs -getmerge /output/dir/on/hdfs/ /desired/local/output/file.txt

In case you don't want to use a shell command to do it, you can write a Java program that uses the FileUtil.copyMerge method to merge the output files into a single file. The implementation details are available in this link, and a sketch follows below.
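A minimal sketch of that approach, assuming Hadoop 2.x (FileUtil.copyMerge was removed in Hadoop 3, where you would have to loop over the part files yourself); the paths are placeholders for your own directories:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class MergeParts {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path srcDir  = new Path("/pig/output/dir");        // placeholder: directory holding the part-* files
        Path dstFile = new Path("/pig/output/merged.txt"); // placeholder: merged result

        // copyMerge concatenates every file under srcDir into dstFile.
        // The SUCCESS marker is zero bytes, so it adds nothing to the merged
        // output; delete it first if you want to exclude it explicitly.
        FileUtil.copyMerge(fs, srcDir, fs, dstFile,
                false,  // deleteSource: keep the original part files
                conf,
                null);  // addString: optional text appended after each file
    }
}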

If you want a single output file on HDFS itself from Pig, then you need to pass everything through a single reducer. To do that, set the number of reducers to 1 by putting the line below at the start of your script.

--Assigning only one reducer in order to generate only one output file.
SET default_parallel 1;

I hope this will help you.

Sandeep Singh
  • +1 for getmerge. But I think it only works for plain text files? So as soon as you're using compressed output or avro or parquet files, this will most likely not work. – LiMuBei Apr 23 '15 at 08:58
1

The reason this does not seem easy to do is that there would typically be little purpose. If I have a very large cluster and am really dealing with a Big Data problem, my output as a single file would probably not fit onto any single machine.

That being said, I can see use cases such as metrics collection, where maybe you just want to output some metrics about your data, like counts.

In that case I would first run your MapReduce program, then create a second map/reduce job that reads the data and funnels all the elements to the same single reducer by using a static key with your map function, as in the mapper sketch below.
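A minimal sketch of that second pass, assuming plain text input; the class name and key value are placeholders:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical pass-through mapper: every record is emitted under the same
// static key, so the shuffle routes all values to a single reduce() call.
public class StaticKeyMapper extends Mapper<LongWritable, Text, Text, Text> {
    private static final Text STATIC_KEY = new Text("all");

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        context.write(STATIC_KEY, line);
    }
}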

Or you could just run your original program with a single reducer via job.setNumReduceTasks(1); (note the Hadoop Job method is setNumReduceTasks, not setNumberOfReducer), as in the driver sketch below.
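A hedged driver sketch showing that call; the job name and argument handling are placeholders, and your own mapper/reducer classes would be registered where the comment indicates (with none set, Hadoop falls back to its identity mapper and reducer):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SingleOutputDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "single-output"); // placeholder name

        // job.setMapperClass(...); job.setReducerClass(...); etc., as in your original program.

        // The key line: route all map output through one reducer so that a
        // single part file is produced in the output directory.
        job.setNumReduceTasks(1);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}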

Dan Ciborowski - MSFT