
I know that the "getmerge" shell command can do this work.

But what should I do if I want to merge these outputs after the job via the HDFS API for Java?

What I actually want is a single merged file on HDFS.

The only thing I can think of is to start an additional job after that.

Thanks!

thomaslee

2 Answers


But what should I do if I want to merge these outputs after the job via the HDFS API for Java?

Guessing, because I haven't tried this myself, but I think the method you are looking for is FileUtil.copyMerge, which is what FsShell invokes when you run the -getmerge command. FileUtil.copyMerge takes two FileSystem objects as arguments: FsShell uses FileSystem.getLocal to retrieve the destination FileSystem, but I don't see any reason you couldn't instead use Path.getFileSystem on the destination path to obtain the HDFS FileSystem.
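For reference, a minimal sketch of that HDFS-to-HDFS merge (class name and paths are hypothetical; FileUtil.copyMerge existed in Hadoop 1.x/2.x but was removed in 3.0):

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class MergeOnHdfs {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);

        // Pass the same HDFS FileSystem as both source and destination,
        // so the merged file lands on HDFS instead of the local disk.
        boolean merged = FileUtil.copyMerge(
                hdfs, new Path("/user/hadoop/output"),      // source dir of part files
                hdfs, new Path("/user/hadoop/merged_file"), // destination file
                false,  // deleteSource: keep the original part files
                conf,
                null);  // addString: nothing inserted between files
        System.out.println("merged: " + merged);
    }
}
```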

That said, I don't think it wins you very much: the merge still happens in the local JVM, so you aren't really saving much over -getmerge followed by -put.

VoiceOfUnreason
  • Thanks for your answer. I have just tried it like this: `String srcPath = "/user/hadoop/output"; String dstPath = "/user/hadoop/merged_file"; Configuration conf = new Configuration(); try { FileSystem hdfs = FileSystem.get(conf); FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath), false, conf, null); } catch (IOException e) { }`. That successfully merged the output files into a single file on HDFS, and the order is just as I expected. But I have another question now: how does the function know the file order? – thomaslee Oct 17 '12 at 03:12
  • Here's the implementation of copyMerge: http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/com.cloudera.hadoop/hadoop-core/0.20.2-320/org/apache/hadoop/fs/FileUtil.java#FileUtil.copyMerge%28org.apache.hadoop.fs.FileSystem%2Corg.apache.hadoop.fs.Path%2Corg.apache.hadoop.fs.FileSystem%2Corg.apache.hadoop.fs.Path%2Cboolean%2Corg.apache.hadoop.conf.Configuration%2Cjava.lang.String%29 It looks like it's all down to the ordering of the items returned by the FileSystem's listStatus method. I'd guess that your output files are just concatenated together. – Ben McCracken Oct 18 '12 at 19:12
  • @Thomas, Ben: I am trying to merge files from my reducer's output using FileUtil.copyMerge. However, I have a question here: the source directory contains _SUCCESS and _log files too, apart from part-r-00000 and part-r-00001. Does copyMerge take in only reducer output files, or should I explicitly filter which files have to be merged? If yes, how can I do that? Thanks. – Nikhil Das Nomula Nov 15 '12 at 14:55
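Regarding the filtering question in the last comment: copyMerge itself accepts no filter, so one option is a hand-rolled merge using FileSystem.listStatus with a PathFilter. A sketch under that assumption (class name, paths, and the part- prefix check are illustrative):

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;
import org.apache.hadoop.io.IOUtils;

public class FilteredMerge {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path srcDir = new Path("/user/hadoop/output");       // hypothetical
        Path dstFile = new Path("/user/hadoop/merged_file"); // hypothetical

        // Accept only reducer outputs; this skips _SUCCESS and _logs.
        PathFilter partsOnly = new PathFilter() {
            public boolean accept(Path p) {
                return p.getName().startsWith("part-");
            }
        };

        FSDataOutputStream out = fs.create(dstFile);
        try {
            // listStatus ordering is what copyMerge relies on too; sort the
            // FileStatus array here if a deterministic order is required.
            for (FileStatus st : fs.listStatus(srcDir, partsOnly)) {
                FSDataInputStream in = fs.open(st.getPath());
                try {
                    IOUtils.copyBytes(in, out, conf, false); // false: keep out open
                } finally {
                    in.close();
                }
            }
        } finally {
            out.close();
        }
    }
}
```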

You can get a single output file by setting a single reducer in your code:

job.setNumReduceTasks(1);

This will work for your requirement, but it is costly: all of the map output has to pass through a single reduce task.
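For context, a bare-bones driver sketch showing where that call goes (class and path names are hypothetical; mapper and reducer setup omitted; assumes the Hadoop 2.x mapreduce API):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SingleReducerDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "single-reducer job");
        job.setJarByClass(SingleReducerDriver.class);
        // Mapper/Reducer classes would be set here as usual.
        // One reduce task => exactly one part-r-00000 output file,
        // but every map output is funnelled through that single task.
        job.setNumReduceTasks(1);
        FileInputFormat.addInputPath(job, new Path("/user/hadoop/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/hadoop/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```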


OR


org.apache.hadoop.util.Shell.execCommand is a static method to execute a shell command. It covers most of the simple cases without requiring the user to implement the Shell interface.

Parameters:
  • env: the map of environment key=value pairs
  • cmd: the shell command to execute

Returns: the output of the executed command.
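A sketch of how that might look (class name and paths are hypothetical; assumes the hadoop binary is on the PATH; note that -getmerge writes to the local filesystem, so a -put is still needed to get the merged file back onto HDFS):

```java
import java.io.IOException;

import org.apache.hadoop.util.Shell;

public class GetMergeViaShell {
    public static void main(String[] args) throws IOException {
        // Merge the HDFS part files into one LOCAL file...
        String out = Shell.execCommand(
                "hadoop", "fs", "-getmerge", "/user/hadoop/output", "/tmp/merged");
        System.out.println(out);

        // ...then push the merged file back onto HDFS.
        Shell.execCommand(
                "hadoop", "fs", "-put", "/tmp/merged", "/user/hadoop/merged_file");
    }
}
```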
saurabh shashank
  • Thanks for your answer. That indeed works, but is costly as you say. Is there a way to merge them via the HDFS API? – thomaslee Oct 16 '12 at 10:03
  • I would even go with your choice of another job for it. Or: I have edited the answer. – saurabh shashank Oct 16 '12 at 10:21
  • Yeah, maybe starting another job is better. I will also try execCommand before making a choice. Thank you very much! – thomaslee Oct 16 '12 at 10:35
  • Great answer. It's helpful if you want to prepare a compressed Avro file for some external system. For example, I process 5 JSON files of 1 GB each and reduce the output to one Avro file compressed with XZ to 100 MB. Otherwise I would get 5 Avro files of 50 MB each, ~250 MB total. – Viacheslav Dobromyslov Oct 25 '14 at 06:47