In MapReduce, each reduce task writes its output to a file named part-r-nnnnn, where nnnnn is a partition ID associated with the reduce task. Does MapReduce merge these files? If yes, how?
10 Answers
Instead of doing the file merging on your own, you can delegate the entire merging of the reduce output files by calling:
hadoop fs -getmerge /output/dir/on/hdfs/ /desired/local/output/file.txt
Note: this combines the HDFS files locally, so make sure you have enough local disk space before running it.

- Is there a way to do this but on the dfs? I mean, I want to merge them into a single file on the dfs? – humanzz May 22 '12 at 18:18
- It doesn't seem to work with the dfs; the merged file gets written to the local file system. Of course you could just write it back, but that seems wasteful. – Marius Soutier Aug 14 '14 at 12:31
- NB: this isn't safe with non-text files. `getmerge` does a simple concatenation of files, which with something like a SequenceFile will not give a sane output. – growse Oct 12 '14 at 12:05
- This does not work with HDFS as the destination, which is what is intended. – Gaurav Kumar Sep 16 '15 at 15:47
- getmerge brings the data from HDFS to local. – armourbear Feb 25 '17 at 02:41
- If in dfs, you can merge txt or csv files: `$ hadoop fs -cat /folder_with_files/* | hadoop fs -put - /output_folder/outputFile.dat` – maroon912 Mar 06 '18 at 09:04
- This doesn't work for the canonical Hadoop case of part*.gz files. The simple concat approach is only meant for plain text files. – jayant Jun 13 '18 at 00:17
- Yes, you can merge files in HDFS directly without getting them onto local storage. Follow this: https://stackoverflow.com/a/31188767/4401399 – Mervyn Apr 27 '20 at 06:34
No, these files are not merged by Hadoop. The number of files you get is the same as the number of reduce tasks.
If you need that as input for a next job then don't worry about having separate files. Simply specify the entire directory as input for the next job.
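For example, a minimal driver sketch (new MapReduce API) of that idea; the class name and paths are placeholders, and you would still plug in your own mapper, reducer, and key/value classes:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NextJobDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "next-job");
        job.setJarByClass(NextJobDriver.class);

        // Point the next job at the previous job's entire output directory;
        // Hadoop reads every part-r-* file in it (files starting with _ or . are skipped).
        FileInputFormat.addInputPath(job, new Path("/some/where/on/hdfs/job-output"));
        FileOutputFormat.setOutputPath(job, new Path("/some/where/on/hdfs/next-job-output"));

        // job.setMapperClass(...); job.setReducerClass(...);
        // job.setOutputKeyClass(...); job.setOutputValueClass(...);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}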
If you do need the data outside of the cluster, I usually merge it at the receiving end when pulling the data off the cluster.
I.e. something like this:
hadoop fs -cat /some/where/on/hdfs/job-output/part-r-* > TheCombinedResultOfTheJob.txt

- Thanks for your answer, but in the map/reduce config file (*mapred-default.xml*) there is an attribute named *io.sort.factor*; what is it used for? – Shahryar Apr 19 '11 at 07:06
- The io.sort.factor has to do with the processing BETWEEN the map and the reduce step, not the output of the reduce. – Niels Basjes Apr 19 '11 at 07:29
- How do you know that the order in which the part-r-* files are merged is the right one? – Razvan Apr 28 '16 at 09:55
- @Razvan: The order should not matter. If it does matter, then you have an algorithm that doesn't scale, and you apparently have assumptions about which reducer has done which part of the work. So if that happens, you have a problem of a different kind. – Niels Basjes Apr 28 '16 at 11:19
- @NielsBasjes: It is better to use "hadoop fs -getmerge" instead of "hadoop fs -cat". – Naga Jun 30 '16 at 19:14
This is a function you can use to merge files in HDFS:
// Assumes `config` is an org.apache.hadoop.conf.Configuration for the cluster and
// `logger` is a class-level logger; uses org.apache.hadoop.fs.{FileSystem, FileUtil, Path}.
public boolean getMergeInHdfs(String src, String dest) throws IllegalArgumentException, IOException {
    FileSystem fs = FileSystem.get(config);
    Path srcPath = new Path(src);
    Path dstPath = new Path(dest);

    // Both the source directory and the destination path must already exist
    if (!fs.exists(srcPath)) {
        logger.info("Path " + src + " does not exist!");
        return false;
    }
    if (!fs.exists(dstPath)) {
        logger.info("Path " + dest + " does not exist!");
        return false;
    }

    // Concatenate everything under srcPath into the single file dstPath, keeping the sources
    // (FileUtil.copyMerge is available in Hadoop 2.x; it was removed in Hadoop 3)
    return FileUtil.copyMerge(fs, srcPath, fs, dstPath, false, config, null);
}
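For completeness, a self-contained sketch of the same idea without the wrapper method, calling FileUtil.copyMerge directly (available in Hadoop 2.x; deprecated and later removed in Hadoop 3). The paths are placeholders:
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class MergeOnHdfs {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Concatenate every file under the source directory into one destination file on HDFS.
        boolean merged = FileUtil.copyMerge(
                fs, new Path("/user/hadoop/job-output"),   // source directory with part-r-* files
                fs, new Path("/user/hadoop/merged.txt"),   // single destination file
                false,                                     // do not delete the source files
                conf,
                null);                                     // no string appended after each file
        System.out.println("Merged: " + merged);
    }
}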

For text files only and HDFS as both the source and destination, use the below command:
hadoop fs -cat /input_hdfs_dir/* | hadoop fs -put - /output_hdfs_file
This will concatenate all the files in input_hdfs_dir and write the output back to HDFS at output_hdfs_file. Do keep in mind that all the data is brought back to the local system and then uploaded to HDFS again; no temporary files are created, and this happens on the fly using a UNIX pipe.
Also, this won't work with non-text files such as Avro, ORC etc.
For binary files, you could do something like this (if you have Hive tables mapped on the directories):
insert overwrite table tbl select * from tbl
Depending on your configuration, this could also create more than one file. To create a single file, either set the number of reducers to 1 explicitly using mapreduce.job.reduces=1, or set the Hive property hive.merge.mapredfiles=true.

- With this solution, also be aware of possible input getting into the final destination from stdin. Namely, I came across a situation in an HA-enabled cluster where a warning message appears when one of the nodes is in standby mode; in that situation my output contained those otherwise innocent warning messages. [link](https://s.apache.org/sbnn-error) – kasur Oct 20 '16 at 21:47
The part-r-nnnnn files are generated after the reduce phase, designated by the 'r' in between. If you have one reducer running, you will have an output file like part-r-00000; if the number of reducers is 2, then you're going to have part-r-00000 and part-r-00001, and so on. If the output file is too large to fit into the machine's memory, the file gets split, since the Hadoop framework has been designed to run on commodity machines. As per MRv1, you have a limit of 20 reducers to work on your logic; you may have more, but that needs to be customised in the configuration file mapred-site.xml.

As for your question: you may either use getmerge, or set the number of reducers to 1 by embedding the following statement in the driver code:
job.setNumReduceTasks(1);
Hope this answers your question.

You can run an additional map/reduce job in which the map and reduce don't change the data and the partitioner assigns all of the data to a single reducer.
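A sketch of what such a merge pass might look like with the new MapReduce API. It assumes the previous job wrote tab-separated text (hence KeyValueTextInputFormat); the class names are placeholders, and with one reducer the custom partitioner is technically redundant, but it mirrors the description above:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MergeJobDriver {

    // Sends every record to partition 0, i.e. a single reducer sees all the data.
    public static class SinglePartitioner extends Partitioner<Text, Text> {
        @Override
        public int getPartition(Text key, Text value, int numPartitions) {
            return 0;
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "merge-pass");
        job.setJarByClass(MergeJobDriver.class);

        // The base Mapper and Reducer classes are identity implementations:
        // they pass keys and values through unchanged.
        job.setMapperClass(Mapper.class);
        job.setReducerClass(Reducer.class);
        job.setPartitionerClass(SinglePartitioner.class);
        job.setNumReduceTasks(1);   // one reducer => one part-r-00000 output file

        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // previous job's output directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // directory for the merged output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}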

Besides my previous answer, I have one more answer for you, which I was trying a few minutes ago. You may use a custom OutputFormat, which looks like the code given below:
public class VictorOutputFormat extends FileOutputFormat<StudentKey, PassValue> {

    @Override
    public RecordWriter<StudentKey, PassValue> getRecordWriter(
            TaskAttemptContext tac) throws IOException, InterruptedException {
        // Step 1: get the current output path
        Path currPath = FileOutputFormat.getOutputPath(tac);
        // Create the full path
        Path fullPath = new Path(currPath, "Aniruddha.txt");
        // Create the file in the file system
        FileSystem fs = currPath.getFileSystem(tac.getConfiguration());
        FSDataOutputStream fileOut = fs.create(fullPath, tac);
        return new VictorRecordWriter(fileOut);
    }
}
Just have a look at the line where the full path is created: I have used my own name as the output file name, and I have tested the program with 15 reducers. Still, the file remains a single one. So getting a single output file instead of two or more is possible, but to be very clear: the size of the output file must not exceed the size of the primary memory, i.e. the output file must fit into the memory of the commodity machine, or else there might be a problem with the output file split. Thanks!!
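The VictorRecordWriter referenced above is not shown in the answer. A minimal guess at what it could look like, assuming StudentKey and PassValue (the answer's own types) have meaningful toString() implementations:
import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// Hypothetical companion class, not part of the original answer.
public class VictorRecordWriter extends RecordWriter<StudentKey, PassValue> {

    private final FSDataOutputStream out;

    public VictorRecordWriter(FSDataOutputStream out) {
        this.out = out;
    }

    @Override
    public void write(StudentKey key, PassValue value) throws IOException, InterruptedException {
        // One record per line: key <TAB> value
        out.writeBytes(key.toString() + "\t" + value.toString() + "\n");
    }

    @Override
    public void close(TaskAttemptContext context) throws IOException, InterruptedException {
        out.close();
    }
}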

- getmerge can serve your purpose, but that's an alternative; still, it's useful. – Aniruddha Sinha Oct 27 '15 at 10:22
Why not use a Pig script like this one for merging the partition files:
stuff = LOAD '/path/to/dir/*';
STORE stuff INTO '/path/to/mergedir';

If the files have a header, you can get rid of it by doing this:
hadoop fs -cat /path/to/hdfs/job-output/part-* | grep -v "header" > output.csv
Then add the header back to output.csv manually.

"Does map/reduce merge these files?"

No, it does not merge them.

You can use IdentityReducer to achieve your goal. It performs no reduction, writing all input values directly to the output:
public void reduce(K key,
                   Iterator<V> values,
                   OutputCollector<K, V> output,
                   Reporter reporter)
            throws IOException
Writes all keys and values directly to output.
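A sketch of how IdentityReducer could be wired into a driver with the old mapred API that this signature comes from. It assumes tab-separated text input (hence KeyValueTextInputFormat) and forces a single reducer so that one output file (part-00000) is produced; the class name and paths are placeholders:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class IdentityMergeDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(IdentityMergeDriver.class);
        conf.setJobName("identity-merge");

        conf.setInputFormat(KeyValueTextInputFormat.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);

        // Pass every record through unchanged, funnelled into a single reducer,
        // so the job writes exactly one output file.
        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);
        conf.setNumReduceTasks(1);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));   // previous job's output directory
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // directory for the merged output
        JobClient.runJob(conf);
    }
}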