32

I get multiple small files into my input directory which I want to merge into a single file, without using the local file system or writing mapreduce jobs. Is there a way I can do it using hadoop fs commands or Pig?

Thanks!

uHadoop

8 Answers

23

In order to keep everything on the grid, use Hadoop streaming with a single reducer and cat as both the mapper and the reducer (basically a no-op); add compression via the MR flags.

hadoop jar \
    $HADOOP_PREFIX/share/hadoop/tools/lib/hadoop-streaming.jar \
    -Dmapred.reduce.tasks=1 \
    -Dmapred.job.queue.name=$QUEUE \
    -input "$INPUT" \
    -output "$OUTPUT" \
    -mapper cat \
    -reducer cat

If you want compression, add

    -Dmapred.output.compress=true \
    -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec

Guy B
17
hadoop fs -getmerge <dir_of_input_files> <mergedsinglefile>
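
Note that -getmerge writes the merged result to the local filesystem, so if the single file needs to end up back in HDFS you would follow it with a -put (the paths here are only illustrative):

hadoop fs -getmerge /input/dir /tmp/merged.txt
hadoop fs -put /tmp/merged.txt /output/merged.txt
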
Harsha Hulageri
7

Okay... I figured out a way using hadoop fs commands:

hadoop fs -cat [dir]/* | hadoop fs -put - [destination file]

It worked when I tested it...any pitfalls one can think of?

Thanks!

kleopatra
uHadoop
  • But in this case you're downloading all the data from HDFS to the node you're running the command from (a local one?), and then uploading it back to HDFS. This is not very efficient if you have a lot of data – Vadim Jul 18 '12 at 09:16
  • Another pitfall is that occasionally you might also get some unwanted input from stdin. I came across it once in an HA-enabled cluster when some warning messages got trapped in the output. – kasur Oct 21 '16 at 10:45
4

If you set up FUSE to mount your HDFS to a local directory, then your output can be written to the mounted filesystem.

For example, I have our HDFS mounted to /mnt/hdfs locally. I run the following command and it works great:

hadoop fs -getmerge /reports/some_output /mnt/hdfs/reports/some_output.txt

Of course, there are other reasons to use fuse to mount HDFS to a local directory, but this was a nice side effect for us.
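
For reference, the mount itself is typically created with the Hadoop FUSE binary; the exact binary name and options depend on the distribution (hadoop-fuse-dfs in CDH-style packages, fuse_dfs elsewhere), so treat the following as an assumption to check against your distribution's docs (host, port, and mount point are illustrative):

hadoop-fuse-dfs dfs://namenode-host:8020 /mnt/hdfs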

Sicco
user609254
1

You can use the tool HDFSConcat, new in HDFS 0.21, to perform this operation without incurring the cost of a copy.
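
In later Hadoop releases the same primitive is exposed through the Java API as FileSystem#concat (implemented by DistributedFileSystem). A minimal sketch in Scala, assuming the target file already exists, all files live in the same directory with the same block size, and fs.defaultFS points at the HDFS namenode (paths are illustrative):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // concat is only implemented by DistributedFileSystem, so this must run
    // against a real HDFS namenode, not the local filesystem.
    val fs = FileSystem.get(new Configuration())

    // The target must already exist; the sources are appended to it block-wise
    // (no data is copied) and disappear from the namespace afterwards.
    val target = new Path("/data/input/part-00000")
    val sources = Array(new Path("/data/input/part-00001"),
                        new Path("/data/input/part-00002"))

    fs.concat(target, sources)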

Jeff Hammerbacher
  • Thanks Jeff, will look into HDFSConcat. Currently we are on 0.20.2, so I am now creating a HAR of all the files and then reading from Pig. This way data stays in HDFS. – uHadoop Oct 04 '10 at 11:52
  • I should note that this tool has limitations highlighted at https://issues.apache.org/jira/browse/HDFS-950. Files must have the same block size and be owned by the same user. – Jeff Hammerbacher Oct 05 '10 at 09:23
1

If you are working on a Hortonworks cluster and want to merge multiple files present in an HDFS location into a single file, you can run the 'hadoop-streaming-2.7.1.2.3.2.0-2950.jar' jar, which runs a single reducer and writes the merged file to the HDFS output location.

$ hadoop jar /usr/hdp/2.3.2.0-2950/hadoop-mapreduce/hadoop-streaming-2.7.1.2.3.2.0-2950.jar \
-Dmapred.reduce.tasks=1 \
-input "/hdfs/input/dir" \
-output "/hdfs/output/dir" \
-mapper cat \
-reducer cat

You can download this jar from Get hadoop streaming jar

If you are writing Spark jobs and want a single merged output file instead of many small part files (and the performance overhead that comes with them), coalesce the RDD down to one partition before saving it:

sc.textFile("hdfs://...../part*").coalesce(1).saveAsTextFile("hdfs://...../filename")

This will merge all the part files into one and save it back to the HDFS location.
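
On newer Spark versions the same idea can be expressed with the DataFrame API; a small sketch assuming a SparkSession named spark (the elided paths match the RDD example above):

spark.read.text("hdfs://...../part*").coalesce(1).write.text("hdfs://...../filename")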

hsming
khushbu kanojia
0

Addressing this from an Apache Pig perspective:

To merge two files with an identical schema via Pig, the UNION command can be used:

 A = LOAD 'tmp/file1' USING PigStorage('\t') AS ....(schema1);
 B = LOAD 'tmp/file2' USING PigStorage('\t') AS ....(schema1);
 C = UNION A, B;
 STORE C INTO 'tmp/fileoutput' USING PigStorage('\t');
0

All the solutions are equivalent to doing a

hadoop fs -cat [dir]/* > tmp_local_file
hadoop fs -copyFromLocal tmp_local_file [destination file]

It only means that the local machine's I/O is on the critical path of data transfer.

mrsrinivas
samurai