10

Let's say I have this structure on HDFS:

/dir1
    /dir2
        /Name1_2015/
            file1.lzo
            file2.lzo
            file3.lzo
        /Name2_2015
            file1.lzo
            file2.lzo

    Name1_2015.lzo

I would like to merge the files in each directory under 'dir2' and append the result to the corresponding file /dir1/DirName.lzo

For example, for /dir1/dir2/Name1_2015, I want to merge file1.lzo, file2.lzo, and file3.lzo and append the result to /dir1/Name1_2015.lzo

Each file is LZO-compressed.

How can I do it?

Thanks

guillaume

3 Answers

3

If you don't care much about parallelism, here's a bash one-liner:

for d in `hdfs dfs -ls /dir1/dir2 | grep -oP '(?<=/)[^/]+$'` ; do hdfs dfs -cat /dir1/dir2/$d/*.lzo | lzop -d | lzop | hdfs dfs -put - /dir1/$d.lzo ; done

You can extract all the files in parallel using map-reduce, but how do you create one archive from multiple files in parallel? As far as I know, it is not possible to write to a single HDFS file from multiple processes concurrently, so we end up with a single-node solution anyway.
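For readability, here is the same single-node pipeline written out as a script. This is only a sketch: it assumes the /dir1/dir2 layout from the question and that lzop is installed on the machine running it.

    #!/usr/bin/env bash
    # For every subdirectory of /dir1/dir2: stream all of its .lzo parts,
    # decompress them, recompress the combined stream, and write the result
    # to /dir1/<subdir>.lzo on HDFS.
    for d in $(hdfs dfs -ls /dir1/dir2 | grep -oP '(?<=/)[^/]+$'); do
      hdfs dfs -cat /dir1/dir2/$d/*.lzo \
        | lzop -d \
        | lzop \
        | hdfs dfs -put - /dir1/$d.lzo
    done

All of the data is still streamed through the one node running the script; only the storage itself stays distributed.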

Mikhail Golubtsov
  • With this script the data is pulled to a local node and then pushed back to HDFS, right? Is there a way to avoid retrieving all the data to a single node, merging, and then pushing the merged file? – guillaume Jul 27 '15 at 09:34
  • Even if I want to append, is it not possible? As it's LZO-compressed, I have to decompress the main file, append to it, and then re-compress it. I can't append LZO directly because of the headers, right? – guillaume Jul 27 '15 at 12:20
  • I was wrong about append, we can't append concurrently either - http://stackoverflow.com/questions/6389594/is-it-possible-to-append-to-hdfs-file-from-multiple-clients-in-parallel HDFS's design implies that there is only one writer per file. – Mikhail Golubtsov Jul 27 '15 at 12:55
  • Even if I use FileUtil.copyMerge to merge 2 uncompressed files, will it pull all the data to a single node before merging? – guillaume Jul 27 '15 at 13:03
  • Yes, it will be done in JVM memory; check out the source code: http://hadoop.apache.org/docs/r2.7.1/api/src-html/org/apache/hadoop/fs/FileUtil.html – Mikhail Golubtsov Jul 27 '15 at 13:49
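Following up on the append discussion above: since neither lzop nor HDFS allows appending compressed data in place, one workaround is to rewrite the whole target file. A rough, untested sketch (the .tmp name is just illustrative):

    # Rebuild /dir1/Name1_2015.lzo as: existing content + merged content of
    # the Name1_2015 subdirectory, everything recompressed in one stream.
    {
      hdfs dfs -cat /dir1/Name1_2015.lzo | lzop -d
      hdfs dfs -cat /dir1/dir2/Name1_2015/*.lzo | lzop -d
    } | lzop | hdfs dfs -put - /dir1/Name1_2015.lzo.tmp
    # Swap the rebuilt file into place.
    hdfs dfs -rm /dir1/Name1_2015.lzo
    hdfs dfs -mv /dir1/Name1_2015.lzo.tmp /dir1/Name1_2015.lzo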
2

I would do this with Hive, as follows (a combined sketch of these steps is shown after the list):

  1. Rename the subdirectories to name=1_2015 and name=2_2015

  2. CREATE EXTERNAL TABLE sending_table (all_content string)
     PARTITIONED BY (name string)
     ROW FORMAT DELIMITED FIELDS TERMINATED BY {a column delimiter that you know doesn't show up in any of the lines}
     LOCATION "/dir1/dir2";

  3. Make a second table that looks like the first, named "receiving", but with no partitions, and in a different directory.

  4. Run this:

    SET mapreduce.job.reduces=1;  -- this guarantees it'll make one file
    SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec;
    SET hive.exec.compress.output=true;
    SET mapreduce.output.fileoutputformat.compress=true;

    insert into table receiving select all_content from sending_table
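Put together, a sketch of the whole sequence driven from the shell. The tab delimiter, the /dir1/receiving directory for the second table, and running everything through hive -e are assumptions for illustration; also note that Hive generally needs the partitions registered (e.g. with MSCK REPAIR TABLE) before the SELECT sees any data.

    # Step 1: rename the subdirectories to Hive's partition naming convention.
    hdfs dfs -mv /dir1/dir2/Name1_2015 /dir1/dir2/name=1_2015
    hdfs dfs -mv /dir1/dir2/Name2_2015 /dir1/dir2/name=2_2015

    # Steps 2-4: create both tables, register the partitions, then run the
    # insert with the compression settings from step 4.
    hive -e "
    CREATE EXTERNAL TABLE sending_table (all_content string)
    PARTITIONED BY (name string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/dir1/dir2';

    MSCK REPAIR TABLE sending_table;

    CREATE TABLE receiving (all_content string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/dir1/receiving';

    SET mapreduce.job.reduces=1;
    SET hive.exec.compress.output=true;
    SET mapreduce.output.fileoutputformat.compress=true;
    SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec;

    INSERT INTO TABLE receiving SELECT all_content FROM sending_table;
    "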

Robert Rapplean
1

You can try archiving all the individual LZO files into a HAR (Hadoop Archive). I think merging all the files into a single LZO is unnecessary overhead.
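For reference, a sketch of building such an archive with the hadoop archive tool (the archive name and destination directory are just illustrative):

    # Pack the Name1_2015 subdirectory into a single HAR file under /dir1.
    hadoop archive -archiveName Name1_2015.har -p /dir1/dir2 Name1_2015 /dir1
    # The archived .lzo parts remain readable through the har:// scheme, e.g.:
    hdfs dfs -ls har:///dir1/Name1_2015.har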

Karthik