10

Let's say I have this structure on HDFS:

/dir1
    /dir2
        /Name1_2015/
            file1.lzo
            file2.lzo
            file3.lzo
        /Name2_2015
            file1.lzo
            file2.lzo

    Name1_2015.lzo

I would like to merge the files in each directory under 'dir2' and append the result to the corresponding file /dir1/DirName.lzo

For example, for /dir1/dir2/Name1_2015, I want to merge file1.lzo, file2.lzo, and file3.lzo and append the result to /dir1/Name1_2015.lzo

Each file is LZO-compressed.

How can I do it?

Thanks

guillaume

3 Answers

3

If you don't care much about parallelism, here's a bash one-liner:

for d in `hdfs dfs -ls /dir1/dir2 | grep -oP '(?<=/)[^/]+$'` ; do hdfs dfs -cat /dir1/dir2/$d/*.lzo | lzop -d | lzop | hdfs dfs -put - /dir1/$d.lzo ; done

You can extract all the files in parallel using map-reduce, but how do you create one archive from multiple files in parallel? As far as I know, it is not possible to write to a single HDFS file from multiple processes concurrently, so we end up with a single-node solution anyway.
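For readability, here is the same single-node pipeline written out as a script. This is only a sketch: it assumes the /dir1/dir2 layout from the question and that lzop is installed on the machine running it.

    #!/usr/bin/env bash
    # For every subdirectory of /dir1/dir2: stream all of its .lzo parts,
    # decompress them, recompress the combined stream, and write the result
    # to /dir1/<subdir>.lzo on HDFS.
    for d in $(hdfs dfs -ls /dir1/dir2 | grep -oP '(?<=/)[^/]+$'); do
      hdfs dfs -cat /dir1/dir2/$d/*.lzo \
        | lzop -d \
        | lzop \
        | hdfs dfs -put - /dir1/$d.lzo
    done

All of the data is still streamed through the one node running the script; only the storage itself stays distributed.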

Mikhail Golubtsov
  • With this script the data is pulled to a local node and then pushed back to HDFS, right? Is there a way to avoid retrieving all the data to a single node, merging, and then pushing the merged file? – guillaume Jul 27 '15 at 09:34
  • Even if I want to append, is it not possible? As it's LZO-compressed, I have to decompress the main file, append to it, and then re-compress it. I can't append LZO directly because of the headers, right? – guillaume Jul 27 '15 at 12:20
  • I was wrong about append, we can't append concurrently either - http://stackoverflow.com/questions/6389594/is-it-possible-to-append-to-hdfs-file-from-multiple-clients-in-parallel HDFS's design implies that there is only one writer per file. – Mikhail Golubtsov Jul 27 '15 at 12:55
  • Even if I use FileUtil.copyMerge to merge 2 uncompressed files, will it pull all the data to a single node before merging? – guillaume Jul 27 '15 at 13:03
  • Yes, it will be done in JVM memory; check out the source code: http://hadoop.apache.org/docs/r2.7.1/api/src-html/org/apache/hadoop/fs/FileUtil.html – Mikhail Golubtsov Jul 27 '15 at 13:49
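Following up on the append discussion above: since neither lzop nor HDFS allows appending compressed data in place, one workaround is to rewrite the whole target file. A rough, untested sketch (the .tmp name is just illustrative):

    # Rebuild /dir1/Name1_2015.lzo as: existing content + merged content of
    # the Name1_2015 subdirectory, everything recompressed in one stream.
    {
      hdfs dfs -cat /dir1/Name1_2015.lzo | lzop -d
      hdfs dfs -cat /dir1/dir2/Name1_2015/*.lzo | lzop -d
    } | lzop | hdfs dfs -put - /dir1/Name1_2015.lzo.tmp
    # Swap the rebuilt file into place.
    hdfs dfs -rm /dir1/Name1_2015.lzo
    hdfs dfs -mv /dir1/Name1_2015.lzo.tmp /dir1/Name1_2015.lzo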
2

I would do this with Hive, as follows (a combined sketch of these steps is shown after the list):

  1. Rename the subdirectories to name=1_2015 and name=2_2015

  2. CREATE EXTERNAL TABLE sending_table (all_content string)
     PARTITIONED BY (name string)
     ROW FORMAT DELIMITED FIELDS TERMINATED BY {a column delimiter that you know doesn't show up in any of the lines}
     LOCATION "/dir1/dir2";

  3. Make a second table that looks like the first, named "receiving", but with no partitions, and in a different directory.

  4. Run this:

    SET mapreduce.job.reduces=1;  -- this guarantees it'll make one file
    SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec;
    SET hive.exec.compress.output=true;
    SET mapreduce.output.fileoutputformat.compress=true;

    insert into table receiving select all_content from sending_table
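Put together, a sketch of the whole sequence driven from the shell. The tab delimiter, the /dir1/receiving directory for the second table, and running everything through hive -e are assumptions for illustration; also note that Hive generally needs the partitions registered (e.g. with MSCK REPAIR TABLE) before the SELECT sees any data.

    # Step 1: rename the subdirectories to Hive's partition naming convention.
    hdfs dfs -mv /dir1/dir2/Name1_2015 /dir1/dir2/name=1_2015
    hdfs dfs -mv /dir1/dir2/Name2_2015 /dir1/dir2/name=2_2015

    # Steps 2-4: create both tables, register the partitions, then run the
    # insert with the compression settings from step 4.
    hive -e "
    CREATE EXTERNAL TABLE sending_table (all_content string)
    PARTITIONED BY (name string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/dir1/dir2';

    MSCK REPAIR TABLE sending_table;

    CREATE TABLE receiving (all_content string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/dir1/receiving';

    SET mapreduce.job.reduces=1;
    SET hive.exec.compress.output=true;
    SET mapreduce.output.fileoutputformat.compress=true;
    SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec;

    INSERT INTO TABLE receiving SELECT all_content FROM sending_table;
    "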

Robert Rapplean
1

You can try archiving all the individual LZO files into a HAR (Hadoop Archive). I think merging all the files into a single LZO is unnecessary overhead.
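For reference, a sketch of building such an archive with the hadoop archive tool (the archive name and destination directory are just illustrative):

    # Pack the Name1_2015 subdirectory into a single HAR file under /dir1.
    hadoop archive -archiveName Name1_2015.har -p /dir1/dir2 Name1_2015 /dir1
    # The archived .lzo parts remain readable through the har:// scheme, e.g.:
    hdfs dfs -ls har:///dir1/Name1_2015.har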

Karthik