
I have a directory that contains lots of files and subdirectories that I want to compress and export from HDFS to the local filesystem.

I came across this question (Hadoop: compress file in HDFS?), but it seems to be relevant only to files, and using hadoop-streaming with the GzipCodec gave me no success with directories.

What is the most efficient way to compress an HDFS folder into a single gzip file?
Thanks in advance.

Elad Leev
    You can't `gzip` a directory even on Unix's FS. You need to first convert it to a `tar/har` or something like that and then perform compression. – philantrovert May 29 '17 at 14:11
  • @philantrovert Of course, but do you have any advice about how to do so? – Elad Leev May 29 '17 at 14:30
  • I'd suggest writing a Java program using the Apache Commons API. It has classes like `TarArchiveOutputStream` which you can look into. – philantrovert May 29 '17 at 16:23
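
A minimal sketch of what philantrovert's suggestion might look like, assuming the Hadoop client libraries and Apache Commons Compress are on the classpath. The class name `HdfsDirToTarGz`, the argument handling, and the omission of error handling and empty directories are illustrative choices, not part of the original suggestion:

```java
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveOutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsDirToTarGz {

    public static void main(String[] args) throws IOException {
        Path srcDir = new Path(args[0]);   // HDFS directory to archive
        String destTarGz = args[1];        // .tar.gz path on the local filesystem

        Configuration conf = new Configuration();
        FileSystem fs = srcDir.getFileSystem(conf);

        // tar stream wrapped in a gzip stream -> a standard .tar.gz file
        try (TarArchiveOutputStream tar = new TarArchiveOutputStream(
                new GZIPOutputStream(new BufferedOutputStream(new FileOutputStream(destTarGz))))) {
            tar.setLongFileMode(TarArchiveOutputStream.LONGFILE_GNU);
            addRecursively(fs, srcDir, srcDir.getName(), tar);
        }
    }

    // Walks the HDFS tree and streams each file into the tar archive.
    // Empty directories are skipped to keep the sketch short.
    private static void addRecursively(FileSystem fs, Path path, String entryName,
                                       TarArchiveOutputStream tar) throws IOException {
        FileStatus status = fs.getFileStatus(path);
        if (status.isDirectory()) {
            for (FileStatus child : fs.listStatus(path)) {
                addRecursively(fs, child.getPath(),
                        entryName + "/" + child.getPath().getName(), tar);
            }
        } else {
            TarArchiveEntry entry = new TarArchiveEntry(entryName);
            entry.setSize(status.getLen());
            tar.putArchiveEntry(entry);
            try (FSDataInputStream in = fs.open(path)) {
                IOUtils.copyBytes(in, tar, 4096, false);
            }
            tar.closeArchiveEntry();
        }
    }
}
```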

2 Answers


You will need a library, or you will have to roll your own code, to make a tar stream out of the files in a directory structure. You can use zlib to compress the tar stream to make a standard .tar.gz file.

The two tidbits I can provide here if you want to merge the results of multiple such tasks are: 1) you can concatenate gzip streams to make valid gzip streams, and 2) you can concatenate tar streams to make a valid tar stream if you remove the final 1024 zero bytes from the non-final tar streams.
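
As a hedged illustration of the second tidbit (the class name `TarConcat` and its argument handling are hypothetical, not part of the answer): every part except the last can be trimmed of its final 1024 bytes, the end-of-archive marker of two 512-byte zero blocks, before the parts are written out back to back. This assumes each part ends exactly with that 1024-byte marker; some tar writers pad the archive further. For the first tidbit nothing special is needed, since gzip members can be concatenated byte for byte.

```java
import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper: concatenate plain (uncompressed) .tar parts into one
// valid tar stream, dropping the 1024-byte end-of-archive marker from every
// part except the last one, as described in the answer above.
public class TarConcat {

    public static void main(String[] args) throws IOException {
        // Usage: TarConcat merged.tar part1.tar part2.tar ...
        try (OutputStream out = new BufferedOutputStream(new FileOutputStream(args[0]))) {
            for (int i = 1; i < args.length; i++) {
                boolean lastPart = (i == args.length - 1);
                long size = Files.size(Paths.get(args[i]));
                // Non-final parts lose their trailing two 512-byte zero blocks.
                long bytesToCopy = lastPart ? size : size - 1024;
                copy(args[i], out, bytesToCopy);
            }
        }
    }

    // Copies the first bytesToCopy bytes of the named file to out.
    private static void copy(String file, OutputStream out, long bytesToCopy)
            throws IOException {
        try (InputStream in = new FileInputStream(file)) {
            byte[] buf = new byte[8192];
            long copied = 0;
            while (copied < bytesToCopy) {
                int n = in.read(buf, 0, (int) Math.min(buf.length, bytesToCopy - copied));
                if (n < 0) {
                    break;
                }
                out.write(buf, 0, n);
                copied += n;
            }
        }
    }
}
```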

Mark Adler

For a quick-and-dirty solution, for those of you who don't want to use hadoop-streaming or any MapReduce job for this, I used FUSE to mount HDFS and then performed the actions on it as on a traditional filesystem.
Note that you probably don't want to use this as a permanent solution, only as a quick win :)
Further reading:
* https://hadoop.apache.org/docs/r1.2.1/streaming.html
* http://www.javased.com/index.php?api=org.apache.hadoop.io.compress.GzipCodec

Elad Leev