
gz files in one directory. I want to combine them into one big .gz file, unzip it, and load it into HDFS.

For example, the repo contains the files a.gz, b.gz, and c.gz. I want to combine them into one file called d.gz, then unzip it and load it into HDFS. The .gz files are compressed CSV files.

To unzip it, I know I can use GZIPInputStream/GZIPOutputStream, but how do I combine the files into one big file in Java?

Please guide. Thanks in advance.

Umesh K
  • You could do `cat a.gz b.gz c.gz > d.gz`, but then when you `gunzip d.gz` it's the same as just doing `cat a b c` (in other words you don't get a, b, and c individually; you get them all concatenated together). If you want a, b, and c individually, you need to use some archiving file format, like tar or zip. – Cornstalks May 09 '14 at 17:50
  • You may extend from http://stackoverflow.com/questions/2223434/appending-files-to-a-zip-file-with-java?lq=1 – Jayan May 09 '14 at 17:57
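
As Cornstalks notes, unzipping a concatenated d.gz gives the same bytes as `cat a b c`. If that concatenated CSV is the goal, one way to get it in Java is to skip the intermediate d.gz and decompress each source in order into a single output stream. The following is a minimal sketch under that assumption; the class name `GzipConcat` and its helper methods are hypothetical, and the sample data is built in memory only to keep the example self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class, not part of any library.
final class GzipConcat {

    /** Compresses bytes into gzip format (used here only to build sample input). */
    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(data);
        }
        return buf.toByteArray();
    }

    /** Decompresses each gzip source in order and appends the plain bytes to out. */
    static void gunzipAll(List<InputStream> sources, OutputStream out) throws IOException {
        byte[] chunk = new byte[8192];
        for (InputStream source : sources) {
            // Closing the GZIPInputStream also closes the underlying source.
            try (GZIPInputStream gz = new GZIPInputStream(source)) {
                int n;
                while ((n = gz.read(chunk)) != -1) {
                    out.write(chunk, 0, n);
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-ins for a.gz, b.gz, and c.gz.
        byte[] a = gzip("a,1\n".getBytes(StandardCharsets.UTF_8));
        byte[] b = gzip("b,2\n".getBytes(StandardCharsets.UTF_8));
        byte[] c = gzip("c,3\n".getBytes(StandardCharsets.UTF_8));

        ByteArrayOutputStream combined = new ByteArrayOutputStream();
        gunzipAll(Arrays.<InputStream>asList(
                new ByteArrayInputStream(a),
                new ByteArrayInputStream(b),
                new ByteArrayInputStream(c)), combined);

        // Prints the three CSV rows in the order the sources were given.
        System.out.print(new String(combined.toByteArray(), StandardCharsets.UTF_8));
    }
}
```

In real use, each source would presumably be a `FileInputStream` over one of the .gz files, and `out` could be the stream returned by Hadoop's `FileSystem.create(Path)`, so the data lands in HDFS without a temporary file. That wiring is an assumption and is not shown here.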

1 Answer


A gz file contains exactly one file. It's not meant to contain multiple files.

The best way to do this is to tar the files together and then gzip the resulting tar archive. The `tar` command has options that do both in a single operation (for example, `tar -czf`). For Java, use jtar: https://code.google.com/p/jtar/

Alternatively, a ZIP file may be what you're looking for.
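
For the ZIP route, the JDK's own `java.util.zip` package is enough. Here is a minimal sketch that packs several named files into one archive and reads them back individually. The class name `ZipBundle` and its methods are hypothetical, and the entries are held in memory purely for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper class, not part of any library.
final class ZipBundle {

    /** Writes each (name, content) pair as a separate entry of one ZIP archive. */
    static byte[] zip(Map<String, byte[]> files) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(buf)) {
            for (Map.Entry<String, byte[]> file : files.entrySet()) {
                zos.putNextEntry(new ZipEntry(file.getKey()));
                zos.write(file.getValue());
                zos.closeEntry();
            }
        }
        return buf.toByteArray();
    }

    /** Reads every entry of a ZIP archive back into a name-to-content map. */
    static Map<String, byte[]> unzip(byte[] archive) throws IOException {
        Map<String, byte[]> files = new LinkedHashMap<>();
        byte[] chunk = new byte[8192];
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(archive))) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                ByteArrayOutputStream content = new ByteArrayOutputStream();
                int n;
                // read() returns -1 at the end of the current entry.
                while ((n = zis.read(chunk)) != -1) {
                    content.write(chunk, 0, n);
                }
                files.put(entry.getName(), content.toByteArray());
            }
        }
        return files;
    }

    public static void main(String[] args) throws IOException {
        Map<String, byte[]> input = new LinkedHashMap<>();
        input.put("a.csv", "a,1\n".getBytes(StandardCharsets.UTF_8));
        input.put("b.csv", "b,2\n".getBytes(StandardCharsets.UTF_8));

        Map<String, byte[]> output = unzip(zip(input));
        for (Map.Entry<String, byte[]> file : output.entrySet()) {
            System.out.println(file.getKey() + ": " + file.getValue().length + " bytes");
        }
    }
}
```

Unlike a concatenated .gz file, the archive keeps each file's name and boundaries, so the entries can be extracted one by one.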

StilesCrisis