
I'm trying to retrieve a directory of text files that total several gigabytes from Hadoop HDFS. I can do this with

    hadoop hdfs -get /path/to/directory/* .

But my link to the Hadoop cluster is about 1 MB/s, so that takes quite a while. Like most text files, these compress very well, so I would like them to be compressed for the download. Does the `hadoop hdfs -get` command automatically compress during transit (the way HTTP and many other protocols can)?

If not, what is the most straightforward way to get the files using compression? If it matters, the cluster is running CDH 4.5.0 and I do not have any kind of administrator rights on the cluster.

I've found this question, but it is about compressing a file to keep in HDFS, and it seems like there ought to be a way to compress the bytes in transit without creating, getting, and then deleting a compressed copy. From my understanding of typical Hadoop usage, getting and putting very large text files ought to be a common use case, and it's well established that text files compress well.

I'll also accept an answer that shows that this is a documented missing feature that has either been purposefully left out of Hadoop or is expected to be added in some future release.

Tom Panning

2 Answers


I believe the assumption is that most people already use file-level compression in HDFS, so applying transport-level compression would not gain you anything.

You also have to be careful not to use certain types of compression, because then you couldn't easily split the files for input to MapReduce jobs. You want to use either Snappy or LZO, since those produce "splittable" input files, whereas gzip does not.

I'm sure that if you were willing to provide a patch to Hadoop, they would be willing to accept a change that supported compression in `-get` (and maybe `-put` as well), assuming it was optional.

The implementation for `-get` is found in `CopyCommands.java`. You can see that it uses `IOUtils.copyBytes` to do the copying on an `FSDataOutputStream`. You would need to layer in the compression at that point, but it is currently not done.
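
As a rough illustration (not an actual patch), here is a minimal standalone sketch of that idea using the public FileSystem API instead of a modified `CopyCommands`. The class name and arguments are hypothetical; note that, run from a client, the gzip layer only compresses the copy written to local disk, since the bytes coming from the datanodes still cross the network uncompressed.

    import java.io.FileOutputStream;
    import java.io.OutputStream;
    import java.util.zip.GZIPOutputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    // Hypothetical sketch: the same copy loop -get performs, with a gzip
    // layer wrapped around the local output stream.
    public class GzipGet {
        public static void main(String[] args) throws Exception {
            // Assumes fs.defaultFS points at the cluster (e.g. via client config files).
            FileSystem fs = FileSystem.get(new Configuration());
            FSDataInputStream in = fs.open(new Path(args[0]));       // HDFS source file
            OutputStream out =
                new GZIPOutputStream(new FileOutputStream(args[1])); // local .gz destination
            // Same helper the shell commands use; 'true' closes both streams when done.
            IOUtils.copyBytes(in, out, 4096, true);
        }
    }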

However, it would probably be better to provide transparent compression in HDFS similar to how MapR provides it.

b4hand
  • It looks like someone else has already proposed transparent compression: https://issues.apache.org/jira/browse/HDFS-2115 but it doesn't look like that ticket is getting much activity. – Tom Panning May 02 '14 at 01:49
  • The easiest way to get action on a ticket is to provide a patch. – b4hand May 02 '14 at 15:03
  • I'm not up to adding transparent compression. But I should be able to add optional compression for `-get` and `-put`, so I added a ticket for that https://issues.apache.org/jira/browse/HDFS-6323 – Tom Panning May 02 '14 at 16:52

Since your bandwidth is low, the compression has to take place before the files reach your local machine. You need to run a MapReduce job with LZO or any other compression codec configured on your cluster; that way you get compressed output, which you can then download. And because the job runs in the cluster, it will be fast, since it takes advantage of data locality.
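
For illustration, a minimal sketch of that kind of identity job might look like the following: a map-only pass-through that rewrites the input text files as compressed output inside the cluster. The class names are hypothetical, and it uses the built-in `GzipCodec` rather than LZO (which needs extra libraries installed); swap in whatever codec your cluster actually has configured.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Hypothetical example: a map-only job whose only purpose is to rewrite
    // the input text files as compressed output inside the cluster.
    public class CompressForDownload {

        public static class PassThroughMapper
                extends Mapper<LongWritable, Text, NullWritable, Text> {
            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                // Emit each line unchanged; NullWritable keys keep TextOutputFormat
                // from prepending byte offsets to every line.
                context.write(NullWritable.get(), line);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "compress-for-download");
            job.setJarByClass(CompressForDownload.class);
            job.setMapperClass(PassThroughMapper.class);
            job.setNumReduceTasks(0);                                // map-only: one output file per input split
            job.setOutputKeyClass(NullWritable.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. the HDFS directory of text files
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // staging directory for compressed copies
            FileOutputFormat.setCompressOutput(job, true);           // compress in the cluster...
            FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class); // ...using gzip in this sketch
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Once the job finishes, you can `hadoop fs -get` the much smaller output directory, decompress it locally, and delete the staged copy from HDFS.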

Take a look at Hadoop HAR, which does exactly what is described above. It runs a MapReduce job and creates a compressed Hadoop Archive. You can download it using the `-copyToLocal` command and open it with WinRAR. For more information, take a look at Hadoop Archives.

  • I'm just surprised that some amount of compression isn't built into the `hadoop hdfs -get` and `hadoop hdfs -put` commands, or some equivalent commands. Don't a lot of people have to upload/download files in the GB or TB range? – Tom Panning Apr 28 '14 at 12:01
  • The MapR distribution has compression built in; however, I do not think any of the other distributions or core Hadoop have this facility as of now. More details on MapR: http://answers.mapr.com/questions/38/what-compression-algorithm-does-mapr-use – Sudarshan Apr 30 '14 at 08:44