I'm trying to retrieve a directory of text files that total several gigabytes from Hadoop HDFS. I can do this with
hadoop fs -get /path/to/directory/* .
But my link to the Hadoop cluster is about 1 MB/s, so that takes quite a while. Like most text files, these compress very well, so I would like them to be compressed for the download. Does the hadoop fs -get command automatically compress in transit (the way HTTP and many other protocols can)?
If not, what is the most straightforward way to get the files using compression? If it matters, the cluster is running CDH 4.5.0 and I do not have any kind of administrator rights on the cluster.
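The closest workaround I can think of is to run the compression on the cluster side and stream the result over ssh, assuming I have shell access to some machine with a fast link to HDFS (edge-node below is a made-up hostname):

# gzip runs next to the cluster, so only compressed bytes cross my slow link
ssh edge-node "hadoop fs -cat /path/to/directory/file.txt | gzip" > file.txt.gz

But that handles one file at a time and requires an edge node I can ssh into, so I'm hoping the hadoop client itself can do this.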
I've found this question, but it is about compressing a file to keep in HDFS. Getting and putting very large text files seems like a typical Hadoop use case, and it's well established that text files compress well, so there ought to be a way to compress the bytes in transit without creating, getting, and then deleting a compressed copy.
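Concretely, the round trip I'd like to avoid looks something like this, assuming the first step runs on a machine close to the cluster (the /tmp path is just illustrative):

# 1. make a compressed copy inside HDFS
hadoop fs -cat /path/to/directory/file.txt | gzip | hadoop fs -put - /tmp/file.txt.gz
# 2. pull the compressed copy over my slow link
hadoop fs -get /tmp/file.txt.gz .
# 3. clean up
hadoop fs -rm /tmp/file.txt.gz

That works, but it makes an extra full pass over several gigabytes of data just to save bandwidth on the download.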
I'll also accept an answer showing that this is a documented missing feature, either purposefully left out of Hadoop or expected to be added in some future release.