
I'm trying to load gzipped files from a directory on a remote machine into the HDFS on my local machine. I want to read the gzipped files on the remote machine and pipe them directly into HDFS, without copying them to the local filesystem first. This is what I've got on the local machine:

ssh remote-host "cd /files/wanted; tar -cf - *.gz" | tar -xf - | hadoop fs -put - "/files/hadoop"

This apparently copies all of the gzipped files from the remote path into the directory where I execute the command, and loads an empty file (-) into HDFS. The same thing happens if I try it without tar:

ssh remote-host "cd /files/wanted; cat *.gz" | hadoop fs -put - "/files/hadoop"

Just for shits and giggles, to see if I was missing something simple, I tried the following on my local machine:

tar -cf - *.gz | tar -xf - -C tmp

This did what I expected: it took all of the gzipped files in the current directory and put them in the existing directory tmp.

Then with the Hadoop part on the local machine:

cat my_file.gz | hadoop fs -put - "/files/hadoop"

This also did what I expected: it put my gzipped file into /files/hadoop on the HDFS.

Is it not possible to pipe multiple files into HDFS?

  • I read it again and again, and I couldn't understand which part exactly isn't working for you :-/ – maksimov Dec 19 '14 at 23:51
  • @maksimov The first two commands copy the files from the remote host to the local host, which isn't supposed to happen (so I thought). They should go straight into HDFS; for some reason piping multiple files into HDFS doesn't work. – kurczynski Dec 20 '14 at 02:55
  • This is relevant: http://stackoverflow.com/questions/11270509/putting-a-remote-file-into-hadoop-without-copying-it-to-local-disk. They are going the other way, but it may give you some clues. Note that the OP there found a performance issue when piping directly into HDFS. – maksimov Dec 20 '14 at 18:04
  • @maksimov Yeah, that's exactly what I'm able to do right now; it's multiple files that are the problem. Hm, I see where he mentions performance issues with piping, though that doesn't make sense to me. I guess I'll give both ways a try and see if that's the case for some odd reason. – kurczynski Dec 29 '14 at 23:00

1 Answer


For whatever reason, I can't seem to pipe multiple files into HDFS. So what I ended up doing was creating a background SSH session so I don't have to create one for every single file I want to load:

ssh -fNn remote-host
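# -f  go to the background after authentication
# -N  don't execute a remote command
# -n  redirect stdin from /dev/null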

Then I iterate over the list of files I need to load into HDFS and pipe each one in:

# The glob expands on the local machine, so this assumes the same /files/wanted
# paths are listable locally; $file holds the full path, hence the basename below.
for file in /files/wanted/*; do
  ssh -n remote-host "cat '$file'" | hadoop fs -put - "/files/hadoop/$(basename "$file")"
done
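
If the /files/wanted paths are only visible on the remote host, a variant along the same lines (an untested sketch) is to pull the file list over SSH first:

# Sketch: build the file list on the remote host instead of relying on a local glob.
ssh -n remote-host 'ls /files/wanted/*.gz' | while read -r file; do
  # -n keeps the inner ssh from swallowing the file list on stdin
  ssh -n remote-host "cat '$file'" | hadoop fs -put - "/files/hadoop/$(basename "$file")"
done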

Also make sure to close the SSH session:

ssh -O exit remote-host
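
Note that ssh -O exit only has a session to close if connection multiplexing is set up (ControlMaster/ControlPath, either in ~/.ssh/config or on the command line); without it, each ssh in the loop opens its own connection anyway. A minimal sketch with the options spelled out (the socket path and file names are just placeholders):

# Start a multiplexed master connection in the background; later ssh calls reuse it via -S.
ssh -M -S ~/.ssh/ctl-%r@%h:%p -fNn remote-host
# Each transfer goes over the existing master connection instead of a fresh one.
ssh -S ~/.ssh/ctl-%r@%h:%p -n remote-host "cat /files/wanted/file1.gz" | hadoop fs -put - /files/hadoop/file1.gz
# Tear the master connection down when finished.
ssh -S ~/.ssh/ctl-%r@%h:%p -O exit remote-host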