
Is it possible to store the output of the hadoop dfs -getmerge command on another machine?

The reason is that there is not enough space on my local machine: the job output is 100 GB and my local storage is only 60 GB.

Another possible reason is that I want to process the output with another program running on a different machine, and I don't want to transfer it twice (HDFS -> local FS -> remote machine). I just want HDFS -> remote machine.

I am looking for something similar to how scp works, like:

hadoop dfs -getmerge /user/hduser/Job-output user@someIP:/home/user/

Alternatively, I would also like to get the HDFS data from a remote host to my local machine.

Could Unix pipes be used in this case?

For those who are not familiar with Hadoop: I am just looking for a way to make the local destination directory parameter of this command point to a directory on a remote machine instead (the HDFS source here is /user/hduser/Job-output).
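For context, the standard form of the command writes the merged result to a local path; a minimal sketch of that usage (the local destination /tmp/mergedOutput.txt is just an illustrative placeholder):

hadoop fs -getmerge /user/hduser/Job-output /tmp/mergedOutput.txt

It is this second, local-destination argument that I would like to point at a remote machine.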

vefthym

1 Answer


This will do exactly what you need:

hadoop fs -cat /user/hduser/Job-output/* | ssh user@remotehost.com "cat >mergedOutput.txt"

hadoop fs -cat will read all the files in sequence and write them to stdout.

ssh will then redirect that stream into a file on the remote machine (note that scp will not accept stdin as input).
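If bandwidth to the remote host is a concern, a variation on the same idea is to compress the stream in flight. This is only a sketch, assuming gzip and gunzip are available on both ends and reusing the placeholders above:

hadoop fs -cat /user/hduser/Job-output/* | gzip -c | ssh user@remotehost.com "gunzip -c > mergedOutput.txt"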

markob
  • That was a great answer! Exactly what I needed! Actually, I wanted to connect to the remote host, where HDFS is, so the command is the other way round: `ssh user@remotehost.com "hadoop fs -cat /user/hduser/Job-output/part-*" | cat > mergedOutput.txt`. I edited your answer to include this command and also add `/Job-output/part-*`, instead of `/Job-output/*` to get only the results – vefthym Jul 15 '14 at 09:01
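Since the goal in the question is to feed the data into another program, the same pipe can also go straight into the processing command instead of a file. A sketch, where ./my-processor is a hypothetical placeholder for whatever program does the processing:

ssh user@remotehost.com "hadoop fs -cat /user/hduser/Job-output/part-*" | ./my-processor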