I have the same question as this other post: hadoop getmerge to another machine, but the answer there does not work for me.
To summarize what I want to do: getmerge (or just get the files) from the Hadoop cluster and, instead of copying to the local machine (which has little to no disk space), transfer them directly to a remote machine. My public key is in the remote machine's authorized_keys list, so no password authentication is necessary.
My usual command on the local machine (which merges the files and writes them to the local server/machine as a gzip file) is:
hadoop fs -getmerge folderName.on.cluster merged.files.in.that.folder.gz
I tried what the other post suggested:
hadoop fs -cat folderName.on.cluster/* | ssh user@remotehost.com:/storage | "cat > mergedoutput.txt"
This did not work for me; I get errors like these:
Pseudo-terminal will not be allocated because stdin is not a terminal.
ssh: Could not resolve hostname user@remotehost.com:/storage /: Name or service not known
I also tried it the other way around:
ssh user@remotehost.com:/storage "hadoop fs -cat folderName.on.cluster/*" | cat > mergedoutput.txt
Then:
-bash: cat > mergedoutput.txt: command not found
Pseudo-terminal will not be allocated because stdin is not a terminal.
-bash: line 1: syntax error near unexpected token `('
Any help is appreciated. I also don't need to use -getmerge; I could use -get and then merge the files once they are copied over to the remote machine. Another alternative would be a way to run a command on the remote server that pulls the files directly from the Hadoop cluster, as sketched below.
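For instance, if the remote machine has a Hadoop client installed and configured to reach the cluster, I imagine something along these lines might work (untested sketch; remotehost.com, /storage and mergedoutput.txt are just the placeholder names from above):

ssh user@remotehost.com "hadoop fs -cat folderName.on.cluster/* > /storage/mergedoutput.txt"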
Thanks
Figured it out
hadoop fs -cat folderName.on.cluster/* | ssh user@remotehost.com "cd storage; cat > mergedoutput.txt"
This is what works for me. Thanks to @vefthym for the help.
This merges the files in the directory on the Hadoop cluster and writes the result straight to the remote host, without copying anything to the local host (which is pretty full already). The earlier attempts failed because ssh only takes user@host as its destination; unlike scp, it does not accept a :/path suffix, so the target path has to go inside the quoted remote command. Since I need the file to end up in a particular directory, I change into it first, hence the cd storage; before the cat that writes the merged output.
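In case it helps anyone else, a couple of untested variants of the same idea, using the same placeholder host and paths: write to an absolute path instead of using cd, and optionally gzip the stream before it goes over ssh (only worth it if the source files are not already compressed):

hadoop fs -cat folderName.on.cluster/* | ssh user@remotehost.com "cat > /storage/mergedoutput.txt"
hadoop fs -cat folderName.on.cluster/* | gzip | ssh user@remotehost.com "cat > /storage/mergedoutput.gz"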