
I have the same question as this other post, "hadoop getmerge to another machine", but the answer there does not work for me.

To summarize what I want to do: getmerge (or just get) the files from the Hadoop cluster, NOT copy them to the local machine (which has little or no disk space), but transfer them directly to a remote machine. My public key is in the remote machine's authorized_keys list, so no password authentication is necessary.

My usual command on the local machine, which merges the files and writes them to the local server/machine as a gzip file, is:

hadoop fs -getmerge folderName.on.cluster merged.files.in.that.folder.gz

I tried the command from the other post:

hadoop fs -cat folderName.on.cluster/* | ssh user@remotehost.com:/storage | "cat > mergedoutput.txt"

This did not work for me; I get errors like this:

Pseudo-terminal will not be allocated because stdin is not a terminal. ssh: Could not resolve hostname user@remotehost.com:/storage /: Name or service not known

I also tried it the other way around, ssh user@remotehost.com:/storage "hadoop fs -cat folderName.on.cluster/*" | cat > mergedoutput.txt, and then I get:

-bash: cat > mergedoutput.txt: command not found
Pseudo-terminal will not be allocated because stdin is not a terminal.
-bash: line 1: syntax error near unexpected token `('

Any help is appreciated. I don't have to use -getmerge; I could also do -get and then merge the files once they are copied over to the remote machine. Another alternative would be a way to run a command on the remote server that copies the files directly from the Hadoop cluster.
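
For that last alternative, something like this is what I have in mind (an untested sketch; it assumes the remote machine has a Hadoop client configured against the same cluster, and /storage is just the target directory from my attempts above):

# untested sketch: run the merge from the remote host itself,
# so nothing passes through the local machine at all
# (only possible if that host has a Hadoop client pointed at the same cluster)
ssh user@remotehost.com "hadoop fs -cat folderName.on.cluster/* > /storage/mergedoutput.txt"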

Thanks

Figured it out:

hadoop fs -cat folderName.on.cluster/* | ssh user@remotehost.com "cd storage; cat > mergedoutput.txt"

This is what works for me. Thanks to @vefthym for the help.

This merges the files in the directory on the Hadoop cluster and writes the result straight to the remote host, without copying anything to the local host, YAY (it's pretty full already). Since I need the file to end up in a different directory on the remote host, I change into it first, hence the cd storage; before the cat > mergedoutput.txt.
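
A variation I have not tried: point the redirect at the full path so the cd isn't needed, and gzip on the fly if you want a compressed result like my usual -getmerge output (this assumes the part files are plain text; if they are already gzipped, hadoop fs -cat already streams valid gzip data and the extra gzip step should be dropped):

# untested variation: absolute path instead of cd, compress while streaming
hadoop fs -cat folderName.on.cluster/* | gzip -c | ssh user@remotehost.com "cat > /storage/mergedoutput.gz"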

KBA

2 Answers


I'm glad that you found my question useful!

I think your problem is just with the ssh call, not with the approach you describe; it worked perfectly for me. By the way, in the first command you have an extra '|' character. What do you get if you just type ssh user@remotehost.com? Do you type a name or an IP? If you type a name, it should exist in the /etc/hosts file.
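
Also, judging from the error message, I suspect the :/storage part is what breaks it: ssh's destination is only user@host, so the colon and path end up in the name it tries to resolve. The target path has to go inside the remote command instead, roughly like this (host and path taken from your question):

# the path belongs inside the remote command, not in the ssh destination
ssh user@remotehost.com "cat > /storage/mergedoutput.txt"     # destination resolves
ssh user@remotehost.com:/storage "cat > mergedoutput.txt"     # ':/storage' becomes part of the hostname lookup and fails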

Based on this post, I guess you are using Cygwin and have some misconfiguration. Apart from the accepted solution there, check whether you have installed the openssh Cygwin package, as the second-highest-voted answer suggests.

vefthym
  • Thanks for your reply. I ssh to a name (not a dotted IP), and it does not exist in the `/etc/hosts` file (which is owned by root; I am just a user on the server). It is not Cygwin; the ssh version is `OpenSSH_5.5p1 Debian-6, OpenSSL 0.9.8o 01 Jun 2010`. I just tried the 1st command without the extra '|', `hadoop fs -cat folderName.on.cluster/* | ssh user@remotehost.com:/storage "cat > mergedoutput.txt"`, and I get `ssh: Could not resolve hostname user@remotehost.com:/storage : Name or service not known` followed by `cat: Unable to write to output stream.` repeated for each file in the dir – KBA Dec 25 '14 at 03:28
  • So I am thinking the above happened because `remotehost.com` is not in the `/etc/hosts` file, and I need to ask the manager of the server to add it? Thanks! – KBA Dec 25 '14 at 03:30
  • Or just type the IP address – vefthym Dec 25 '14 at 09:07
  • I got it to partially work: I used the IP address, and then the hostname, and it worked that way as well. The remaining issue was that it would only merge and copy the files into the default directory; I cannot specify a path like `:/storage` to ssh. So I figured out that this works: `hadoop fs -cat folderName.on.cluster/* | ssh user@remotehost.com "cd storage; cat > mergedoutput.txt"` – KBA Dec 26 '14 at 05:04

hadoop fs -cat folderName.on.cluster/* | ssh user@remotehost.com "cd storage; cat > mergedoutput.txt"

This is what works for me. Thanks to @vefthym for the help.

This merges the files in the directory on the Hadoop cluster and writes the result straight to the remote host, without copying anything to the local host, YAY (it's pretty full already). Since I need the file to end up in a different directory on the remote host, I change into it first, hence the cd storage; before the cat > mergedoutput.txt.
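
For a quick sanity check that everything arrived, comparing sizes is probably enough (a rough sketch; hadoop fs -du -s prints the total size of the directory on HDFS in newer Hadoop versions, older ones use -dus):

# rough sanity check: total size on HDFS vs. size of the merged file on the remote host
hadoop fs -du -s folderName.on.cluster
ssh user@remotehost.com "ls -l storage/mergedoutput.txt"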

KBA