
I've set up a distributed Hadoop environment within VirtualBox: four virtual Ubuntu 11.10 installations, one acting as the master node and the other three as slaves. I followed this tutorial to get the single-node version up and running and then converted it to the fully-distributed version. It was working just fine under 11.04; however, when I upgraded to 11.10, it broke. Now all my slaves' logs show the following error message, repeated ad nauseam:

INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/192.168.1.10:54310. Already tried 0 time(s).
INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/192.168.1.10:54310. Already tried 1 time(s).
INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/192.168.1.10:54310. Already tried 2 time(s).

And so on. I've found other instances of this error message on the Internet (and on StackOverflow), but none of the solutions have worked: I tried changing the core-site.xml and mapred-site.xml entries to use the IP address rather than the hostname; I quadruple-checked /etc/hosts on all slaves and on the master; and the master can SSH password-less into all slaves. I even tried reverting each slave back to a single-node setup, and they all worked fine in that case (on that note, the master always works fine as both a DataNode and the NameNode).
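
For reference, the entries I was swapping between hostname and IP look roughly like this (fs.default.name and mapred.job.tracker being the relevant properties in this generation of Hadoop):

In core-site.xml:

<property>
  <name>fs.default.name</name>
  <value>hdfs://192.168.1.10:54310</value>
</property>

In mapred-site.xml:

<property>
  <name>mapred.job.tracker</name>
  <value>192.168.1.10:54311</value>
</property>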

The only symptom I've found that seems to offer a lead is that from any of the slaves, when I attempt a telnet 192.168.1.10 54310, I get Connection refused, suggesting some rule is blocking access (one that must have gone into effect when I upgraded to 11.10).

My /etc/hosts.allow has not changed, however. I tried the rule ALL: 192.168.1., but it did not change the behavior.

Oh yes, and netstat on the master clearly shows TCP ports 54310 and 54311 are listening.
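
(For the record, the check I ran was along these lines; the exact flags vary slightly between netstat versions:)

sudo netstat -tlpn | grep -E ':5431[01]'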

Anyone have any suggestions to get the slave Datanodes to recognize the Namenode?

EDIT #1: In doing some poking around with nmap (see comments on this post), I'm thinking the issue is in my /etc/hosts files. This is what is listed for the master VM:

127.0.0.1    localhost
127.0.1.1    master
192.168.1.10 master
192.168.1.11 slave1
192.168.1.12 slave2
192.168.1.13 slave3

For each slave VM:

127.0.0.1    localhost
127.0.1.1    slaveX
192.168.1.10 master
192.168.1.1X slaveX

Unfortunately, I'm not sure what I changed, but the NameNode now always dies with an exception about trying to bind a port that's "already in use" (127.0.1.1:54310). I'm clearly doing something wrong with the hostnames and IP addresses, but I'm really not sure what it is. Thoughts?

Magsol
  • Are you running a firewall? Also, is the Master's IP still 192.168.1.10? Stupid questions, but sometimes people miss the obvious stuff. – Chris Shain Jan 16 '12 at 04:34
  • Install gufw using the `sudo apt-get install gufw` command and check the firewall settings. Also check the [network connection type](http://www.virtualbox.org/manual/ch06.html) in [VirtualBox](http://www.virtualbox.org/manual/ch06.html). – Praveen Sripati Jan 16 '12 at 04:43
  • `Anyone have any suggestions to get the slave Datanodes to recognize the Namenode?` - this is more of an Ubuntu query than a Hadoop one. It should be `how to get the slave VMs to talk to the master VM`. – Praveen Sripati Jan 16 '12 at 06:29
  • @ChrisShain: It's the default Ubuntu 11.10 setup: no active firewall, and the IP is still the same (I have my router set to provide static IPs based on MAC address; my VirtualBox is set up to provide bridged networking, so the MAC addresses of each VM should remain the same as well). Always good to have these question asked :) – Magsol Jan 16 '12 at 15:49
  • @PraveenSripati: Network connection type is Bridged; that hasn't changed since the initial setup when it was working fine under 11.04. It may very well turn out to be more of an Ubuntu query than a Hadoop one, but because I don't know where the problem is, it may be a misconfigured Hadoop setup (though at this point I'm thinking not) or a misconfigured Ubuntu network (most likely). – Magsol Jan 16 '12 at 15:51
  • Has anything at all changed about how you connect the physical machine to the network? Any chance that you went from wired to Wifi? Lots of Wifi routers by default are configured for access point isolation. Also, can you ping the host machine from the VMs, and vice versa? – Chris Shain Jan 16 '12 at 16:00
  • @ChrisShain: No, the physical machine is a wired desktop, that has not changed. I have not altered any settings on the router in this time, either. The host machine can ping all the VMs and vice versa; in addition, the slave VMs can ping the master VM and SSH (though I have not set up password-less SSH from slave to master, but that is not a requirement for Hadoop). I'll test out the `gufw` suggestion above ASAP. – Magsol Jan 16 '12 at 19:08
  • Yeah almost guaranteed to be a firewall issue then, since connectivity works and the master is listening. Not too many other possibilities. – Chris Shain Jan 16 '12 at 19:09
  • Hmm. I installed `gufw`, turned on the firewall, and set to accept incoming connections from 54310 and 54311; though to experiment, I also set it to accept all connections. In neither case were the slaves able to connect. Furthermore, I (finally) installed `nmap` on the slaves: my HTTP and non-standard SSH ports show up as open, but `nmap -sU -p 54310,54311 192.168.1.10` shows definitive `closed` ports, even when I set the firewall to Allow all. I'm kind of out of ideas here. – Magsol Jan 17 '12 at 02:53
  • Just noticed: Hadoop binds a bunch of non-standard ports for viewing the status over HTTP (50030, etc), and these work just fine. I noticed in `netstat` that the binds look like: `0.0.0.0:50030`, but the bind for the NameNode/TaskTracker look like: `127.0.1.1:54310`. Is this difference significant? – Magsol Jan 17 '12 at 02:58
  • Ok, the problem has to be in the master's `/etc/hosts` file. See my edits to the question. – Magsol Jan 17 '12 at 03:33

6 Answers


I found it! By commenting out the second line of the /etc/hosts file (the one with the 127.0.1.1 entry), netstat shows the NameNode ports binding to the 192.168.1.10 address instead of the local one, and the slave VMs found it. Ahhhhhhhh. Mystery solved! Thanks for everyone's help.
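
For anyone following along, the master's /etc/hosts now reads as follows, with only the 127.0.1.1 line changed:

127.0.0.1    localhost
#127.0.1.1   master
192.168.1.10 master
192.168.1.11 slave1
192.168.1.12 slave2
192.168.1.13 slave3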

Magsol
  • Thanks, mate; I had the same problem and had been trying this and that for hours. Cheers. – Mr.Zeiss Jun 16 '13 at 21:30
  • You mean to say, commenting out the 127.0.0.1 entry, the one with localhost localhost.localdomain...? – Yogesh D Mar 13 '17 at 20:25
  • No, the `127.0.1.1` entry. – Magsol Mar 13 '17 at 20:47
  • I am still facing the same issue; could you please help? My namenode is running on 192.168.1.200:9000, and I don't have an entry with 127.0.1.1, only the 127.0.0.1 localhost one. – Yogesh D Mar 13 '17 at 20:51
  • If `127.0.0.1 localhost` is the *only* entry in your hosts file, that could be the problem. You have to refer to the namenode by an address that is visible from the network, e.g. something in the `192.168.xxx.xxx` range, and this entry usually goes in the hosts file after the `localhost` entry. – Magsol Mar 14 '17 at 13:16
  • Also be sure to read the rest of the answers to this question; they provide additional useful information on troubleshooting the problem. – Magsol Mar 14 '17 at 13:17

This solution worked for me, i.e. make sure that the name you use in this property in core-site.xml and mapred-site.xml:

<property>
  <name>fs.default.name</name>
  <value>hdfs://master:54310</value>
  <final>true</final>
</property>

That is, master must be defined in /etc/hosts as xyz.xyz.xyz.xyz master on BOTH the master and slave nodes. Then restart the NameNode and check using netstat -tuplen to see that it is bound to the "external" IP address:

tcp        0      0 xyz.xyz.xyz.xyz:54310       0.0.0.0:*          LISTEN      102        107203     -

and NOT the local IP 192.168.x.y or 127.0.x.y.

devl

I had the same trouble. @Magsol's solution worked, but it should be noted that the entry that needs to be commented out is

127.0.1.1 masterxyz

on the master machine, not the 127.0.1.1 entry on the slave (though I commented that out too). You also need to run stop-all.sh and start-all.sh for Hadoop, as shown below; probably obvious, but worth stating.
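
A minimal restart sequence, assuming the 1.x-era scripts in Hadoop's bin directory:

$HADOOP_HOME/bin/stop-all.sh
$HADOOP_HOME/bin/start-all.sh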

Once you have restarted Hadoop, check the JobTracker here: http://masterxyz:50030/jobtracker.jsp

and look at the number of nodes available for jobs.

pferrel
  • Thanks pferrel for making it clear that it's just the namenode looping back on localhost, and that we need only modify /etc/hosts (remove 127.0.1.1) and restart all the Hadoop processes. – user1501382 Feb 01 '15 at 05:40

Though this response is not the solution the author was looking for, other users might land on this page with the same symptom, so: if you are using AWS to set up your cluster, it is likely that ICMP security rules haven't been enabled in the AWS Security Groups page. Look at the following: Pinging EC2 instances

The above solved the connectivity issue from data nodes to master nodes. Ensure that you can ping between each instance.
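
As a sketch only (the security group ID and CIDR here are placeholders for your own values), the equivalent rule can be added from the AWS CLI:

aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol icmp --port -1 --cidr 10.0.0.0/16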

MasterV

I am running a 2-node cluster.

192.168.0.24 master
192.168.0.26 worker2

I was facing the same problem of Retrying connect to server: master/192.168.0.24:54310 in my worker2 machine's logs. But while the people above encountered errors running telnet 192.168.0.24 54310, in my case the telnet command worked fine. So I checked my /etc/hosts files:

master /etc/hosts
127.0.0.1 localhost
192.168.0.24 ubuntu
192.168.0.24 master
192.168.0.26 worker2

worker2 /etc/hosts
127.0.0.1 localhost
192.168.0.26 ubuntu
192.168.0.24 master
192.168.0.26 worker2

When I hit http://localhost:50070 on the master, I saw Live Nodes: 2. But when I clicked on it, I saw only one datanode, which was the master's. I checked jps on both the master and worker2; the DataNode process was running on both machines.

Then, after several trials and errors, I realized that my master and worker2 machines had the same hostname, "ubuntu". I changed worker2's hostname from "ubuntu" to "worker2" and removed the "ubuntu" entry from worker2's /etc/hosts file.
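
After the change, worker2's /etc/hosts looked like this:

127.0.0.1 localhost
192.168.0.24 master
192.168.0.26 worker2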

Note: to change the hostname, edit /etc/hostname with sudo.
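
For instance, something along these lines, followed by a reboot (or sudo hostname worker2 to apply it to the running system; exact behavior varies by Ubuntu release):

sudo sh -c 'echo worker2 > /etc/hostname'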

Bingo! It worked :) I was able to see two datanodes on the dfshealth UI page (localhost:50070).

Vignesh Iyer

I also faced a similar issue (I am using Ubuntu 17.0). I kept only the entries for the master and slaves in the /etc/hosts file, on both the master and slave machines:

127.0.0.1  localhost
192.168.201.101 master
192.168.201.102 slave1
192.168.201.103 slave2

Secondly, I edited /etc/hosts.allow (sudo gedit /etc/hosts.allow) and added the entry: ALL:192.168.201.

Thirdly, I disabled the firewall using sudo ufw disable.

Finally, I deleted both the namenode and datanode folders from all the nodes in the cluster, as sketched below.
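
On my nodes the deletion looked something like the following; the actual paths are whatever dfs.namenode.name.dir and dfs.datanode.data.dir point to in hdfs-site.xml, so treat these as placeholders:

rm -rf /usr/local/hadoop_store/hdfs/namenode /usr/local/hadoop_store/hdfs/datanode

Then I reran: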

$HADOOP_HOME/bin> hdfs namenode -format -force
$HADOOP_HOME/sbin> ./start-dfs.sh
$HADOOP_HOME/sbin> ./start-yarn.sh

To check the health report from the command line (which I would recommend):

$HADOOP_HOME/bin> hdfs dfsadmin -report

and I got all the nodes working correctly.

Raxit Solanki