10

I've set up a small Hadoop cluster for testing. Setup went fairly well with the NameNode (1 machine), the SecondaryNameNode (1) and all the DataNodes (3). The machines are named "master", "secondary", "data01", "data02" and "data03". DNS is properly set up for all of them, and passwordless SSH was configured from master/secondary to all machines and back.
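
For reference, the passwordless SSH setup was roughly the following (a sketch; the key file location and the "hd" user are assumptions based on the paths shown later in this question):

# run on master (and repeated on secondary); ~/.ssh/id_rsa is assumed
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
for host in master secondary data01 data02 data03; do
  ssh-copy-id -i ~/.ssh/id_rsa.pub hd@$host
done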

I formatted the cluster with bin/hadoop namenode -format and then started all services with bin/start-all.sh. I checked with jps that all processes on all nodes are up and running. My basic configuration files look something like this:

<!-- conf/core-site.xml -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <!-- 
      on the master it's localhost
      on the others it's the master's DNS
      (ping works from everywhere)
    -->
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <!-- I picked /hdfs for the root FS -->
    <value>/hdfs/tmp</value>
  </property>
</configuration>

<!-- conf/hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/hdfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/hdfs/data</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>

# conf/masters
secondary

# conf/slaves
data01
data02
data03

I'm just trying to get HDFS running properly now.
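
Before copying files, a quick sanity check for whether any DataNodes have actually registered with the NameNode is dfsadmin (a sketch; output format may vary slightly by version):

# run on the master; "Datanodes available: 0" would explain the replication error below
bin/hadoop dfsadmin -report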

I created a directory for testing with hadoop fs -mkdir testing, then tried to copy some files into it with hadoop fs -copyFromLocal /tmp/*.txt testing. This is when Hadoop fails, giving me more or less this:

WARN hdfs.DFSClient: DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/hd/testing/wordcount1.txt could only be replicated to 0 nodes, instead of 1
  at ... (such and such)

WARN hdfs.DFSClient: Error Recovery for block null bad datanode[0] nodes == null
  at ...

WARN hdfs.DFSClient: Could not get block locations. Source file "/user/hd/testing/wordcount1.txt" - Aborting...
  at ...

ERROR hdfs.DFSClient: Exception closing file /user/hd/testing/wordcount1.txt: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/hd/testing/wordcount1.txt could only be replicated to 0 nodes, instead of 1
  at ...

And so on. A similar issue occurs when I try to run hadoop fs -lsr . from a DataNode machine, only to get the following:

12/01/02 10:02:11 INFO ipc.Client: Retrying connect to server master/192.162.10.10:9000. Already tried 0 time(s).
12/01/02 10:02:12 INFO ipc.Client: Retrying connect to server master/192.162.10.10:9000. Already tried 1 time(s).
12/01/02 10:02:13 INFO ipc.Client: Retrying connect to server master/192.162.10.10:9000. Already tried 2 time(s).
...

I'm saying it's similar, because I suspect this is a port availability issue. Running telnet master 9000 reveals that the port is closed. I've read somewhere that this might be an IPv6 clash issue, and thus defined the following in conf/hadoop-env.sh:

export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true

But that didn't do the trick. Running netstat on the master reveals something like this:

Proto Recv-Q Send-Q  Local Address       Foreign Address      State
tcp        0      0  localhost:9000      localhost:56387      ESTABLISHED
tcp        0      0  localhost:56386     localhost:9000       TIME_WAIT
tcp        0      0  localhost:56387     localhost:9000       ESTABLISHED
tcp        0      0  localhost:56384     localhost:9000       TIME_WAIT
tcp        0      0  localhost:56385     localhost:9000       TIME_WAIT
tcp        0      0  localhost:56383     localhost:9000       TIME_WAIT

At this point I'm pretty sure the problem is with the port (9000), but I'm not sure what I missed as far as configuration goes. Any ideas? Thanks.

update

I found that hard-coding the DNS names into /etc/hosts not only helps resolve this, but also speeds up the connections. The downside is that you have to do this on all the machines in the cluster, and again whenever you add new nodes. Or you can just set up a DNS server, which I didn't.

Here's a sample from one node in my cluster (nodes are named hadoop01, hadoop02, etc., with the master and secondary being 01 and 02). Note that most of it is generated by the OS:

# this is a sample for a machine with dns hadoop01
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allroutes

# --- Start list of nodes
192.168.10.101 hadoop01
192.168.10.102 hadoop02
192.168.10.103 hadoop03
192.168.10.104 hadoop04
192.168.10.105 hadoop05
192.168.10.106 hadoop06
192.168.10.107 hadoop07
192.168.10.108 hadoop08
192.168.10.109 hadoop09
192.168.10.110 hadoop10
# ... and so on

# --- End list of nodes

# Auto-generated hostname. Please do not remove this comment.
127.0.0.1 hadoop01 localhost localhost.localdomain

Hope this helps.

– sa125

2 Answers

9

Replace localhost in hdfs://localhost:9000 with the IP address or hostname of the NameNode in the fs.default.name property whenever remote nodes connect to the NameNode.
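
For example, with the hostnames from the question, core-site.xml on every node (master included) would point at the master rather than at localhost; a sketch:

<!-- conf/core-site.xml, identical on all nodes -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>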

All processes on all nodes were checked to be up and running with jps

jps only confirms that the process is running; there might still be errors in the log files.
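
For example (a sketch; the paths assume the default $HADOOP_HOME/logs directory and the standard log file naming):

# on the master
tail -n 100 $HADOOP_HOME/logs/hadoop-*-namenode-*.log
# on each DataNode
tail -n 100 $HADOOP_HOME/logs/hadoop-*-datanode-*.log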

– Praveen Sripati
  • you're right - it turned out to be a DNS resolution issue. It seems the term localhost is confusing for all the machines. What I ended up doing was editing /etc/hosts on all the servers and setting the DNS manually. Thanks! – sa125 Jan 03 '12 at 07:21
  • @sa125 could you post an example of your /etc/hosts to illustrate your changes? I'm stuck with the same problem. – romedius Sep 24 '12 at 07:26
  • Thanks, I got it to run just before leaving work :-) Just one question: why do you have `hadoop01` as an alias for the full IP and localhost? Are there benefits from this configuration? – romedius Sep 24 '12 at 10:47
  • Funny, I added the master as a synonym for localhost + IP, instead of setting the IP in core-site.xml, and the sample application runs faster now: 1:24 instead of 15 minutes. (2 VMs, fully distributed mode; the example is `hadoop jar /usr/share/hadoop/hadoop-examples-*.jar grep input output 'dfs[a-z.]+'`.) Thanks a lot! – romedius Sep 25 '12 at 02:17
0

Correct your /etc/hosts file to include localhost, or correct your core-site.xml file to specify the IP or hostname of the node that hosts the HDFS filesystem.
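
For example (a sketch; the "master" hostname and the 192.168.10.10 address are placeholders, substitute your NameNode's real name and IP):

# /etc/hosts on every node; make sure a "127.0.1.1 <hostname>" line does not
# shadow the real address of the NameNode host
127.0.0.1      localhost
192.168.10.10  master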

– Baban Gaigole
  • I have met this problem too! My machine had a default setting like `127.0.1.1 ubuntu01` (ubuntu01 is one of my cluster's datanodes), so if you use this setting to start the namenode, it cannot listen for connections from the other datanodes' IPs. – djzhu Dec 03 '17 at 09:17