With the HDFS or HFTP URI scheme (e.g. hdfs://namenode/path/to/file) I can access HDFS clusters without needing their XML configuration files. This is very handy when running shell commands like hdfs dfs -get or hadoop distcp, or when reading files from Spark with sc.hadoopFile(), because I don't have to copy and manage XML files for every relevant HDFS cluster on every node where that code might potentially run.

One drawback of this approach is that I have to use the active NameNode's hostname, otherwise Hadoop will throw an exception complaining that the NN is standby.

The usual workaround is to try one NameNode and then try the other if an exception is caught, or to connect to ZooKeeper directly and parse the binary znode data using protobuf.
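For concreteness, here is a minimal sketch of that first, try-and-fall-back approach, assuming the plain Hadoop FileSystem API and two hypothetical NameNode hosts nn1 and nn2 (the class and method names are illustrative):

import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NaiveNnFailover {
  static FileSystem connectToActive(Configuration conf) throws IOException {
    for (String host : new String[] {"nn1", "nn2"}) {
      try {
        FileSystem fs = FileSystem.get(URI.create("hdfs://" + host), conf);
        fs.exists(new Path("/"));  // force an RPC; a standby NameNode typically rejects it
        return fs;
      } catch (IOException e) {
        // typically a RemoteException wrapping StandbyException; try the next host
      }
    }
    throw new IOException("No active NameNode found among nn1, nn2");
  }
}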

Both of these methods are cumbersome compared to (for example) MySQL's load-balancing URI or ZooKeeper's connection string, where I can just comma-separate all the hosts in the URI and the driver automatically finds a node to talk to.

Say I have active and standby NameNode hosts nn1 and nn2. What is the simplest way to refer to a specific HDFS path that:

  • can be used in command-line tools like hdfs and hadoop
  • can be used in the Hadoop Java API (and thus in tools depending on it, like Spark) with minimal configuration
  • works regardless of which NameNode is currently active.
lyomi
  • @JulianV.Modesto no luck yet :( My team's using a simple Java application that I've made, which queries ZooKeeper and parses the active NameNode. – lyomi Jan 17 '16 at 07:56
  • Thanks @lyomi, good to know. I've ended up doing that as well. – Julian V. Modesto Jan 18 '16 at 18:25
  • In Spark you can refer to the NameNode service where you would normally put your namenode+port, e.g. sc.textFile( s"hdfs://mynamenodeservice/user/bla/blub.csv" ) – Frank Legler Apr 26 '16 at 19:27
  • No, I haven't found a solution, and your answer didn't solve my problem as I explained below. I'm leaving this question open because I want to know when Hadoop finally has this feature. – lyomi Oct 11 '17 at 16:34

1 Answer

In this scenario, instead of checking for the active NameNode's host and port combination, we should use a nameservice, since a nameservice automatically routes client requests to the active NameNode.

A nameservice acts like a proxy in front of the NameNodes, always diverting HDFS requests to the active NameNode.

Example: hdfs://nameservice_id/file/path/in/hdfs


Sample steps to create a nameservice

In the hdfs-site.xml file:

Create a nameservice by giving it an ID (here the nameservice ID is mycluster):

<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
  <description>Logical name for this new nameservice</description>
</property>

Now specify the NameNode IDs that identify the NameNodes in the cluster:

dfs.ha.namenodes.[$nameservice ID]:

<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
  <description>Unique identifiers for each NameNode in the nameservice</description>
</property>

Then link the NameNode IDs to the NameNode hosts:

dfs.namenode.rpc-address.[$nameservice ID].[$namenode ID]:

<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>machine1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>machine2.example.com:8020</value>
</property>

There are quite a few more properties involved in configuring NameNode HA properly with a nameservice. In particular, as the comments below point out, clients also need dfs.client.failover.proxy.provider.[$nameservice ID] set, or they may fail with java.net.UnknownHostException.
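A hedged example of that property, assuming the stock ConfiguredFailoverProxyProvider class that ships with Hadoop:

<property>
  <name>dfs.client.failover.proxy.provider.mycluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  <description>Client-side class that determines which NameNode is currently active</description>
</property>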

With this setup, the HDFS URL for a file will look like this:

hdfs://mycluster/file/location/in/hdfs/wo/namenode/host
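Once these properties are visible to the client (e.g. in its hdfs-site.xml), the same nameservice URI should work from the command-line tools as well. A small sketch, reusing the example path above (the local destination path is illustrative):

hdfs dfs -ls hdfs://mycluster/file/location/in/hdfs
hdfs dfs -get hdfs://mycluster/file/location/in/hdfs /tmp/local-copy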

Edit:

Applying the properties in Java code:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration(false);
conf.set("dfs.nameservices", "mycluster");
conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
conf.set("dfs.namenode.rpc-address.mycluster.nn1", "machine1.example.com:8020");
conf.set("dfs.namenode.rpc-address.mycluster.nn2", "machine2.example.com:8020");
// client-side failover provider; without it the client cannot resolve "mycluster"
conf.set("dfs.client.failover.proxy.provider.mycluster",
    "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

// FileSystem.get() takes a URI and a Configuration, not a path string
FileSystem fsObj = FileSystem.get(URI.create("hdfs://mycluster"), conf);

// now use fsObj to perform HDFS shell-like operations
boolean found = fsObj.exists(new Path("/file/location/in/hdfs"));
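Since the question explicitly asks to avoid managing XML files, note that the command-line tools accept these same properties through Hadoop's generic -D options, so no hdfs-site.xml is strictly needed on the client. A sketch, assuming the example hosts above:

hdfs dfs \
  -Ddfs.nameservices=mycluster \
  -Ddfs.ha.namenodes.mycluster=nn1,nn2 \
  -Ddfs.namenode.rpc-address.mycluster.nn1=machine1.example.com:8020 \
  -Ddfs.namenode.rpc-address.mycluster.nn2=machine2.example.com:8020 \
  -Ddfs.client.failover.proxy.provider.mycluster=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider \
  -ls hdfs://mycluster/file/location/in/hdfs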
mrsrinivas
  • I appreciate your time writing this, but I explicitly said **without requiring their XML configuration files**. XML configs can sometimes be extremely annoying, e.g. when dist-copying from one HA cluster to another. – lyomi Aug 21 '17 at 10:01
  • Have a look at this answer: https://stackoverflow.com/a/35911455/1592191. We can apply the properties with XML as well as Java code. The key is knowing which properties need to be modified. – mrsrinivas Aug 21 '17 at 10:37
  • I got everything set up, but when I try to access it through the nameservice, the app shows java.net.UnknownHostException: mycluster – Jerome Li Jun 29 '21 at 22:29
  • I also got the error java.net.UnknownHostException: mycluster. How do I resolve the HDFS service name in a job? – behrooz razzaghi May 31 '23 at 09:43
  • Isn't `dfs.client.failover.proxy.provider.NS_ID` required? I've got `UnknownHostException` without it. https://stackoverflow.com/q/35568042 – Winand Jul 18 '23 at 07:39