With the HDFS or HFTP URI scheme (e.g. `hdfs://namenode/path/to/file`) I can access an HDFS cluster without needing its XML configuration files. This is very handy when running shell commands like `hdfs dfs -get` or `hadoop distcp`, or when reading files from Spark via `sc.hadoopFile()`, because I don't have to copy and manage the XML files of every relevant HDFS cluster onto every node where that code might potentially run.
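To make it concrete, this is roughly what I mean, using the Hadoop Java API (the host `nn1`, the RPC port 8020 and the file path are just placeholders):

```java
import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadViaUri {
    public static void main(String[] args) throws Exception {
        // No core-site.xml/hdfs-site.xml needed: the cluster is identified
        // entirely by the URI authority (NameNode host and RPC port).
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://nn1:8020/"), conf);
        try (InputStream in = fs.open(new Path("/path/to/file"))) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}
```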
One drawback of this approach is that I have to use the active NameNode's hostname; otherwise Hadoop throws an exception complaining that the NameNode is in standby.

A usual workaround is to try one NameNode and fall back to the other if an exception is caught, or to connect to ZooKeeper directly and parse the binary znode data with protobuf. Both methods are cumbersome compared to, for example, MySQL's loadbalance URI or ZooKeeper's connection string, where I can simply comma-separate all the hosts in the URI and the driver automatically finds a live node to talk to.
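For illustration, the try-and-fall-back workaround looks roughly like the sketch below (the host names, the port 8020 and the exact way the standby rejection surfaces are my assumptions; this is hand-rolled, not a library feature):

```java
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ManualFailover {
    // Try each candidate NameNode in turn and keep the first one that
    // answers an RPC; a standby NameNode rejects the call with an
    // IOException (a RemoteException wrapping StandbyException).
    static FileSystem connect(Configuration conf, String... hosts) throws IOException {
        IOException last = null;
        for (String host : hosts) {
            try {
                FileSystem fs = FileSystem.get(URI.create("hdfs://" + host + ":8020/"), conf);
                fs.getFileStatus(new Path("/")); // force an RPC to detect a standby
                return fs;
            } catch (IOException e) {
                last = e;
            }
        }
        throw last;
    }

    public static void main(String[] args) throws IOException {
        FileSystem fs = connect(new Configuration(), "nn1", "nn2");
        System.out.println("Talking to " + fs.getUri());
    }
}
```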
Say I have active and standby NameNode hosts `nn1` and `nn2`. What is the simplest way to refer to a specific HDFS path which:
- can be used in command-line tools like `hdfs` and `hadoop`
- can be used in the Hadoop Java API (and thus in tools that depend on it, like Spark) with minimal configuration
- works regardless of which NameNode is currently active.