
I have a few basic questions regarding HDFS Federation.

Is it possible to read a file created via one Namenode from another Namenode in the same federated cluster?

Does the current version of Hadoop support this feature?

Ravindra babu
  • I have the following requirement; can you please suggest whether it is feasible with Hadoop (HDFS) and how. 1) We have two datacenters, DCE and DCW, in two distinct locations. A user connected to either datacenter should be able to access his data even when it resides in the opposite datacenter. 2) We should be able to retrieve data from one of the datacenters in case that datacenter fails. My basic requirement is how we can replicate HDFS data across two datacenters. – Ramanuja Dasan Nov 12 '15 at 07:34

3 Answers


Let me explain how Namenode federation works, as described on the Apache website.

NameNode:

[Figure: single-Namenode HDFS architecture]

In order to scale the name service horizontally, federation uses multiple independent Namenodes/namespaces.

The Namenodes are federated; the Namenodes are independent and do not require coordination with each other.

The Datanodes are used as common storage for blocks by all the Namenodes. Each Datanode registers with all the Namenodes in the cluster. Datanodes send periodic heartbeats and block reports. They also handle commands from the Namenodes.

[Figure: HDFS Federation, multiple independent Namenodes sharing a common pool of Datanodes]

In summary:

Namenodes are mutually independent and do not require communication between them. Datanodes can be shared across multiple Namenodes.
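To make the shared-Datanode layout concrete, here is a minimal sketch of the configuration it implies, written as Java Configuration calls rather than hdfs-site.xml entries. The property keys are the standard federation ones; the nameservice names ns1/ns2 and the hosts are invented for illustration.

    import org.apache.hadoop.conf.Configuration;

    public class FederationConfSketch {
        public static Configuration federatedConf() {
            Configuration conf = new Configuration();
            // All namespaces of the federated cluster, declared together.
            conf.set("dfs.nameservices", "ns1,ns2");
            // One independent Namenode per namespace (hosts are placeholders).
            conf.set("dfs.namenode.rpc-address.ns1", "nn1.example.com:8020");
            conf.set("dfs.namenode.rpc-address.ns2", "nn2.example.com:8020");
            // Every Datanode reads this same list and registers with both
            // Namenodes, sending heartbeats and block reports to each.
            return conf;
        }
    }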

To answer your question: no, it is not possible. If a file is written through one Namenode, you have to contact that same Namenode to fetch it; you cannot ask another Namenode, because each one manages only its own namespace.
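A rough client-side sketch of what that means, reusing the hypothetical ns1/ns2 layout above (the file path and hosts are placeholders):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class FederatedReadSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path file = new Path("/user/data/sample.txt");

            // Each Namenode is addressed by its own URI and sees only its own namespace.
            FileSystem fs1 = FileSystem.get(URI.create("hdfs://nn1.example.com:8020"), conf);
            FileSystem fs2 = FileSystem.get(URI.create("hdfs://nn2.example.com:8020"), conf);

            // True if the file was created through nn1.
            System.out.println("ns1 sees it: " + fs1.exists(file));
            // False: nn2 has no record of nn1's files, even though
            // both Namenodes share the same Datanodes underneath.
            System.out.println("ns2 sees it: " + fs2.exists(file));
        }
    }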

Regarding your updated comments on data replication,

When the replication factor is three, HDFS’s placement policy is to put one replica on one node in the local rack, another on a different node in the local rack, and the last on a different node in a different rack - as per official documentation.

You can use this placement behaviour to read the data from the other data centre if you have failures in the local rack. But note that you are still reading through the same federated Namenode, not through a different one.
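If you want to check where the replicas of a file actually landed, the standard FileSystem API exposes that; a small sketch (the path is a placeholder):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicaLocationsSketch {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus status = fs.getFileStatus(new Path("/user/data/sample.txt"));

            // One BlockLocation per block; each lists the Datanodes holding a replica.
            for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
                // getTopologyPaths() includes the rack in each entry,
                // e.g. /rack1/host:port, so you can see whether a
                // replica sits in a remote rack.
                System.out.println(String.join(", ", block.getTopologyPaths()));
            }
        }
    }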

One federated Namenode can't read data from another federated Namenode, but they can share the same set of Datanodes for read and write operations.

EDIT:

Within each federated namespace you can have automatic failover of the Namenode: if the Active Namenode fails, the Standby Namenode takes over its responsibilities.
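On the client side, failover is transparent because you address the logical nameservice rather than an individual Namenode. A minimal sketch, assuming a nameservice called mycluster with Namenodes nn1 and nn2 (all names and hosts are placeholders; in practice these keys live in hdfs-site.xml):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class HaClientSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("dfs.nameservices", "mycluster");
            conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
            conf.set("dfs.namenode.rpc-address.mycluster.nn1", "nn1.example.com:8020");
            conf.set("dfs.namenode.rpc-address.mycluster.nn2", "nn2.example.com:8020");
            // The proxy provider retries against the other Namenode on failover.
            conf.set("dfs.client.failover.proxy.provider.mycluster",
                "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

            // Clients talk to the logical name, so an Active -> Standby
            // failover does not change any client code or URI.
            FileSystem fs = FileSystem.get(URI.create("hdfs://mycluster"), conf);
            System.out.println(fs.getUri());
        }
    }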

Refer to the SE post below for more details.

How does Hadoop Namenode failover process works?

Ravindra babu
  • About "You can't ask other name node": what happens if a Namenode goes down? Do we lose access to some files? How do we know which Namenode knows where the file I'm looking for is located? – ozw1z5rd Oct 05 '16 at 18:16
  • If the Active Namenode is down, the Standby will become the Active Namenode. Refer to this post: http://stackoverflow.com/questions/33311585/how-does-hadoop-namenode-failover-process-works/33313804#33313804 – Ravindra babu Oct 05 '16 at 18:19
  • OK, but where does the Namenode get its information? With Hadoop HA the standby Namenode is always in sync; in this case do we have a Standby node for each federated Namenode? – ozw1z5rd Oct 05 '16 at 18:21
  • Exactly. You should have a Standby Namenode. Just imagine a federation as a group of Active/Standby Namenode pairs sharing a common set of Datanodes. – Ravindra babu Oct 05 '16 at 18:24
  • Tom White's Hadoop: The Definitive Guide states that block placement is the local rack for the first replica and a different rack for the second and third replicas. The documentation says the first and second are on the local rack and the third on a different rack. I wasn't aware of this in the documentation. – ᐅdevrimbaris Dec 26 '16 at 16:25
  • Have a look at grepcode : http://grepcode.com/file_/repo1.maven.org/maven2/org.apache.hadoop/hadoop-hdfs/2.5.0/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java/?v=source – Ravindra babu Dec 26 '16 at 16:32

No. It is not possible to do that.

Tanveer Dayan
  • This does not provide an answer to the question. To critique or request clarification from an author, leave a comment below their post. - [From Review](/review/low-quality-posts/10161266) – Jithin Shaji Nov 11 '15 at 09:22
  • He asked if it is possible. I said it is not. Please review the question. :) – Tanveer Dayan Nov 11 '15 at 17:02
  • Thanks Jithin Shaji for your reply. I have the following requirement; can you please suggest whether it is feasible with Hadoop (HDFS) and how. 1) We have two datacenters, DCE and DCW, in two distinct locations. A user connected to either datacenter should be able to access his data even when it resides in the opposite datacenter. 2) We should be able to retrieve data from one of the datacenters in case that datacenter fails. My basic requirement is how we can replicate HDFS data across two datacenters. – Ramanuja Dasan Nov 12 '15 at 06:56

The default block placement behaviour in Hadoop can be modified by extending the BlockPlacementPolicy abstract class and pointing the dfs.block.replicator.classname property in the Hadoop configuration files at your implementation.

Please research BlockPlacementPolicy to get a better picture.

You can actually modify where your blocks can be placed in the cluster.
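A skeletal sketch of what that looks like. The class name is invented, BlockPlacementPolicyDefault is the stock policy (see the grepcode link in the comments above), and the exact chooseTarget(...) signatures to override differ between Hadoop versions, so treat this only as a starting point:

    import org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault;

    // Hypothetical custom policy. Extending the default policy and overriding
    // its chooseTarget(...) variant lets you steer replicas, e.g. pin one
    // replica to racks that live in the remote datacenter.
    public class CrossDatacenterPlacementPolicy extends BlockPlacementPolicyDefault {
        // Override the chooseTarget(...) method matching your Hadoop version here.
    }

Then point the NameNode at it via the property mentioned above in hdfs-site.xml, e.g. dfs.block.replicator.classname = com.example.CrossDatacenterPlacementPolicy (the package is a placeholder).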

Tanveer Dayan
  • You don't have to use HDFS federation for this. HDFS federation is used only when the namenode data is too large to hold on a single namenode. – Tanveer Dayan Nov 12 '15 at 07:35