Mitigating Hadoop's Achilles tendons

Question

I just gave this Hadoop tuorial a read which state that Hadoop has an Achilles' tendon (a single point of failure) in JobTracker:

The JobTracker is a single point of failure for the Hadoop MapReduce service which means if JobTracker goes down, all running jobs are halted.

And also this article which states that the NameNode is a SPOF:

The single point of failure in a Hadoop cluster is the NameNode.

Single points of failure are bad, mkay? What strategies/techniques/tooling/etc. can be done to circumvent these SPOFs and make Hadoop redundant, faul-tolerant and resilient (buzzword alert!)?

Shyam · Answer 1 · 2015-06-30T07:58:57.480

1

There are High availability mechanisms built into Hadoop for a while. 'Secondary NameNode', 'Backup JobTracker' will serve as a hot backup of their respective counterparts.

Most of the past 'SPOF' has been eliminated with recent hadoop releases.

It is explained in the following docs in depth.

Hope it helps.

edited Jun 30 '15 at 07:58

answered Jun 30 '15 at 07:43

Shyam

516
3
7

Thanks @Shyam (+1) - I will read the links when I get a chance, but they seem pretty beefy. On SO, it is generally frowned upon to answer a question by just referencing a link to documentation. For the full bounty, any chance you can edit your answer to contain a high-level summary of the main HA strategies behind "Secondary NameNode" and "Backup JobTracker"? Thanks again! – smeeb Jun 30 '15 at 14:00
I was not really after the bounty. :) But I take your point, I'll update my answer when I have some free time. – Shyam Jul 01 '15 at 02:38

score 1 · Accepted Answer · answered Jul 01 '15 at 05:32

HDFS and Mapreduce are the core components in Hadoop, In the earlier Apache Hadoop releases, Namenode and Jobtracker were SPOF (Only one instance can be configured). This problem is fixed from Hadoop 2.X.

Jobtracker HA.

Jobtracker HA can be achived by configuring 2 Jobtracker(JT) instance in Active - Standby mode on two nodes. If one JT goes down, second Jobtracker will be available to serve the request. Only one jobtracker(Active) will be available for serving request at a time, second JT(Standby ) will be running in Read only mode. Jobtracker HA requires zookeeper instance, Failure over(switching) can be configured as either Manaul or Automcatic. Automatic failover requires another process called Failover Controller (FC). In the current release, if active JT fails, all running jobs will be halted, However new job will automatically be submitted to new JT. This functionality is not available in the current release.

MR2 is the second Generation of mapreduce which uses YARN, Resource Manager(RM) is the master service in YARN, RM can also be configured in Active-Standby mode. RM failure will not impact running Jobs/Application.

Namenode HA

Namenode HA is something important. Namenode HA can also be configured in Active-Standby mode(Maximum 2 namenode instances). Quorum based Journaling is the Widely accepted method, which internally uses zookeeper. Only one namenode will be active at a time.

Secondary Namenode(SNN) is not a Standby Namenode(SN) and vice versa, SNN has a different functionaly in Non HA configuration, Namenode HA set up doesn't require SNN, as SN namenode performs checkpointing (Functionality of SNN)

Processes Namenode HA

Active namenode
Standby namenode
Failover controller : For Fencing to avoid split-brain scenario.
Jounalnodes ( Min 3 instances required) : Namespace modfication will be logged to Journal nodes and Standby namenode reads from there. Only one namenode will be allowed to write at a time inorder to avoide split-brain issue.

Mitigating Hadoop's Achilles tendons

2 Answers2