
If the NameNode suddenly fails while a job is running in the cluster, what will the status of the job be (failed or killed)?

If it fails, who updates the job status?

How does this work internally?


2 Answers


The standby NameNode becomes the active NameNode through the failover process. Have a look at How does Hadoop Namenode failover process works?

The YARN architecture revolves around the Resource Manager, Node Manager, and Application Master. Jobs continue without any impact from a NameNode failure (assuming HA is configured). If any of the above three processes fails, job recovery depends on the recovery mechanism of that process.

Resource Manager recovery:

With ResourceManager Restart enabled, the RM being promoted (the current standby) to the active state loads the RM's internal state and continues to operate from where the previous active left off, as far as the RM restart feature allows. A new attempt is spawned for each managed application previously submitted to the RM.
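As a rough sketch, RM recovery is typically enabled in yarn-site.xml along these lines (the property names come from the YARN documentation; the ZooKeeper quorum address below is a placeholder, and exact property names can vary between Hadoop versions):

```xml
<!-- yarn-site.xml: enable ResourceManager recovery (illustrative sketch) -->
<property>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <!-- Persist RM state in ZooKeeper so a promoted standby can reload it -->
  <name>yarn.resourcemanager.store.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>
<property>
  <!-- Placeholder ZooKeeper quorum address -->
  <name>yarn.resourcemanager.zk-address</name>
  <value>zk1.example.com:2181</value>
</property>
```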

Application Master recovery:

For MapReduce running on YARN (aka MR2), the MR ApplicationMaster plays the role of a per-job JobTracker. MR AM failure recovery is controlled by the property mapreduce.am.max-attempts, which may be set per job. If its value is greater than 1, then when the ApplicationMaster dies, a new one is spun up for a new application attempt, up to the maximum number of attempts. When a new application attempt is started, in-flight tasks are aborted and rerun, but completed tasks are not rerun.
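For illustration, the cluster-wide default can be set in mapred-site.xml roughly as follows (a sketch; note that the RM-side cap yarn.resourcemanager.am.max-attempts also bounds the effective value):

```xml
<!-- mapred-site.xml: maximum ApplicationMaster attempts for MR jobs (sketch) -->
<property>
  <name>mapreduce.am.max-attempts</name>
  <!-- With a value > 1, a failed AM is relaunched as a new application attempt -->
  <value>2</value>
</property>
```

Per job, the same property can be passed on the command line (e.g. with `-D mapreduce.am.max-attempts=3`), provided the job driver parses generic options.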

Node Manager Recovery:

During recovery, the NM loads the applications' state from the state store. The state for each application indicates whether the application has finished or not. Note that for a finished application no more containers will be launched, but it may still be undergoing log-aggregation. As each application is recovered, a new Application object is created and initialization events are triggered to reinitialize the bookkeeping for the application within the NM.
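A minimal sketch of enabling this in yarn-site.xml (the recovery directory path is a placeholder; the NM should also be given a fixed port via yarn.nodemanager.address so recovered containers can be reclaimed after a restart):

```xml
<!-- yarn-site.xml: enable NodeManager recovery (illustrative sketch) -->
<property>
  <name>yarn.nodemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <!-- Local directory for the NM state store; placeholder path -->
  <name>yarn.nodemanager.recovery.dir</name>
  <value>/var/lib/hadoop-yarn/nm-recovery</value>
</property>
```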

During all these phases, the Job History Server plays a critical role. The status of successfully completed map and reduce tasks is restored from the Job History Server, which prevents re-launching tasks that have already completed successfully.
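For completeness, a sketch of pointing MR jobs at a Job History Server in mapred-site.xml (the hostname is a placeholder; 10020 and 19888 are the conventional default ports):

```xml
<!-- mapred-site.xml: Job History Server endpoints (illustrative sketch) -->
<property>
  <name>mapreduce.jobhistory.address</name>
  <value>historyserver.example.com:10020</value>
</property>
<property>
  <name>mapreduce.jobhistory.webapp.address</name>
  <value>historyserver.example.com:19888</value>
</property>
```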

Have a look at the Resource Manager HA article, the Node Manager restart article, and the YARN HA article.


I'm not completely sure of the following since I haven't tested it out. But it can't hurt to fire up a VM and test it out for yourself.

The NameNode does not handle the status of jobs; that is YARN's responsibility. If the NameNode is not HA and it dies, you will lose your connection to HDFS (and may even suffer data loss). By default, YARN will retry contacting HDFS a few times, then eventually time out and fail the job.
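The retry behaviour mentioned above is governed by the IPC client settings in core-site.xml; a sketch with their usual defaults (values shown are illustrative, not a recommendation):

```xml
<!-- core-site.xml: client retry behaviour when (re)connecting to the NameNode (sketch) -->
<property>
  <!-- Number of connection attempts before the client gives up -->
  <name>ipc.client.connect.max.retries</name>
  <value>10</value>
</property>
<property>
  <!-- Milliseconds to wait between connection attempts -->
  <name>ipc.client.connect.retry.interval</name>
  <value>1000</value>
</property>
```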
