Spark Worker not joining Master after Master dies and comes back

Question

How often does a Worker ping the Master to check that the Master is alive? Or is it the Master (resource manager) that pings the Workers to check on their liveness and, if any are dead, spawns replacements? Or is it both?

Some info:

Standalone cluster
1 Master: 8 cores, 12 GB
32 workers: 8 cores and 8 GB each

My main problem. Here's what happened:

Master M is running with 32 workers.
Workers 1 and 2 died at 03:55:00, so the cluster is now down to 30 workers.

Worker 1' came up at 03:55:12 AM and connected to M.
Worker 2' came up at 03:55:16 AM and connected to M.

Master M died at 03:56:00 AM.
New master NM came up at 03:56:30 AM.
Workers 1' and 2' DO NOT connect to NM.
The remaining 30 workers connect to NM.

So NM now has 30 workers.

I was wondering why those two won't connect to the new master NM, even though master M is definitely dead.

PS: I have a load balancer (LB) set up in front of the Master, so whenever a new master comes up the LB starts pointing to it.

score 1 · Answer 1 · edited May 23 '17 at 10:30

A load balancer won't solve your problem here. For Spark workers to recognize a new master, you have to configure Spark in high availability mode. Spark standalone supports two HA configurations:

Standby master with ZooKeeper.
Node recovery using file system.

The latter is much simpler, but the directory set in spark.deploy.recoveryDirectory must live on a reliable, distributed file system, unless of course you restart the Master on the same node.

Recovery mode is configured with the spark.deploy.recoveryMode property (NONE by default), which should be set to ZOOKEEPER for standby masters or FILESYSTEM for single-node recovery.
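As a rough sketch of how these properties are usually passed to the Master daemon, they can go into SPARK_DAEMON_JAVA_OPTS in conf/spark-env.sh. The ZooKeeper hosts (zk1, zk2, zk3), the /spark znode and the /shared/spark-recovery path below are placeholders for illustration, not values from the question:

```shell
# conf/spark-env.sh on every Master node.
# Hostnames and paths are placeholders; substitute your own.

# Option 1: standby Masters coordinated through ZooKeeper
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
  -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
  -Dspark.deploy.zookeeper.dir=/spark"

# Option 2: single-node recovery from a directory (use instead of option 1)
# export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=FILESYSTEM \
#   -Dspark.deploy.recoveryDirectory=/shared/spark-recovery"
```

In ZooKeeper mode, workers and applications can typically be given the full list of Masters instead of a single address, for example spark://master1:7077,master2:7077 (hypothetical hostnames), so they can find whichever Master is the current leader.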

More details can be found in the High Availability section of the Spark standalone documentation.

Related: What happens when Spark master fails?
