Spark failure detection - why does the datanode not send heartbeats to the master machine (driver)?

Question

As we all know, a heartbeat is a signal sent periodically to indicate that a node is operating normally, or to synchronize it with other parts of the system.

In our system we have 5 worker machines, while executors run on 3 of them.

Our system includes 5 datanode machines (workers) and 3 master machines. The Hadoop version is 2.6.4, and all machines run Red Hat version 7.x.

The Thrift server is installed on the first master machine, master1 (and the driver runs on master1).

In Spark, heartbeats are the messages sent by executors (on the worker machines) to the driver (on the master1 machine). The message is represented by the case class org.apache.spark.Heartbeat.

The message is then received by the driver through the org.apache.spark.HeartbeatReceiver#receiveAndReply(context: RpcCallContext) method.

The main purpose of heartbeats is to check whether a given node is still alive (from the worker machine to the master1 machine).

The driver verifies this at a fixed interval (defined in the spark.network.timeoutInterval entry) by sending an ExpireDeadHosts message to itself. When the message is handled, the driver checks for executors with no recent heartbeats.
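To make the mechanism above concrete, here is a minimal sketch of that bookkeeping in plain Python. It is only a simplified model of the idea, not Spark's actual code: the class name HeartbeatMonitor, its methods, and the injectable clock are hypothetical names chosen for illustration.

```python
import time


class HeartbeatMonitor:
    """Simplified model of the driver-side bookkeeping (hypothetical, not Spark code)."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s  # how long an executor may stay silent
        self.clock = clock          # injectable so the logic can be exercised without waiting
        self.last_seen = {}         # executor id -> time of the last heartbeat

    def receive_heartbeat(self, executor_id):
        # Called whenever a heartbeat arrives: remember when we last heard from it.
        self.last_seen[executor_id] = self.clock()

    def expire_dead_hosts(self):
        # Called at a fixed interval: drop and report executors that were silent too long.
        now = self.clock()
        dead = [e for e, seen in self.last_seen.items() if now - seen > self.timeout_s]
        for e in dead:
            del self.last_seen[e]
        return sorted(dead)
```

In this model, an executor whose heartbeats never reach the driver looks exactly like a dead executor: receive_heartbeat is never called for it, so the next expire_dead_hosts pass after the timeout reports it.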

So far I have explained the concept.

We noticed that the messages sent by the executors cannot be delivered to the driver, and in the YARN logs we can see this warning:

WARN executor.Executor: Issue communicating with driver in heartbeater
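One basic thing that can be checked from a worker machine is whether the driver's RPC port is reachable over TCP at all. The following is a minimal sketch using only the Python standard library; the helper name can_reach is hypothetical, and a successful connection only shows that the port is open, not that Spark's RPC layer is healthy.

```python
import socket


def can_reach(host, port, timeout_s=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        # create_connection resolves the host name and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

It would be called from a worker as can_reach("master1", port), where port is whatever port the driver is actually listening on (the spark.driver.port setting, which is chosen randomly unless configured).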

My question is: what could be the reasons that the driver (master1 machine) does not get the heartbeats from the worker machines?

Did you find an answer for this? – Luis Leal, Mar 19 '20 at 00:31


0 Answers