
There's something I'm trying to understand about failure detection and how the detection result is gossiped.

A multi-jvm test I wrote seems to show that a member that has been detected as failed, but is still reachable via other nodes, never receives an UnreachableMember(self) event telling it that some other member has marked it as unreachable.

The test went as follows: in a 3-node cluster, node2 detects node3 as unreachable and node3 detects node2 as unreachable. node1 then learns through gossip that node2 and node3 are unreachable. However, node2 and node3 never learn that they were themselves detected as unreachable.
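To make the expectation concrete, here is a minimal sketch of the kind of subscriber that would have to see the event on node2 and node3. It assumes classic Akka actors with akka-cluster on the classpath; the class name SelfUnreachableListener is hypothetical, and this is not the code from the actual test.

```scala
import akka.actor.{Actor, ActorLogging}
import akka.cluster.Cluster
import akka.cluster.ClusterEvent.{InitialStateAsEvents, ReachableMember, UnreachableMember}

// Hypothetical listener (illustration only, not the code from the test).
class SelfUnreachableListener extends Actor with ActorLogging {
  private val cluster = Cluster(context.system)

  // Subscribe to reachability events; the current state is replayed as events.
  override def preStart(): Unit =
    cluster.subscribe(
      self,
      initialStateMode = InitialStateAsEvents,
      classOf[UnreachableMember],
      classOf[ReachableMember])

  override def postStop(): Unit = cluster.unsubscribe(self)

  def receive: Receive = {
    // The case the question is about: an event that refers to this node itself.
    case UnreachableMember(member) if member.address == cluster.selfAddress =>
      log.warning("This node ({}) was marked unreachable by another member", member.address)

    // The case that is actually observed (e.g. on node1 for node2 and node3).
    case UnreachableMember(member) =>
      log.info("Member detected as unreachable: {}", member.address)

    case ReachableMember(member) =>
      log.info("Member reachable again: {}", member.address)
  }
}
```

In the scenario above, only the second and third cases are ever hit; the first case never matches on node2 or node3.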

After some digging in the GossipSpec I found out that the "falsely" unreachable member will reach convergence even though it is noted as Unreachable (see the test "not reach convergence when unreachable").

Is that the reason why I'm not seeing the UnreachableMember(self) event?

My ultimate goal is to let a member detect that it has been falsely marked as unreachable, i.e. that it is unreachable only from one particular member because the channel between the two is faulty.

DennisVDB
