I'm trying to understand how failure detection works and how its results are spread through gossip.
A multi-jvm test I wrote seems to show that a member that has been detected as failed, but is still reachable via other nodes, does not receive an UnreachableMember(self) event telling it that some member has detected it as unreachable.
The test went as follows:
In a 3-node cluster, node2 detects node3 as unreachable, and node3 detects node2 as unreachable. Then node1 learns (through gossip) that node2 and node3 are unreachable. However, node2 and node3 never learn that they were themselves detected as unreachable.
After some digging in the GossipSpec I found out that the "falsely" unreachable member will reach convergence even though it is noted as Unreachable (see the test "not reach convergence when unreachable").
Is that the reason why I'm not seeing the UnreachableMember(self) event?
My ultimate goal is to detect when a member has been falsely marked as unreachable, that is, when it is unreachable only from one particular member because of a faulty channel between the two.