TL;DR - A BFT cluster with 4-5 notary nodes grinds to a halt when one replica is killed.
I ran the notary demo and the Raft cluster (with 3 notary nodes) behaved as expected - when I kill the leader, there's an election and the notary cluster continues to provide a reliable service.
I expect the same thing to happen when I run a BFT cluster (with 4 notary nodes) - killing one of the replicas should not stop the cluster from providing a reliable notary service. However, here is what happens:
1) Start the BFT notary cluster
2) Notarise 10 transactions using gradlew samples:notary-demo:notarise (this succeeds)
3) Stop one of the replicas in the cluster
4) Try to notarise 10 transactions using gradlew samples:notary-demo:notarise
5) Wait for a few minutes; nothing happens (the transactions are not notarised)
6) The terminals of all the remaining replicas keep filling with "re-connecting to replica 1 at /127.0.0.1:11010"
Just to be on the safe side, I decided to add another notary node to the cluster. However, nothing changes - with 5 notary nodes, killing one of them still makes the cluster grind to a halt.
I looked into how BFT-SMaRt works, and as far as I can tell, it should be able to tolerate up to f faulty replicas (including crash-stop failures) as long as N >= 3f + 1. With N = 4 or N = 5 that gives f = 1, so losing a single replica should be survivable.
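To make the arithmetic behind my expectation explicit, here is a minimal Python sketch. The function names are my own and are not part of Corda or BFT-SMaRt. The quorum size uses the generic Byzantine quorum formula ceil((N + f + 1) / 2), which I am assuming is the relevant one here; it only checks whether enough replicas remain, not how the library actually behaves.

```python
import math


def max_faulty(n: int) -> int:
    """Largest f that satisfies n >= 3f + 1."""
    return (n - 1) // 3


def byzantine_quorum(n: int, f: int) -> int:
    """Smallest quorum size such that any two quorums share at least f + 1 replicas.

    Assumption: the generic formula ceil((n + f + 1) / 2), not taken from Corda code.
    """
    return math.ceil((n + f + 1) / 2)


def can_make_progress(n: int, crashed: int) -> bool:
    """True if the replicas still running can form a quorum on their own."""
    f = max_faulty(n)
    return n - crashed >= byzantine_quorum(n, f)


if __name__ == "__main__":
    for n in (4, 5):
        f = max_faulty(n)
        print(f"N={n} f={f} quorum={byzantine_quorum(n, f)} "
              f"survives one crash: {can_make_progress(n, 1)}")
```

By this reasoning both the 4-node and the 5-node cluster should still have a quorum after one replica is stopped, which is why the behaviour above surprises me.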
Is there something I'm missing here? Is it unreasonable to expect a BFT cluster with 4-5 notary nodes to tolerate 1 node dying? Or is this an issue with Corda?