If we are running multi-paxos then a node may see:
Propose(N)
Accept!(N,Vn)
Accept!(N+1,Vm)
Accept!(N+4,Vo) // huh? where is +2, +3?
Accept!(N+5,Vp)
This may be because either:
- There was a stable leader but the network local to this node dropped else delayed +2 and +3.
- There was an outage such that there were two attempts to propose such that +2 and +3 were failed rounds proposals
In general operations on the distributed finite state machine wont commute such that a node should apply all operations in order. This implies that a node needs to be able to distinguish between the two cases. If it is failed rounds of proposals the node has no problems. If it is lost messages it suggests that the node should wait till they turn up else try to recover the lost data (e.g. request a snapshot to reinitialise and catchup).
What are the options or strategies to handle this and what overhead do they create?
This question is inspired by In Paxos, can an Acceptor accept a different value after it has already accepted one?