
Background:

In Section 3 ("Implementing a State Machine") of Lamport's paper Paxos Made Simple, Multi-Paxos is described. Multi-Paxos is used in Google's Paxos Made Live. (Multi-Paxos is used in Apache ZooKeeper.) In Multi-Paxos, gaps can appear:

In general, suppose a leader can get α commands ahead--that is, it can propose commands i + 1 through i + α after commands 1 through i are chosen. A gap of up to α - 1 commands could then arise.
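To make the quoted scenario concrete, here is a small illustrative sketch (not from the paper; the function name is invented): a leader with window α may propose slots i + 1 through i + α concurrently, so if only some of those proposals are chosen before it fails, up to α - 1 unchosen slots are left behind.

```python
def gaps(chosen_slots, up_to):
    """Return the unchosen slot numbers up to and including `up_to`."""
    return [s for s in range(1, up_to + 1) if s not in chosen_slots]

# Slots 1..5 are chosen; a window of alpha = 4 lets the leader run slots 6..9.
# Suppose only slot 9's proposal is chosen before the leader fails:
chosen = set(range(1, 6)) | {9}
print(gaps(chosen, 9))  # -> [6, 7, 8]: a gap of alpha - 1 = 3 commands
```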

Now consider the following scenario:

The whole system uses master-slave architecture. Only the master serves client commands. Master and slaves reach consensus on the sequence of commands via Multi-Paxos. The master is the leader in Multi-Paxos instances. Assume now the master and two of its slaves have the states (commands have been chosen) shown in the following figure:

[Figure: Master and Slaves]

Note that there is more than one gap in the master's state. Due to asynchrony, the two slaves lag behind. At this point, the master fails.

Problem:

  1. What should the slaves do after they have detected the failure of the master (for example, by heartbeat mechanism)?

  2. In particular, how should the slaves handle the gaps and the missing commands relative to the old master's state?

Update about Zab:

As @sbridges has pointed out, ZooKeeper uses Zab instead of Paxos. To quote,

Zab is primarily designed for primary-backup (i.e., master-slave) systems, like ZooKeeper, rather than for state machine replication.

It seems that Zab is closely related to my problems listed above. According to the short overview paper of Zab, the Zab protocol consists of two modes: recovery and broadcast. In recovery mode, two specific guarantees are made: never forgetting committed messages, and letting go of messages that were skipped. My confusion about Zab is:

  1. In recovery mode, does Zab also suffer from the gaps problem? If so, what does Zab do?
hengxin

4 Answers


The gaps should be the Paxos instances that have not reached agreement. In the paper Paxos Made Simple, a gap is filled by proposing a special "no-op" command that leaves the state unchanged.

If you care about the order of chosen values across Paxos instances, you'd better use Zab instead, because Paxos does not preserve causal order: https://cwiki.apache.org/confluence/display/ZOOKEEPER/PaxosRun

The missing commands should be the Paxos instances that have reached agreement but have not yet been learned by the learner. The value is immutable because it has been accepted by a quorum of acceptors. When you run a Paxos instance with that instance id, the value will be recovered via phase 1b and chosen again as the same one.
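The phase 1b recovery rule described above can be sketched as follows (an illustrative sketch; function and variable names are invented). Each phase 1b reply carries the highest-ballot (ballot, value) pair the acceptor has accepted, and the proposer must adopt the value with the highest ballot, so a value already accepted by a quorum is recovered, never changed:

```python
def choose_phase2_value(promises, default):
    """promises: one (ballot, value) pair or None per acceptor in a quorum."""
    accepted = [p for p in promises if p is not None]
    if not accepted:
        return default          # free slot: propose our own value (e.g. a no-op)
    return max(accepted)[1]     # adopt the value with the highest ballot

# Two of three acceptors already accepted "cmd-42" at ballot 7, so it is recovered:
print(choose_phase2_value([(7, "cmd-42"), (7, "cmd-42"), None], "no-op"))  # -> cmd-42
print(choose_phase2_value([None, None, None], "no-op"))                    # -> no-op
```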

When the slaves/followers detect a failure of the leader, or the leader loses the support of a quorum of slaves/followers, they should elect a new leader.

In ZooKeeper, there should be no gaps, as each follower communicates with the leader over TCP, which preserves FIFO ordering.

In recovery mode, after the leader is elected, each follower first synchronizes with the leader and applies the modifications to its state until NEWLEADER is received.

In broadcast mode, the follower queues each PROPOSAL in pendingTxns and waits for the COMMITs in the same order. If the zxid of a COMMIT does not match the zxid at the head of pendingTxns, the follower will exit.

https://cwiki.apache.org/confluence/display/ZOOKEEPER/Zab1.0
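The follower-side check described above can be sketched roughly like this (a simplified illustration; the real ZooKeeper code differs): PROPOSALs are queued in FIFO order, and each COMMIT must match the zxid at the head of the queue or the follower shuts down.

```python
from collections import deque

class Follower:
    def __init__(self):
        self.pending_txns = deque()  # PROPOSALs not yet committed, FIFO
        self.delivered = []          # transactions applied to the state

    def on_proposal(self, zxid, txn):
        self.pending_txns.append((zxid, txn))

    def on_commit(self, zxid):
        # A COMMIT must match the oldest pending PROPOSAL, else the follower exits.
        if not self.pending_txns or self.pending_txns[0][0] != zxid:
            raise RuntimeError("zxid mismatch: follower must exit")
        self.delivered.append(self.pending_txns.popleft()[1])

f = Follower()
f.on_proposal(1, "create /a")
f.on_proposal(2, "set /a v1")
f.on_commit(1)
print(f.delivered)  # -> ['create /a']
```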

hailinzeng
  • Multi-paxos in and of itself doesn't preserve causal order because it is a protocol that doesn't care about the payload (which is important to prevent poison pills). It is easy to add causal order guarantees to the layer above paxos. – Michael Deardeuff Oct 07 '13 at 17:41
  • @MichaelDeardeuff Would you please explain your opinion as an answer? In particular, what do you mean Multi-Paxos is a protocol *that doesn't care about the payload*? And how to add causal order guarantees upon Multi-Paxos? – hengxin Oct 09 '13 at 01:52
  • The Paxos-like protocols don't look into the value sent, just the instance and round ids. In my experience this is a Good Thing because it allows the core consensus pieces of the system to be separated from the business logic. It also allows you to change consensus protocols much more easily, as my team has done recently. We've added causal order guarantees *above* the consensus layer by doing optimistic concurrency checks, much like software transactional memory. That is, if any of the keys have been written to since the paxos instance we've read from, the transaction is aborted. – Michael Deardeuff Oct 10 '13 at 17:06
  • I would suggest reading up on Raft over Zab. I've only heard good things about Raft, and only bad things about Zab and Zookeeper. – Michael Deardeuff Mar 26 '14 at 05:24
  • Here is an explanation of how causal ordering works with the algorithm as described by Lamport https://simbo1905.wordpress.com/2014/10/28/transaction-log-replication-with-paxos/ Whether operations commute or not is a function of the application. If you assume "no" for the generic case of log replication, you simply track whether you have committed the prior slots or not. That's a fairly simple check before you pass the chosen commands in order to the learner. That is described in that post and in the code on GitHub which demonstrates the approach. – simbo1905 Mar 10 '17 at 15:59
  • The link to the ZAB confluence saying "paxos doesn't support causal ordering" is simply a link to someone's buggy understanding of how to implement Paxos for state machine replication with causal ordering. Signing up to that confluence, there is no way to comment on it to point out that it is an erroneous "straw man argument". It's actually quite embarrassing for the author of that page to be saying that they have found a flaw in a seminal algorithm when what they are really showing is that they are unable to apply the algorithm to the most typical use case. – simbo1905 Mar 10 '17 at 16:04

Multi-Paxos is used in Apache ZooKeeper

Zookeeper uses zab, not paxos. See this link for the difference.

In particular, each ZooKeeper node in an ensemble commits updates in the same order as every other node:

Unlike client requests, state updates must be applied in the exact original generation order of the primary, starting from the original initial state of the primary. If a primary fails, a new primary that executes recovery cannot arbitrarily reorder uncommitted state updates, or apply them starting from a different initial state.

sbridges
  • Thanks for your comment. I have corrected it. And zab seems to be more related to my problem. I'll study it. – hengxin Oct 07 '13 at 02:55
  • Does Zab in ZooKeeper also suffer from the gaps problem illustrated in the figure? If so, what does Zab do? Any suggestions or references? Thanks. – hengxin Oct 07 '13 at 09:16
  • I wouldn't look into zab, I would look into raft ( http://www.youtube.com/watch?v=YbZ3zDzDnrw ). And by the way, both raft, zab, and multi-paxos can be reduced to vertical paxos; but raft is much easier to understand, (or so I've heard; I find them all quite simple.) – Michael Deardeuff Oct 07 '13 at 18:06

Specifically, the Zab paper says that a newly elected leader undertakes a discovery phase to learn the next epoch number to set and which node has the most up-to-date commit history. Each follower sends an ACK-E message stating the max contiguous zxid it has seen. The leader then undertakes a synchronisation phase where it transmits to the followers the state they have missed. The paper notes that an interesting optimisation is to only elect a leader which has the most up-to-date commit history.

With Paxos you don't have to allow gaps. If you do allow gaps, then the paper Paxos Made Simple explains how to resolve them from page 9. A new leader knows the last committed value it saw, and possibly some committed values above it. It probes the slots from the lowest gap it knows about by running phase 1 on those slots. If there are values in those slots, it runs phase 2 to fix those values; where it is free to set a value, it sets a no-op value. Eventually it gets to the slot number where no values have been proposed, and it runs as normal.
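A hedged sketch of that recovery loop (function names are invented; `phase1` stands in for a real quorum round-trip): the new leader probes each slot from its lowest gap upward, fixing any value reported in phase 1 and proposing a no-op for free slots.

```python
NO_OP = "no-op"

def recover_slots(log, highest_seen, phase1):
    """log: dict slot -> chosen value; phase1(slot) -> accepted value or None."""
    for slot in range(1, highest_seen + 1):
        if slot in log:
            continue                      # already chosen locally, nothing to do
        accepted = phase1(slot)           # probe a quorum for this slot
        # Phase 2 then fixes the reported value, or a no-op if the slot is free:
        log[slot] = accepted if accepted is not None else NO_OP
    return log

# Slots 2 and 4 are gaps; a quorum reports an accepted value only for slot 4:
log = recover_slots({1: "a", 3: "c", 5: "e"}, 5,
                    lambda s: "d" if s == 4 else None)
print(sorted(log.items()))  # -> [(1, 'a'), (2, 'no-op'), (3, 'c'), (4, 'd'), (5, 'e')]
```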

In answer to your questions:

  1. What should the slaves do after they have detected the failure of the master (for example, by heartbeat mechanism)?

They should attempt to lead after a randomised delay, to reduce the risk of two candidates proposing at the same time, which would waste messages and disk flushes since only one can lead. Randomised leader timeout is well covered in the Raft paper; the same approach can be used for Paxos.
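The randomised-timeout idea can be sketched as follows (the constants are illustrative, not from any paper): each slave waits a random delay after the last heartbeat before attempting to lead, making simultaneous candidacies unlikely.

```python
import random

HEARTBEAT_TIMEOUT_MS = 150

def election_delay_ms(rng=random):
    # Uniform in [150, 300] ms: a spread of about 1-2x the base timeout,
    # so two slaves rarely time out at the same moment.
    return HEARTBEAT_TIMEOUT_MS + rng.uniform(0, HEARTBEAT_TIMEOUT_MS)

delays = [election_delay_ms() for _ in range(5)]
assert all(150 <= d <= 300 for d in delays)
```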

  2. In particular, how to handle the gaps and the missing commands relative to the old master's state?

The new leader should probe the gaps and fix each slot to either the highest value proposed for that slot or else a no-op; once it has filled in the gaps, it can lead as normal.

simbo1905

The answer of @hailinzeng explains the gap problem as follows:

In zookeeper, there should be no gaps as the follower communicates with leader by TCP which keeps FIFO.

To supplement:

In the paper A simple totally ordered broadcast protocol, it mentions that ZooKeeper requires the prefix property:

If $m$ is the last message delivered for a leader $L$, any message proposed before $m$ by $L$ must also be delivered.

This property mainly relies on the TCP mechanism used in Zab. The Zab wiki mentions that an implementation of Zab must satisfy the following assumption (among others):

Servers must process packets in the order that they are received. Since TCP maintains ordering when sending packets, this means that packets will be processed in the order defined by the sender.
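The prefix property quoted above can be checked with a small illustrative sketch (names are invented): under FIFO (TCP-like) delivery from a single leader, the delivered messages always form a prefix of what that leader proposed.

```python
def satisfies_prefix_property(proposed, delivered):
    """True iff `delivered` is a prefix of `proposed` (both from one leader)."""
    return delivered == proposed[:len(delivered)]

print(satisfies_prefix_property(["m1", "m2", "m3"], ["m1", "m2"]))  # -> True
print(satisfies_prefix_property(["m1", "m2", "m3"], ["m1", "m3"]))  # -> False (m2 skipped)
```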

hengxin
  • I am not sure that one can state that TCP eventually gets the packets across correctly; it can just drop the stateful connection, throwing an error to the sender. The master needs to retransmit when the connection drops, by keeping track of which messages have been acknowledged. The other node can spot duplicates on retransmission due to the zxid. – simbo1905 Nov 01 '14 at 12:31