
Why a separate phase of Leader Election is required in the paper Paxos for System Builders: An Overview instead of using the prepare phase for the leader election? What advantages does this additional phase provide over using the implicit prepare phase?

simbo1905
vaibhav
  • What is “system builder”? Perhaps you can edit the question to provide a link to it or be more specific so that I can find it on a search engine? – simbo1905 Dec 03 '18 at 07:48
  • @simbo1905: You can find the description in this link: http://www.cnds.jhu.edu/pub/papers/psb_ladis_08.pdf. Sorry for the trouble. – vaibhav Dec 04 '18 at 15:02
  • i have edited your question to include the link to the paper so that folks can have a read – simbo1905 Dec 04 '18 at 15:20
  • a very good question. it is a common misunderstanding that paxos doesn't know how to do leader election https://stackoverflow.com/a/46012211/329496 but you already know that it does. personally i like to do it exactly like Paxos Made Simple does as per https://github.com/trex-paxos/trex/wiki/Leader-Elections yet someone much smarter than me likes to put in extra states for leader election, for which they have developed a formal proof of correctness, at https://simbo1905.blog/2017/08/22/pre-voting-in-distributed-consensus/ I look forward to reading the paper and answering. thanks for pointing it out. – simbo1905 Dec 04 '18 at 15:28

1 Answer


You are correct that there isn't any need to implement a specific leader election algorithm in Paxos as described in Paxos Made Simple. You simply randomise timeouts on seeing progress from the leader, and a stable leader will emerge. Nothing else is required: you resend messages, and when you time out on the leader you issue a Prepare. I describe that in detail here.
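
To make that concrete, here is a minimal sketch of leader election by randomised timeouts. It is my own illustration, not code from the paper; the class and method names (`Follower`, `send_prepare`) and the timeout window are assumptions made for the example.

```python
import random
import time

# Sketch (not from the paper): leader election by randomised timeouts.
ELECTION_TIMEOUT_MS = (150, 300)   # each node randomises within this window


class Follower:
    def __init__(self, node_id):
        self.node_id = node_id
        self.ballot = 0
        self.deadline = self._new_deadline()

    def _new_deadline(self):
        # Every node picks a different random timeout, so after a leader
        # failure one node usually times out first and wins unopposed.
        lo, hi = ELECTION_TIMEOUT_MS
        return time.monotonic() + random.uniform(lo, hi) / 1000.0

    def on_leader_progress(self):
        # Any sign of progress from the current leader resets the timer.
        self.deadline = self._new_deadline()

    def tick(self):
        # No progress before the deadline: try to take over by issuing a
        # Prepare (Paxos Phase 1a) with a higher, unique ballot number.
        if time.monotonic() >= self.deadline:
            self.ballot += 1
            self.send_prepare(self.ballot, self.node_id)
            self.deadline = self._new_deadline()

    def send_prepare(self, ballot, proposer):
        # Placeholder for broadcasting Prepare(ballot) to all acceptors.
        print(f"node {proposer} broadcasts Prepare({ballot})")
```

The point is that no extra election protocol exists here: the timeout plus the ordinary Prepare phase is the whole mechanism.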

Yet you can choose to use whatever custom leader election mechanism you prefer, e.g.:

  1. One where you can prove that there will be no endless leader duels
  2. One optimised for mean time to recovery
  3. One optimised to guarantee that recovery happens within a time bound
  4. One optimised for simplicity and minimal code

The key point here is that you can optimise the speed of failure detection and so how quickly clients experience a live system after failures. If you have fast failure detection, your mean time to recovery will be lower. If leader duels are not possible, that helps you guarantee that the system recovers inside some time bound (e.g., in terms of the number of message round trips).

Paxos For System Builders: An Overview states:

Paxos is only as live as its leader election protocol

This confirms the importance of picking a leader election protocol that ensures liveness. They also ask:

What liveness properties does [a Paxos implementation] guarantee?

also

we also bring to light important theoretical differences, related to liveness, that arise from how one specifies the details of Paxos; specifically, our focus is on the choice of leader election algorithm

and

Each leader election protocol requires a different level of stability to remain on a single leader (which is needed to guarantee liveness).

It isn't until the end of the paper that they say:

These details are critical components for building a high-performance, highly-available Paxos-based replication engine.

They are equating highly available with liveness. Given that randomised timeouts make no guarantee about avoiding endless leader duels, they are effectively ruled out of consideration in a paper that is trying to optimise Paxos liveness.

Since the authors don't say why they picked the leader election protocol they did, we are left guessing as to exactly why they think it is best. If we look at their charts, they are all throughput related, not mean time to recovery or maximum time to recovery. I find that disappointing. They miss an opportunity in the paper to share their insights as to why they picked the exact leader election protocol that they did. They could have used something like Jepsen, which tests real open-source databases under combinations of network partitions and crashes. They could then have provided experimental evidence that their leader election algorithm is better than the alternatives under different sorts of crashes or network partitions.

Section 3 gives a strong hint as to their thinking:

we observed that there is a scale of network stability requirements for the overall systems, resulting directly from the choice of leader election protocol. Each leader election protocol requires a different level of stability to remain on a single leader (which is needed to guarantee liveness)

They also say:

Strong L1 requires that progress be made even in the face of a (rapidly) shifting majority. We believe that no Paxos-like algorithm will be able to meet this requirement. If the majority shifts too quickly, then it may never be stable long enough to complete the leader election protocol.

So they discuss both network stability and crash stability, and how different leader election algorithms perform differently based on the types of instability. They explicitly say that no Paxos-like algorithm gives liveness with rapidly shifting partitions. So they are implicitly saying that they are optimising for liveness under reasonably stable network conditions.

That is a reasonable optimisation. Networks can and do behave strangely, but not as often as application processes hit disk errors, memory errors, or bugs that put them into crash loops. You want to use Paxos to ensure that no matter how crazy or bizarre your network partitions are you don't get corruptions. Yet for leader elections, you can probably assume that the network is more stable than the servers running on it. You might pick a leader election mechanism that has a fast time to recovery on a reasonably stable network where processes are frequently crashing and restarting, as in the sketch below.
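
As an illustration of that trade-off (and not the exact protocol the paper uses), a deterministic, failure-detector style election avoids duels entirely when the network is stable: every node considers the lowest-numbered node it believes is alive to be the leader. The heartbeat timeout and node ids here are assumptions for the sketch.

```python
import time

# Sketch of a deterministic heartbeat-based election: the leader is the
# lowest-numbered node believed alive. On a stable network all nodes
# converge on the same choice, so there are no leader duels; the cost is
# extra heartbeat machinery and a dependency on network stability.
HEARTBEAT_TIMEOUT_S = 2.0   # illustrative value


class FailureDetector:
    def __init__(self, my_id, all_ids):
        self.my_id = my_id
        self.last_heartbeat = {nid: time.monotonic() for nid in all_ids}

    def on_heartbeat(self, node_id):
        self.last_heartbeat[node_id] = time.monotonic()

    def alive(self):
        now = time.monotonic()
        return {nid for nid, t in self.last_heartbeat.items()
                if nid == self.my_id or now - t < HEARTBEAT_TIMEOUT_S}

    def leader(self):
        # Deterministic choice: lowest id among nodes believed alive.
        return min(self.alive())
```

When a leader process crash-loops, every surviving node's failure detector drops it at roughly the same time and they all agree on the next lowest id, so recovery time is bounded by the heartbeat timeout rather than by luck with random timers.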

IMHO practical system builders should use random timeouts by default. Randomised timeouts are the simplest thing that can work, but they make no guarantee that there won't be endless leader duels; it is just improbable. Yet randomised timeouts are very attractive in terms of simplicity and minimal code, and so are less likely to be buggy. This is why the Raft algorithm uses them. There are huge numbers of practical algorithms that are stochastic in nature.

Whether you can practically live with randomised timeouts as the simplest thing that can work depends on your use case. It is a very typical use case to only replicate metadata, not the main data, using a strong consistency algorithm. There are many eventually consistent systems that rely upon a strongly consistent configuration but have high-performance direct writes. So it is often possible to build systems where the Paxos leader has crashed but the system is still running and accepting reads or writes with varying guarantees. With such systems, using simple random timeouts with exponential back-off for leader elections may be "good enough". Corfu is an example of a high-performance strongly consistent engine that uses Paxos to maintain a mapping of which write ranges are replicated to which nodes. Yet clients write directly to multiple nodes using chain replication, not via the leader. It's only guaranteed to be correct if every node has a consistent view of cluster membership, yet that changes slowly via Paxos, and high-volume Corfu writes can continue during a Paxos leader election, so randomised timeouts could be good enough.
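
A minimal sketch of the exponential back-off idea mentioned above, under my own assumptions (the constants and function name are illustrative, not taken from Corfu or the paper): each failed attempt to become leader doubles the window the random timeout is drawn from.

```python
import random

BASE_TIMEOUT_MS = 150
MAX_TIMEOUT_MS = 10_000


def next_election_timeout_ms(failed_attempts):
    """Random timeout drawn from a window that doubles on every failure."""
    window = min(BASE_TIMEOUT_MS * (2 ** failed_attempts), MAX_TIMEOUT_MS)
    return random.uniform(window, 2 * window)
```

After a couple of collisions the candidates' windows are wide and far apart, so the chance that two nodes keep timing out together shrinks rapidly, even though it never reaches zero.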

One could say "well, if I can use a custom leader election to recover faster then I will". Yet that custom mechanism is more code to write and debug. So I would say that if you can live with random timeouts rather than a sophisticated leader election mechanism, then you probably should.

simbo1905
  • I understand your point that there is a misconception that Paxos itself cannot be used for leader election. But I think the people involved in Paxos for System Builders are quite familiar with Paxos and have used a separate phase of Leader Election purposely. – vaibhav Dec 07 '18 at 03:05
  • i have taken a harder look at exactly what they wrote and revised my answer to be more explicit – simbo1905 Dec 07 '18 at 22:26
  • @vaibhav i made some updates based on a more detailed analysis of the paper. – simbo1905 Dec 12 '18 at 13:26