I've read Lamport's paper on Paxos. I've also heard that it isn't used much in practice, for reasons of performance. What algorithms are commonly used for consensus in distributed systems?

- Paxos is used by very (very) large services at Microsoft and Google... – Nathan Howell Jan 04 '10 at 06:40
- Yes, but I'm told that these aren't the basic Paxos that we learned in school, but variants. I'm curious as to exactly which variants of Paxos are in use. – Rob Lachlan Jan 04 '10 at 07:19
- Google published a paper about their Paxos implementation: http://labs.google.com/papers/paxos_made_live.pdf, and Microsoft Research (home of Leslie Lamport, Paxos's inventor) has a little info as well: http://research.microsoft.com/en-us/labs/siliconvalley/groups/distsys.aspx. I think you'll find the actual production versions to be close to standard Paxos. – Nathan Howell Jan 04 '10 at 21:33
10 Answers
Not sure if this is helpful (since it isn't actual production information), but in our "distributed systems" course we studied, along with Paxos, the Chandra-Toueg and Mostefaoui-Raynal algorithms (our professor was especially fond of the latter).

- Found [this paper](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.9126) saying that Chandra-Toueg is roughly as efficient as Paxos. – Dan Apr 30 '13 at 04:58
- Another [interesting read](http://ddg.jaist.ac.jp/pub/US04b.pdf) concludes that Mostefaoui-Raynal scales much worse in some cases (including cases where broadcast messages must be simulated as many point-to-point messages, as in most computer networks). However, it can react to failures much faster on communication architectures where true broadcast messages are available. Also, for 3 or fewer nodes it has one fewer communication step, so in that case it will outperform Chandra-Toueg regardless of communication architecture. – Dan Apr 30 '13 at 05:22
Check out the Raft algorithm, a consensus algorithm that is optimized for ease of understanding and clarity of implementation. Oh... and it is pretty fast as well.
https://ramcloud.stanford.edu/wiki/display/logcabin/LogCabin
https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf
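To see why it's considered easy to follow, here is a minimal, self-contained Go sketch (illustrative only, not from the paper or LogCabin) of Raft's election trigger: a follower that hears no leader heartbeat within a randomized timeout stands for election.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

func main() {
	// Heartbeats (AppendEntries) from the current leader would arrive here;
	// in this sketch nobody sends them, so the timeout always fires.
	heartbeats := make(chan struct{})

	// Raft followers pick a random timeout (the paper suggests 150-300 ms)
	// so that simultaneous candidacies and split votes are rare.
	timeout := time.Duration(150+rand.Intn(150)) * time.Millisecond

	select {
	case <-heartbeats:
		fmt.Println("heard from leader; remain a follower")
	case <-time.After(timeout):
		fmt.Printf("no heartbeat within %v; become candidate and request votes\n", timeout)
	}
}
```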

The Paxos system I run (which supports really, really big web sites) is halfway between Basic Paxos and Multi-Paxos. I plan on moving it to a full Multi-Paxos implementation.
Paxos isn't that great as a high-throughput data storage system, but it excels in supporting those systems by providing leader election. For example, say you have a replicated data store where you want a single master for performance reasons. Your data store nodes will use the Paxos system to choose the master.
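Here is a minimal sketch of that election pattern, with an in-memory stand-in for the Paxos-backed store (the casStore type and TryLead method are invented for illustration): the data-store nodes race to install themselves as master via an atomic compare-and-swap, and everyone converges on the winner.

```go
package main

import (
	"fmt"
	"sync"
)

// casStore stands in for the replicated Paxos service; only its
// compare-and-swap semantics matter for the election pattern.
type casStore struct {
	mu     sync.Mutex
	leader string
}

// TryLead installs candidate as leader only if the slot is empty,
// and reports who the leader is afterwards.
func (s *casStore) TryLead(candidate string) string {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.leader == "" {
		s.leader = candidate
	}
	return s.leader
}

func main() {
	store := &casStore{}
	for _, node := range []string{"node-a", "node-b", "node-c"} {
		fmt.Printf("%s sees leader: %s\n", node, store.TryLead(node))
	}
}
```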
Like Google Chubby, my system is run as a service and can also store data as a configuration container. (I use "configuration" loosely; I hear Google uses Chubby for DNS.) This data doesn't change as often as user input, so it doesn't need high-throughput write SLAs. Reading, on the other hand, is extremely quick because the data is fully replicated and you can read from any node.
Update
Since writing this, I have upgraded my Paxos system. I am now using a chain-consensus protocol as the primary consensus system. The chain system still utilizes Basic-Paxos for re-configuration—including notifying chain nodes when the chain membership changes.

- @simbo1905 Sadly, I'm still waiting for legal approval to publish the paper, though we have had it peer-reviewed. In the meantime you can look at [Chain Replication](http://static.usenix.org/legacy/events/osdi04/tech/full_papers/renesse/renesse.pdf) by Robbert van Renesse and Vertical Paxos by Lamport. – Michael Deardeuff Oct 24 '14 at 08:10
If performance is an issue, consider whether you need all of the strong consistency guarantees Paxos gives you. See e.g. http://queue.acm.org/detail.cfm?id=1466448 and http://incubator.apache.org/cassandra/. Searching on "Paxos optimised" gets me hits, but I suspect that relaxing some of the requirements will buy you more than tuning the protocol.

- I'm aware of the whole BASE approach, but there are situations which require ACID guarantees, and for those situations we need something like Paxos. But you're right, strict consistency isn't right for everything. – Rob Lachlan Jan 04 '10 at 20:06
Paxos is optimal among consensus protocols in terms of performance, at least when measured in network delays (which is often the dominating factor). It's clearly not possible to reliably achieve consensus while tolerating up to f failures without at least one round trip to f other nodes between a client request and the corresponding confirmation (with n = 2f+1 nodes, a quorum is f+1, i.e. the proposer plus f peers), and Paxos achieves this lower bound. This gives a hard bound on the latency of each request to a consensus-based protocol, regardless of implementation. In particular, Raft, Zab, Viewstamped Replication and all other variants of consensus protocols are subject to the same performance constraint.
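As a quick sanity check of that bound, here is the quorum arithmetic for small f (a sketch; the numbers follow directly from n = 2f+1):

```go
package main

import "fmt"

func main() {
	// Tolerating f failures takes n = 2f+1 nodes and quorums of f+1,
	// so the proposer must complete one round trip to f other nodes.
	for f := 1; f <= 3; f++ {
		n := 2*f + 1
		quorum := f + 1
		fmt.Printf("f=%d failures: n=%d nodes, quorum=%d, round trip to %d peers\n",
			f, n, quorum, quorum-1)
	}
}
```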
One thing that can be improved from standard Paxos (also Raft, Zab, ...) is that there is a distinguished leader which ends up doing more than its fair share of the work and may therefore end up being a bit of a bottleneck. There is a protocol known as Egalitarian Paxos which spreads the load out across multiple leaders, although it's mindbendingly complicated IMO, is only applicable to certain domains, and still must obey the lower bound on the number of round-trips within each request. See the paper "There Is More Consensus in Egalitarian Parliaments" by Moraru et al for more details.
When you hear that Paxos is rarely used due to its poor performance, it is frequently meant that consensus itself is rarely used due to poor performance, and this is a fair criticism: it is possible to achieve much higher performance if you can avoid the need for consensus-based coordination between nodes as much as possible, because this allows for horizontal scalability.
Snarkily, it's also possible to achieve better performance by claiming to be using a proper consensus protocol but actually doing something that fails in some cases. Aphyr's blog is littered with examples of these failures not being as rare as you might like, where database implementations have either introduced bugs into good consensus-like protocols by way of "optimisation", or else developed custom consensus-like protocols that fail to be fully correct in some subtle fashion. This stuff is hard.

You should check out the Apache ZooKeeper project. It is used in production by Yahoo! and Facebook, among others.
http://hadoop.apache.org/zookeeper/
If you are looking for academic papers describing it, ZooKeeper is described in a paper at USENIX ATC '10. Its consensus protocol, Zab (a variant of Paxos), is described in a paper at DSN '11.
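For reference, ZooKeeper's standard leader-election recipe looks roughly like this with the community Go client (github.com/go-zookeeper/zk); the server address and paths are placeholders, and a production version would watch the next-lowest znode instead of re-listing the children.

```go
package main

import (
	"fmt"
	"sort"
	"time"

	"github.com/go-zookeeper/zk"
)

func main() {
	conn, _, err := zk.Connect([]string{"127.0.0.1:2181"}, time.Second)
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// Ensure the election parent exists (an "already exists" error is fine).
	conn.Create("/election", nil, 0, zk.WorldACL(zk.PermAll))

	// Ephemeral: the znode vanishes if this process dies.
	// Sequence: ZooKeeper appends a monotonically increasing counter.
	me, err := conn.Create("/election/n_", nil,
		zk.FlagEphemeral|zk.FlagSequence, zk.WorldACL(zk.PermAll))
	if err != nil {
		panic(err)
	}

	// The candidate holding the lowest sequence number is the leader.
	children, _, err := conn.Children("/election")
	if err != nil {
		panic(err)
	}
	sort.Strings(children)
	fmt.Printf("my znode: %s; current leader: /election/%s\n", me, children[0])
}
```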

Google documented how they did fast Paxos for their Megastore in the paper "Megastore: Providing Scalable, Highly Available Storage for Interactive Services" (CIDR 2011).

With Multi-Paxos, when the leader is galloping (streaming accepts in the steady state) it can respond to the client write once it has heard that a majority of nodes have written the value to disk. This is as good and efficient as you can get while maintaining the consistency guarantees that Paxos makes.
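A hypothetical illustration of that steady-state write path (cluster size and names invented here): the leader answers the client as soon as a majority has acknowledged the durable write, without waiting for stragglers.

```go
package main

import "fmt"

func main() {
	const n = 5            // acceptors in the cluster
	const quorum = n/2 + 1 // majority needed before answering the client

	acks := make(chan int, n)

	// Simulate each acceptor durably writing the value and acking.
	for i := 0; i < n; i++ {
		go func(id int) { acks <- id }(i)
	}

	// The leader replies to the client after a majority confirms;
	// remaining acks can arrive later without delaying the response.
	for got := 0; got < quorum; got++ {
		<-acks
	}
	fmt.Printf("replied to client after %d of %d acks\n", quorum, n)
}
```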
Typically, though, people use something Paxos-like such as ZooKeeper as an external service (dedicated cluster) to keep critical information consistent (who has locked what, who is leader, who is in a cluster, what the configuration of the cluster is), and then run a less strict algorithm with weaker consistency guarantees which relies upon application specifics (e.g. vector clocks and merged siblings, sketched below). The short ebook Distributed Systems for Fun and Profit gives a good overview of the alternatives.
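An illustrative sketch (names invented here) of the vector-clock comparison that approach relies on: concurrent updates are detected and surfaced as siblings for the application to merge, rather than being silently overwritten.

```go
package main

import "fmt"

type clock map[string]int // node ID -> events seen from that node

// descends reports whether b has seen everything a has (a happens-before
// or equals b). If neither clock descends from the other, the writes are
// concurrent and both must be kept as siblings.
func descends(a, b clock) bool {
	for node, n := range a {
		if b[node] < n {
			return false
		}
	}
	return true
}

func main() {
	a := clock{"n1": 2, "n2": 1}
	b := clock{"n1": 2, "n2": 2}
	c := clock{"n1": 3, "n2": 1}

	fmt.Println("b supersedes a:", descends(a, b))                          // true
	fmt.Println("b and c concurrent:", !descends(b, c) && !descends(c, b)) // true: keep both
}
```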
Note that lots of databases compete on speed by using risky defaults which risk consistency and can lose data under network partitions. Aphyr's blog series on Jepsen shows whether well-known open-source systems lose data. One cannot cheat the CAP theorem; if you configure systems for safety, they end up doing about the same messaging and the same disk writes as Paxos. So really you cannot say Paxos is slow; you have to say "a part of a system which needs consistency under network partitions requires a minimum number of messages and disk flushes per operation, and that is slow".

There are two general blockchain consensus systems:
- Those that produce unambiguous 100% finality given a defined set of validators
- Those that do not provide 100% finality but instead rely on a high probability of finality
The first generation blockchain consensus algorithms (Proof of Work, Proof of Stake, and BitShares’ Delegated Proof of Stake) only offer high probability of finality that grows with time. In theory someone could pay enough money to mine an alternative “longer” Bitcoin blockchain that goes all the way back to genesis.
More recent consensus algorithms, whether HashGraph, Casper, Tendermint, or DPOS BFT, all adopt long-established principles of Paxos and related consensus algorithms. Under these models it is possible to reach unambiguous finality under all network conditions so long as more than ⅔ of participants are honest.
Objective and unambiguous 100% finality is a critical property for all blockchains that wish to support inter-blockchain communication. Absent 100% finality, a reversion on one chain could have irreconcilable ripple effects across all interconnected chains.
The abstract protocol for these more recent designs involves the following steps (the quorum arithmetic behind them is sketched after the list):
- Propose block
- All participants acknowledge block (pre-commitment)
- All participants acknowledge when ⅔+ have sent them pre-commitments (commitment)
- A block is final once a node has received ⅔+ commitments
- Unanimous agreement on finality is guaranteed unless ⅓+ are bad and evidence of bad behavior is available to all
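The quorum arithmetic behind those thresholds, under the usual BFT assumption of n = 3f+1 validators (a sketch; f is the number of tolerated Byzantine participants):

```go
package main

import "fmt"

func main() {
	// With n = 3f+1 validators, both the pre-commit and commit thresholds
	// are 2f+1, i.e. more than 2/3 of participants. This is exactly what
	// makes finality safe while fewer than 1/3 are faulty.
	for f := 1; f <= 3; f++ {
		n := 3*f + 1
		threshold := 2*f + 1
		fmt.Printf("f=%d byzantine: n=%d validators, 2/3+ threshold=%d\n",
			f, n, threshold)
	}
}
```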
It is the technical differences in the protocols that give rise to real-world impact on user experience. This includes things such as latency until finality, degrees of finality, bandwidth, and proof generation / validation overhead.
Look for more details on Delegated Proof of Stake by EOS here.

Raft is a more understandable and faster alternative to Paxos. One of the most popular distributed systems that uses Raft is etcd, the distributed store used in Kubernetes.
Raft is equivalent to Paxos in fault tolerance.
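As a sketch of how this gets consumed in practice, etcd ships Raft-backed election helpers in its official Go client (go.etcd.io/etcd/client/v3); the endpoint and key prefix below are placeholders.

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// A session ties the election to a lease that expires if we crash.
	session, err := concurrency.NewSession(cli)
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// Campaign blocks until this node becomes leader for the key prefix.
	election := concurrency.NewElection(session, "/demo-election/")
	if err := election.Campaign(context.Background(), "node-1"); err != nil {
		log.Fatal(err)
	}
	log.Println("elected leader; etcd's Raft log makes this decision durable")
}
```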
