
There are two common strategies for keeping replicas in sync: primary-backup replication and quorum-based replication, as stated here:

In primary-backup replication, the leader waits until the write completes on every replica in the group before acknowledging the client. If one of the replicas is down, the leader drops it from the current group and continues to write to the remaining replicas. A failed replica is allowed to rejoin the group if it comes back and catches up with the leader. With f replicas, primary-backup replication can tolerate f-1 failures.

In the quorum-based approach, the leader waits until a write completes on a majority of the replicas. The size of the replica group doesn’t change even when some replicas are down. If there are 2f+1 replicas, quorum-based replication can tolerate f replica failures. If the leader fails, it needs at least f+1 replicas to elect a new leader.
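To make the arithmetic in the quoted passage concrete, here is a minimal sketch (plain Python, purely illustrative, not Kafka or ZooKeeper code) of when the leader may acknowledge a write under each strategy:

```python
# Illustrative sketch only: when a leader may acknowledge a write to the
# client under each replication strategy described above.

def primary_backup_ack(acks_received: int, live_replicas: int) -> bool:
    # Primary-backup: every replica currently in the group must confirm
    # the write before the leader acknowledges the client.
    return acks_received == live_replicas

def quorum_ack(acks_received: int, total_replicas: int) -> bool:
    # Quorum: a simple majority of the (fixed-size) replica group is enough,
    # even if some replicas are down.
    return acks_received > total_replicas // 2

# With 5 replicas, one of which is down:
print(primary_backup_ack(acks_received=4, live_replicas=4))  # True
print(quorum_ack(acks_received=3, total_replicas=5))         # True
```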

I have a question about the statement "If the leader fails, it needs at least f+1 replicas to elect a new leader" in the quorum-based approach. Why is a quorum (majority) of at least f+1 replicas required to elect a new leader? Why not simply select any replica out of the f+1 in-sync replicas (ISR)? Why do we need an election instead of a simple selection?

For the election, how does ZooKeeper pick the final leader out of the remaining replicas? Does it compare which replica was updated most recently? Also, why do I need an odd number (say 3) of ZooKeeper nodes to elect a leader instead of an even number (say 2)?

emilly

1 Answer


Also why do I need the uneven number(say 3) of zookeper to elect a leader instead even number(say 2) ?

In a quorum-based system like ZooKeeper, leader election requires a simple majority of the "ensemble", i.e. the nodes which form the zK cluster. So for a 3-node ensemble, one node failure can be tolerated as long as the remaining two nodes still form a majority and stay operational. In a 4-node ensemble, on the other hand, you need at least 3 nodes alive to form a majority, so it too can tolerate only 1 node failure. A 5-node ensemble can tolerate 2 node failures.

Now you see that both a 3-node and a 4-node cluster can effectively tolerate only 1 node failure, so it makes sense to have an odd number of nodes: for a given number of nodes, an odd-sized ensemble maximises the number of nodes that can be down while the cluster stays operational.
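A quick sketch of that arithmetic (illustrative only):

```python
# Illustrative: simple-majority size and tolerated failures
# for a ZooKeeper ensemble of a given size.

def majority(ensemble_size):
    return ensemble_size // 2 + 1

def tolerated_failures(ensemble_size):
    return ensemble_size - majority(ensemble_size)

for size in (3, 4, 5):
    print(f"ensemble={size}: majority={majority(size)}, "
          f"tolerates {tolerated_failures(size)} failure(s)")
# ensemble=3: majority=2, tolerates 1 failure(s)
# ensemble=4: majority=3, tolerates 1 failure(s)
# ensemble=5: majority=3, tolerates 2 failure(s)
```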

zK leader election relies on a Paxos-like protocol called ZAB. Every write goes through the leader; the leader generates a transaction id (zxid) and assigns it to each write request. The id represents the order in which writes are applied on all replicas. A write is considered successful once the leader receives acks from a majority. An explanation of ZAB.
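For intuition, a zxid is a 64-bit id whose high 32 bits carry the leader epoch and whose low 32 bits carry a per-epoch counter, so comparing zxids gives a total order across leader changes. A simplified illustration (not ZooKeeper source code):

```python
# Simplified illustration of zxid ordering, not ZooKeeper's implementation.
# A zxid is a 64-bit value: high 32 bits = leader epoch, low 32 bits = counter.

def make_zxid(epoch, counter):
    return (epoch << 32) | counter

def split_zxid(zxid):
    return zxid >> 32, zxid & 0xFFFFFFFF

# A write from epoch 2 always orders after any write from epoch 1,
# no matter how large the epoch-1 counter was.
old = make_zxid(epoch=1, counter=100_000)
new = make_zxid(epoch=2, counter=1)
assert new > old
print(split_zxid(new))  # (2, 1)
```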

My question is why quorum(majority) of at f+1 replicas is required to elect a new leader ? Why not any replica out of f+1 in-synch-replica(ISR) is selected ? Why do we need election instead of just simple any selection?

As for why election instead of selection: in general, in a distributed system with eventual consistency, you need an election because there is no easy way to know which of the remaining nodes has the latest data and is therefore qualified to become the leader.

In the case of Kafka, with multiple replicas and ISRs, there could potentially be several nodes whose data is as up to date as the leader's.

Kafka uses ZooKeeper only as an enabler for leader election. If a Kafka partition leader goes down, the Kafka cluster controller is informed of this fact via zK, and the controller chooses one of the ISRs to be the new leader. So you can see that this "election" is different from the leader election in a quorum-based system like zK.
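Roughly speaking, that selection boils down to picking a live replica from the ISR list. A hypothetical sketch (names and logic are illustrative, not the controller's actual code):

```python
# Hypothetical illustration of the controller's selection step, not Kafka code:
# pick the first replica in the ISR that is still alive.

def choose_new_leader(isr, live_brokers):
    for broker_id in isr:
        if broker_id in live_brokers:
            return broker_id
    # No live ISR: the partition stays offline unless unclean election is allowed.
    return None

isr = [2, 3]               # leader 1 just failed; brokers 2 and 3 were in sync
live_brokers = {2, 3, 4}
print(choose_new_leader(isr, live_brokers))  # 2
```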

Which broker among the ISR is "selected" is a bit more complicated (see):

Each replica stores messages in a local log and maintains a few important offset positions in the log. The log end offset (LEO) represents the tail of the log. The high watermark (HW) is the offset of the last committed message. Each log is periodically synced to disks. Data before the flushed offset is guaranteed to be persisted on disks.

So when a leader fails, a new leader is elected as follows (a minimal sketch of these steps appears after the list):

  • Each surviving replica in ISR registers itself in Zookeeper.
  • The replica that registers first becomes the new leader. The new leader chooses its Log End Offset (LEO) as the new High Watermark (HW).
  • Each replica registers a listener in Zookeeper so that it will be informed of any leader change. Every time a replica is notified about a new leader: if the replica is not the new leader (it must be a follower), it truncates its log to its HW and then starts to catch up from the new leader. The leader waits until all surviving replicas in ISR have caught up or a configured time has passed. The leader then writes the current ISR to Zookeeper and opens itself up for both reads and writes.
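Here is the minimal sketch of those steps promised above (hypothetical data structures, not Kafka's implementation; HW is simplified here to a count of committed messages):

```python
# Hypothetical sketch of the recovery steps above; not Kafka's implementation.

class Replica:
    def __init__(self, log, hw):
        self.log = log   # committed + possibly uncommitted messages
        self.hw = hw     # high watermark (simplified: count of committed messages)

    @property
    def leo(self):       # log end offset: tail of the log
        return len(self.log)


def become_leader(replica):
    # Step 2: the new leader takes its own LEO as the new HW.
    replica.hw = replica.leo


def follow(replica, leader):
    # Step 3: a follower truncates its log to its HW, then catches up from the leader.
    replica.log = replica.log[:replica.hw]
    replica.log.extend(leader.log[len(replica.log):])
    replica.hw = leader.hw


new_leader = Replica(log=["m0", "m1", "m2"], hw=2)   # "m2" not yet committed
follower = Replica(log=["m0", "m1", "mX"], hw=2)     # diverged, uncommitted "mX"
become_leader(new_leader)
follow(follower, new_leader)
print(follower.log, follower.hw)  # ['m0', 'm1', 'm2'] 3
```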

Now you can probably appreciate the benefit of a primary-backup model compared to a quorum model: using the above strategy, a 3-node Kafka cluster with 2 ISRs can tolerate 2 node failures at the same time, including a leader failure, and still get a new leader elected (though that new leader would have to reject new writes for a while until one of the failed nodes comes back and catches up with it).

The price to pay is, of course, higher write latency: in a 3-node Kafka cluster with 2 ISRs, the leader has to wait for acknowledgements from both followers before it can acknowledge the write to the client (producer). In a quorum model, by contrast, a write can be acknowledged as soon as one of the followers acknowledges it.

So, depending on the use case, Kafka lets you trade latency for durability. Running with 2 ISRs means somewhat higher write latency but higher durability. If you run with only one ISR and you lose both the leader and the ISR node, you either have no availability or you can choose an unclean leader election, in which case you get lower durability.
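The knobs involved here are the producer's `acks` setting and the topic/broker settings `min.insync.replicas` and `unclean.leader.election.enable`. The values below are just one illustrative combination favouring durability:

```python
# Illustrative configuration values only; apply them through your client
# and broker/topic configuration of choice.

# Producer side: wait for all in-sync replicas to acknowledge each write.
producer_config = {
    "acks": "all",
}

# Topic/broker side: require at least 2 in-sync replicas for writes to succeed,
# and never let an out-of-sync replica become leader.
topic_config = {
    "min.insync.replicas": "2",
    "unclean.leader.election.enable": "false",
}
```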

Update - Leader election and preferred replicas:

All nodes which make up the cluster are already registered in zK. When one of the nodes fails, zK notifies the controller node (which is itself elected via zK). When that happens, one of the live ISRs is selected as the new leader. But Kafka also has the concept of a "preferred replica" to balance leadership distribution across cluster nodes. This is enabled using auto.leader.rebalance.enable=true, in which case the controller will try to hand leadership over to that preferred replica. The preferred replica is the first broker in the partition's replica assignment list. This is all a bit complicated, but only a Kafka admin needs to know about it.
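For illustration only (a hypothetical helper, not the controller's actual code), preferred-replica election amounts to something like:

```python
# Hypothetical sketch: hand leadership back to the preferred replica,
# i.e. the first broker in the partition's replica assignment list,
# provided it is currently in sync.

def preferred_leader(assignment, isr):
    preferred = assignment[0]
    return preferred if preferred in isr else None

assignment = [1, 2, 3]   # replica assignment for the partition
isr = {1, 2, 3}          # all replicas are back in sync
print(preferred_leader(assignment, isr))  # 1
```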

senseiwu
  • Couple of follow-up questions on the answer. 1. You said `Each surviving replica in ISR registers itself in Zookeeper`. Does a surviving replica register only after the leader fails, or at broker start-up? 2. You said `The replica that registers first becomes the new leader`. Consider there are 3 ISRs (I1, I2, I3) and 3 ZooKeepers (Z1, Z2, Z3). Now I1 registers earliest with Z1, I2 registers earliest with Z2, I3 registers earliest with Z3; which replica will be selected as leader? continued.... – emilly Apr 16 '19 at 01:16
  • .. Why not just have, say, two ZooKeepers (one leader and one follower, so that we also have high availability), and the leader ZooKeeper selects the leader based on a criterion (like earliest registration)? Why do we need voting among ZooKeepers here? – emilly Apr 16 '19 at 01:17
  • 1, 2 and 3 - see my update. zK - the idea of replication is always scalability and durability. If you have only one leader and one follower, a leader failure can leave you running with only one node; if that node fails as well, there is complete loss of data and/or availability – senseiwu Apr 16 '19 at 08:59
  • I am not clear on the reply to question 2, i.e. `Consider there are 3 ISR(R1,R2,R3) and 3 zookeeper(Z1,Z2,Z3) . Now R1 registers earliest with Z1, R2 registers earliest with Z2, R3 registers earliest with Z3 then which replica will be selected as leader here ?` Will each ZooKeeper tell its preferred replica here, and then the replica which gets the majority of votes wins? But if that is the case, then each ZooKeeper could vote for a different replica out of R1/R2/R3 and no single replica would win? – emilly Apr 16 '19 at 13:22
  • Z1, Z2, Z3 are "one system" in the eyes of a zK client such as Kafka. All writes to zK go through the zK leader. Kafka has no business singling out Z1, Z2 or Z3 from each other (the zK client library takes care of that) – senseiwu Apr 16 '19 at 13:24
  • Ok, I get your point. Coming back to the main question: why do we need three ZooKeepers (zK) here (Point A), and why do all three need to be involved in voting (Point B)? Let's take Point B first: any ZK could see which ISR broker registered earliest and select that as leader. Why do all three ZKs need to vote and be involved in this process? – emilly Apr 17 '19 at 09:07
  • Now Point A: if we have 3 ZKs with the quorum philosophy, then it can withstand only one ZK failure, since the quorum philosophy says that to keep the system up and running it needs (3/2 + 1 = 2) ZKs up. Even two simple ZKs without any quorum could select the earliest-registered ISR. In that case, if the first ZK goes down, the second ZK can continue as leader ZK. So wouldn't that also tolerate one ZK failure? – emilly Apr 17 '19 at 09:07
  • B) Nowhere did I write that all 3 are *voting* on which ISR is chosen. You are probably confusing zK leader election with Kafka leader election. The Kafka controller gets notified by zK of any changes in broker statuses, and it is the Kafka controller which performs the actual ISR selection. There is NO other voting involved in Kafka ISR or leader election. (A) zK works as a quorum. If one node is down in a 3-node cluster, the remaining two form a quorum. I'd recommend that you understand the basics of the zK architecture from the official doc before extending the comment discussion here. – senseiwu Apr 17 '19 at 10:43
  • Thanks Senseiwu. A single statement, `"Quorum" refers to the minimum number of nodes that must agree on a transaction before it is considered committed.`, from `http://zookeeper-user.578899.n2.nabble.com/Meaning-of-quorum-and-ensemble-td7581147.html` cleared up all my doubts in conjunction with your comment. Somehow I was assuming that wherever a quorum is involved, voting/election is involved, but that's not true. With the above statement I understand that a quorum provides the best of both worlds, i.e. high availability and better write latency. – emilly Apr 17 '19 at 11:07
  • Apart from high availability and better write latency, it takes care of the network-partitioning aspect as well – emilly Apr 17 '19 at 11:58