
What happens when the leader crashes before all followers have updated their commit index?

For example, nodes A, B, and C form the cluster:

  • only A and B are alive, and A is the leader

  • A replicates an entry (call it entry1) to B and gets a successful response from B

  • A commits entry1, then crashes before it sends out the heartbeat message to B (which would have resulted in B updating its commit index)

  • C now comes online

My questions are:

  • would C be elected as the new leader? If so, is entry1 lost? Moreover, if A rejoins later, would its data be inconsistent with the others'?

I know the Raft paper says:

Raft uses a simpler approach where it guarantees that all the committed entries from previous terms are present on each new leader from the moment of its election, without the need to transfer those entries to the leader.

But here entry1 may not be considered a committed entry, because B has not gotten the confirmation from the old leader (the heartbeat). So does C get a chance to become the new leader?

  • if B becomes the new leader, how should it deal with entry1?
– kingluo

1 Answer


It's important to note that an entry is considered committed once it's stored on a majority of servers in the cluster (there are technically some caveats to this, but for this conversation we can assume it's the case), not when a node receives a commit message from the leader. If a commit message were required to consider an entry committed, then every commit would require two round-trips - one for replication and one for commitment - and commit indexes would have to be persisted.
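
To make that concrete, here is a minimal Go sketch (not taken from any real Raft library; the `Leader` struct, `matchIndex`, and `maybeAdvanceCommitIndex` are illustrative names) of how a leader can advance its commit index purely from replication acknowledgements, with no separate commit round-trip:

```go
package main

import (
	"fmt"
	"sort"
)

// Entry is a single log entry.
type Entry struct {
	Term int
	Cmd  string
}

// Leader holds just enough state to show commit-index advancement.
type Leader struct {
	log         []Entry // index 0 is a dummy entry so log indexes start at 1
	commitIndex int
	currentTerm int
	matchIndex  []int // highest log index known to be replicated on each server (self included)
}

// maybeAdvanceCommitIndex recomputes the commit index from matchIndex alone:
// no "commit acknowledged" reply from followers is needed.
func (l *Leader) maybeAdvanceCommitIndex() {
	sorted := append([]int(nil), l.matchIndex...)
	sort.Sort(sort.Reverse(sort.IntSlice(sorted)))

	// sorted[len/2] is the highest index already stored on a majority
	// (at least ⌊n/2⌋+1 servers).
	majorityIndex := sorted[len(sorted)/2]

	// Safety rule: only count replicas for entries from the current term;
	// older entries are committed indirectly once a current-term entry is.
	if majorityIndex > l.commitIndex && l.log[majorityIndex].Term == l.currentTerm {
		l.commitIndex = majorityIndex
	}
}

func main() {
	// A's view just before it crashes: A and B store entry1, C does not.
	a := &Leader{
		log:         []Entry{{}, {Term: 1, Cmd: "entry1"}},
		currentTerm: 1,
		matchIndex:  []int{1, 1, 0},
	}
	a.maybeAdvanceCommitIndex()
	fmt.Println("commitIndex:", a.commitIndex) // 1: entry1 is committed with no extra round-trip
}
```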

Instead, in your scenario, when A crashes and C comes back online, the Raft election algorithm ensures that C cannot be elected leader, so C cannot cause the committed entry to be dropped. Only B can be elected leader, since it has the most up-to-date log. If C tries to get elected, it will only receive a rejected vote from B, since B's log is more up-to-date than C's (it has the committed entry), and C cannot form a majority without that vote. Thus, what you'll see in practice is that B will eventually be elected and will commit all entries from its prior term, at which point that entry is still committed. Even if B were then to crash and A were to recover, A would still have a more up-to-date log than C, so it would again be elected leader.
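
Here is a small, self-contained Go sketch of the section 5.4.1 "at least as up-to-date" comparison that makes B reject C's vote request in this scenario; the types and values are illustrative assumptions, not code from an actual implementation:

```go
package main

import "fmt"

// LogInfo summarizes the last entry of a node's log.
type LogInfo struct {
	LastTerm  int
	LastIndex int
}

// candidateIsUpToDate reports whether the candidate's log is at least as
// up-to-date as the voter's (Raft section 5.4.1): a higher last term wins;
// with equal terms, the longer log wins.
func candidateIsUpToDate(candidate, voter LogInfo) bool {
	if candidate.LastTerm != voter.LastTerm {
		return candidate.LastTerm > voter.LastTerm
	}
	return candidate.LastIndex >= voter.LastIndex
}

func main() {
	b := LogInfo{LastTerm: 1, LastIndex: 1} // B holds entry1
	c := LogInfo{LastTerm: 0, LastIndex: 0} // C came online with an empty log

	fmt.Println("B grants its vote to C:", candidateIsUpToDate(c, b)) // false: C cannot win
	fmt.Println("C grants its vote to B:", candidateIsUpToDate(b, c)) // true:  B can win
}
```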

When (not if) B becomes the leader, it will first ensure that entries from the prior term are stored on a majority of servers before committing any entry from its current term. Typically this is done by committing a no-op entry at the beginning of the new leader's term: the new leader appends a no-op, and once that entry is stored on a majority of servers it increments its commit index and sends the new commit index to all followers. So entry1 will not be lost; the new leader will ensure it is committed.
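
A rough Go sketch of that step, under the assumption that the new leader simply appends a no-op entry in its own term and relies on ordinary replication to commit it (names like `becomeLeader` are made up for illustration):

```go
package main

import "fmt"

// Entry is a single log entry; NoOp marks the leader-change marker entry.
type Entry struct {
	Term int
	NoOp bool
	Cmd  string
}

// Node holds just enough state to show the leadership transition.
type Node struct {
	log         []Entry
	currentTerm int
}

// becomeLeader: on winning an election, the new leader does not commit the
// previous term's entries by counting replicas directly. It appends a no-op
// entry in its own term and replicates it; once the no-op reaches a majority,
// everything before it (entry1 included) is committed along with it.
func (n *Node) becomeLeader(newTerm int) {
	n.currentTerm = newTerm
	n.log = append(n.log, Entry{Term: newTerm, NoOp: true})
	// AppendEntries replication and the commit-index logic from the earlier
	// sketch (not repeated here) then commit the no-op and, with it, entry1.
}

func main() {
	b := &Node{log: []Entry{{Term: 1, Cmd: "entry1"}}, currentTerm: 1}
	b.becomeLeader(2)
	fmt.Printf("B's log after winning term 2: %+v\n", b.log)
}
```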

The caveats to considering an entry stored on a majority of the cluster to be committed are described in both the Raft paper and the Raft dissertation.

– kuujo
  • Thanks for your answer! I am still confused about section 5.4.1. It says "Raft uses the voting process to prevent a candidate from winning an election unless its log contains all committed entries.", but then it defines "up-to-date" in terms of the last entry (even though that entry may not yet be confirmed as replicated on most nodes). The latter definition covers more entries than the former. Why not just state the "up-to-date" definition as the election restriction? – kingluo May 09 '16 at 07:38
  • In my case, if I change the cluster to 5 nodes, and A crashes after it manages to replicate entry1 only to B, then which of the remaining nodes will become the new leader? Is only B qualified, because it has entry1 while C/D/E do not? But entry1 has not yet been replicated to most nodes (only A and B have it), so shouldn't entry1 not be considered a committed entry? – kingluo May 09 '16 at 07:41
  • In that scenario, if this is a 5-node cluster then any node can be elected. The quorum is what enforces commitment. Because the entry hasn't been stored on a majority of servers, it's not committed. Since it's only stored on 2/5 nodes, regardless of which other node starts an election, it can receive successful votes from a majority (3/5) of the cluster, so any node can win an election (see the sketch after these comments). If the entry were stored on a majority of servers, then the majority requirement of elections would ensure that entry is present in any new leader's log. – kuujo May 11 '16 at 05:20
  • That is, if entry1 is only stored on two servers, then any of the three other servers can be elected from votes entirely discounting the two servers on which entry1 is stored, and that's fine since that entry isn't considered committed. In that case, if a leader without entry1 is elected, it will essentially remove that entry from the cluster. If a leader with entry1 is elected, then entry1 will end up being committed. This is where sessions become useful: when sessions are implemented, you can ensure an entry is applied to the state machine exactly once even if it's resubmitted by a client. – kuujo May 11 '16 at 05:22
  • Thanks! I have the correct understanding now. I thought a log entry could be lost until the commit index was updated on all (or at least a majority of) followers, but that's not the case. In fact, if a log entry is actually replicated to a majority of servers, then it's persistent; otherwise, it may be lost. – kingluo May 11 '16 at 07:53
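
For the 5-node case discussed in the comments above, here is a tiny arithmetic sketch (illustrative numbers only) of why entry1 is not committed and why a node without it can still win an election:

```go
package main

import "fmt"

func main() {
	const clusterSize = 5
	majority := clusterSize/2 + 1 // 3

	// entry1 is stored only on A and B.
	nodesWithEntry1 := 2
	fmt.Println("entry1 committed:", nodesWithEntry1 >= majority) // false

	// C, D and E can elect one of themselves without any vote from A or B.
	votesAvailableWithoutEntry1 := clusterSize - nodesWithEntry1 // 3
	fmt.Println("a node without entry1 can win:", votesAvailableWithoutEntry1 >= majority) // true
}
```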