What exactly causes a heartbeat to fail with the message that the group is rebalancing? And what would trigger a rebalance when all the consumers in the group are up?

Thank you.

AloneArtifact

1 Answer

Heartbeats are the basic mechanism to check if all consumers are still up and running. If you get a heartbeat failure because the group is rebalancing, it indicates that your consumer instance took too long to send the next heartbeat and was considered dead and thus a rebalance got triggered.

If you want to prevent this from happening, you can either increase the timeout (session.timeout.ms) or make sure your consumer sends heartbeats more often (heartbeat.interval.ms). Heartbeats are basically embedded in poll(), so you need to make sure you call poll() frequently enough. This can usually be achieved by limiting the number of records a single poll returns via max.poll.records (to shorten the time it takes to process all the data that got fetched).
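As a rough sketch of the three settings mentioned above, here is how they might be collected in a plain Python dict before being handed to whatever client you use. The values are illustrative assumptions, not recommendations, and heartbeat_settings_ok is a hypothetical helper. The one-third rule it checks comes from the Kafka consumer configuration docs, which suggest keeping heartbeat.interval.ms at no more than a third of session.timeout.ms.

```python
# Illustrative values only (assumptions); tune them for your own workload.
consumer_config = {
    "session.timeout.ms": 30_000,     # consumer is considered dead after this long without a heartbeat
    "heartbeat.interval.ms": 10_000,  # how often a heartbeat is sent
    "max.poll.records": 100,          # cap on records returned by a single poll()
}


def heartbeat_settings_ok(config):
    """Hypothetical helper: True if the heartbeat interval is at most
    one third of the session timeout."""
    return config["heartbeat.interval.ms"] * 3 <= config["session.timeout.ms"]


if __name__ == "__main__":
    print(heartbeat_settings_ok(consumer_config))
```

With the values above the check passes; raising heartbeat.interval.ms to 20000 while leaving session.timeout.ms at 30000 would make it fail.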

Update

Since Kafka 0.10.1, heartbeats are sent from a background thread and no longer when poll() is called (cf. https://cwiki.apache.org/confluence/display/KAFKA/KIP-62%3A+Allow+consumer+to+send+heartbeats+from+a+background+thread). In this new design, the session.timeout.ms and heartbeat.interval.ms settings work as before. Additionally, there is max.poll.interval.ms, which sets the maximum time allowed between two calls to poll(). If you fail to call poll() within max.poll.interval.ms, the heartbeat thread assumes that the processing thread died and sends a leave-group request, which triggers a rebalance; the heartbeat thread stops sending heartbeats afterwards. If your processing thread is fine but just slow, the next call to poll() will initiate another rebalance to rejoin the group.
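To make the max.poll.interval.ms constraint concrete, here is a back-of-the-envelope sketch that estimates how large max.poll.records can be before processing one batch risks exceeding the poll interval. The helper name, the per-record processing time, and the safety factor are all assumptions made for illustration; measure your own processing time before relying on numbers like these.

```python
def max_safe_poll_records(max_poll_interval_ms, per_record_ms, safety_factor=0.5):
    """Hypothetical helper: largest batch size whose processing time stays
    within safety_factor * max.poll.interval.ms, assuming every record
    takes roughly per_record_ms to process."""
    if per_record_ms <= 0:
        raise ValueError("per_record_ms must be positive")
    budget_ms = max_poll_interval_ms * safety_factor
    # Always allow at least one record, otherwise the consumer makes no progress.
    return max(1, int(budget_ms // per_record_ms))


if __name__ == "__main__":
    # 300000 ms is the documented default for max.poll.interval.ms;
    # 200 ms per record is an assumed processing cost.
    print(max_safe_poll_records(300_000, 200))
```

If the estimate comes out smaller than your current max.poll.records, you can lower max.poll.records, raise max.poll.interval.ms, or speed up per-record processing.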

For more details, cf. Difference between session.timeout.ms and max.poll.interval.ms for Kafka >= 0.10.1

Matthias J. Sax
  • Can you tell me what should be done in the updated case to avoid the same? – Nobita Mar 19 '20 at 16:12
  • Not sure what you mean? Maybe ask a new question and elaborate in more detail? – Matthias J. Sax Mar 19 '20 at 17:53
  • Can you have a look at this: https://stackoverflow.com/questions/60753274/kafka-consumer-group-rebalancing – Nobita Mar 20 '20 at 04:15
  • If this answer is correct, the error/warning message is very misleading. It should be sth. like: "Rebalancing because Heartbeat failed". If a heartbeat fails because it's rebalancing (+-consumer, upscaling, downscaling) - who cares? – kev Sep 30 '22 at 07:50
  • "Returned in heartbeat requests when the coordinator has begun rebalancing the group. This indicates to the client that it should rejoin the group" sounds to me like "rebalancing -> heartbeat fail" and not the other way round. https://github.com/dpkp/kafka-python/blob/4d598055dab7da99e41bfcceffa8462b32931cdd/kafka/errors.py#L330 – kev Sep 30 '22 at 08:08