I am unclear why we need both session.timeout.ms and max.poll.interval.ms, and when we would use one, the other, or both. It seems that both settings indicate the upper bound on the time the coordinator will wait for a heartbeat from a consumer before assuming it is dead.

Also, how do they behave in versions 0.10.1.0+, based on KIP-62?

Jeff Widman
Deeps

1 Answer

Before KIP-62 (i.e., Kafka 0.10.0 and earlier), there was only session.timeout.ms. max.poll.interval.ms was introduced by KIP-62, which is part of Kafka 0.10.1.

KIP-62 decouples heartbeats from calls to poll() by means of a background heartbeat thread. This allows the processing time (i.e., the time between two consecutive poll() calls) to be longer than the heartbeat interval.

Assume that processing a message takes 1 minute. If heartbeats and poll() are coupled (i.e., before KIP-62), you need to set session.timeout.ms larger than 1 minute to prevent the consumer from timing out. However, if the consumer dies, it then also takes longer than 1 minute to detect the failure.
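The trade-off in the coupled case can be put into numbers. This is only a toy model, not Kafka code: the function name `coupled_consumer_outcome` and the 90-second value are made up for illustration, and it assumes a heartbeat is sent exactly once per poll().

```python
def coupled_consumer_outcome(processing_ms, session_timeout_ms):
    """Toy model of a pre-KIP-62 consumer (hypothetical helper, not a Kafka API).

    Heartbeats are assumed to be sent only from poll(), so the gap between
    heartbeats equals the processing time of one poll cycle.
    """
    return {
        # A healthy but slow consumer is evicted if it cannot heartbeat in time.
        "healthy_consumer_evicted": processing_ms >= session_timeout_ms,
        # A crashed consumer is noticed only once the session timeout expires.
        "crash_detection_ms": session_timeout_ms,
    }


# 1 minute of processing with a 30 s session timeout: the consumer times out.
print(coupled_consumer_outcome(60_000, 30_000))
# Raising the timeout to 90 s (assumed value) fixes that, but slows crash detection.
print(coupled_consumer_outcome(60_000, 90_000))
```

With a single timeout you cannot have both long processing and fast crash detection, which is the problem the second setting addresses.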

KIP-62 decouples polling from heartbeating, so heartbeats can be sent between two consecutive polls. There are now two threads running, the heartbeat thread and the processing thread, and KIP-62 therefore introduced a timeout for each: session.timeout.ms applies to the heartbeat thread, while max.poll.interval.ms applies to the processing thread.

Assume you set session.timeout.ms=30000: the consumer's heartbeat thread must then send a heartbeat to the broker before this time expires. On the other hand, if processing a single message takes 1 minute, you can set max.poll.interval.ms larger than one minute to give the processing thread more time to process a message.
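As a rough sketch, the two settings from this example could be written down and sanity-checked like this. The property names are the real consumer settings; the plain dict, the 90-second value for max.poll.interval.ms, and the helper `allows_processing_time` are assumptions for illustration and are not tied to any particular client library.

```python
# Settings from the example above, as plain key/value pairs.
consumer_config = {
    "session.timeout.ms": 30_000,    # heartbeat thread must check in within 30 s
    "max.poll.interval.ms": 90_000,  # assumed value; anything above 60 s would do
}


def allows_processing_time(config, processing_ms):
    """Hypothetical check: does one poll-to-poll cycle fit the poll timeout?"""
    return processing_ms < config["max.poll.interval.ms"]


# Processing takes 1 minute, which is longer than the session timeout but
# still within max.poll.interval.ms, so the consumer stays in the group.
print(allows_processing_time(consumer_config, 60_000))
```

The point of the check is that the processing time is compared against max.poll.interval.ms only; session.timeout.ms can stay small.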

If the processing thread dies, it takes max.poll.interval.ms to detect this. However, if the whole consumer dies (and a dying processing thread most likely crashes the whole consumer including the heartbeat thread), it takes only session.timeout.ms to detect it.
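The two detection times can be summarized in a small lookup. This is a simplified model under stated assumptions: the failure labels and the function `detection_time_ms` are invented here, and the returned values are upper bounds that ignore rebalance duration and network delays.

```python
def detection_time_ms(failure, session_timeout_ms, max_poll_interval_ms):
    """Upper bound on failure detection time in a post-KIP-62 consumer (toy model)."""
    if failure == "whole_consumer":
        # Heartbeats stop too, so the session timeout applies.
        return session_timeout_ms
    if failure == "processing_thread_only":
        # Heartbeats continue; only the missing poll() calls reveal the problem.
        return max_poll_interval_ms
    raise ValueError(f"unknown failure type: {failure}")


print(detection_time_ms("whole_consumer", 30_000, 90_000))
print(detection_time_ms("processing_thread_only", 30_000, 90_000))
```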

The idea is to allow quick detection of a failing consumer even if processing itself takes quite long.

Implementation Detail

The new timeout max.poll.interval.ms is mainly a client-side concept: if poll() is not called within max.poll.interval.ms, the heartbeat thread detects this and sends a leave-group request to the broker. max.poll.interval.ms is still relevant for consumer group rebalances, though: if a rebalance is triggered, consumers have max.poll.interval.ms to re-join the group by calling poll() on the client side, which triggers a join-group request.
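The client-side check described here can be modeled in a few lines. This is a sketch of the idea only, not the actual consumer implementation: the class `HeartbeatDecider`, its method names, and the string results are hypothetical, and time is passed in explicitly instead of running a real background thread.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatDecider:
    """Toy model of what the heartbeat thread decides on each tick."""

    max_poll_interval_ms: int
    last_poll_ms: int = 0

    def record_poll(self, now_ms):
        # Called by the processing thread whenever it calls poll().
        self.last_poll_ms = now_ms

    def next_action(self, now_ms):
        # If poll() has not been called in time, leave the group instead of
        # heartbeating, so the broker can rebalance right away.
        if now_ms - self.last_poll_ms > self.max_poll_interval_ms:
            return "leave_group"
        return "heartbeat"


decider = HeartbeatDecider(max_poll_interval_ms=300_000)
decider.record_poll(0)
print(decider.next_action(100_000))  # poll() was recent enough
print(decider.next_action(300_001))  # poll deadline missed
```

Because the decision is made on the client, the broker does not need to track poll() calls itself; it only sees heartbeats or a leave-group request.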

Iskuskov Alexander
Matthias J. Sax
  • Thanks Matthias, this clears up a lot of the confusion. The fact that `max.poll.interval.ms` was introduced as part of Kafka 0.10.1 wasn't evident. In this case, however, it sounds like `session.timeout.ms` could be replaced with `heartbeat.interval.ms`, as the latter clearly implies what it is meant for, or at least one of these should go away? – Deeps Sep 30 '16 at 19:47
  • If you have a request like this, you need to write to the Kafka dev mailing list. It's a community decision... But I guess keeping `session.timeout.ms` for backward compatibility is a good choice. And "heartbeat.interval.ms" is not perfect because it does not indicate that there is a timeout involved. Maybe "heartbeat.max.interval.ms" would be better (still, using "timeout" in the parameter name is a strong indicator of the semantics, and that would get lost). – Matthias J. Sax Oct 01 '16 at 06:07
  • @MatthiasJ.Sax I have a similar [question](https://stackoverflow.com/questions/44957417/commit-failed-for-offsets-while-committing-offset-asynchronously) based on `session.timeout.ms` in which my consumer is giving exception while committing offsets. I wanted to see if you can help me out. –  Jul 07 '17 at 03:17
  • @MatthiasJ.Sax, I am still not clear why we need both. Let us say that the consumer job is taking a very long time to consume a message, e.g., the consumer sends the message out to a third party via a very slow REST call. The consumer can still send out heartbeats at regular intervals to the broker using a background thread. max.poll.interval.ms seems redundant. – daya Aug 05 '19 at 22:05
  • Assume your consumer dies (or there is a bug with an infinite loop), but the background thread keeps heartbeating. In this case, there would not be any progress, but it would go undetected. Hence, `max.poll.interval.ms` is a health check for your main processing thread. Having both configs allows you to detect "hard failures" (both heartbeat and main thread die) quickly, and simplifies your code for long processing (with a single config you have either a long detection time or complex code to trigger heartbeats during processing "manually"). – Matthias J. Sax Aug 05 '19 at 22:49
  • @MatthiasJ.Sax Are you saying that if one message processing gets into an infinite loop/ takes more time than max.poll.interval.ms, there will be heartbeats from the background thread but not useful? And due to infinite loop, session.timeout.ms would be expired which would make group-protocol to assume main thread is dead? – Gibbs Jul 28 '21 at 12:48
  • And if I set max.poll.records to 500, max.poll.interval.ms=5minutes, then all records should be processed within 5minutes. Otherwise new consumer will be assigned by removing the slow consumer(In case of kafka streams, is it stream thread)? What's the use of heartbeat interval ms then? I am much confused. – Gibbs Jul 28 '21 at 13:00
  • Technically, if you process a record for longer than `max.poll.interval.ms` (i.e., you don't call `poll()` within `max.poll.interval.ms`), the heartbeat thread will stop sending heartbeats and will instead send a "leave group request" to trigger a rebalance. Thus the session timeout does not apply in this case. For Kafka Streams, it's more complex: internally, KS tries to call `poll()` regularly, even if not all records are processed yet, to avoid dropping out of the group. The heartbeat interval is for the case that the _whole_ application crashes. – Matthias J. Sax Jul 28 '21 at 14:40