
I have a Kafka consumer application that reads IoT data from a Kafka topic, but I still get the following errors/warnings erratically.

logs

   2019-06-04T23:21:03.58+0530 [APP/PROC/WEB/0] OUT 2019-06-04 17:51:03.583 ERROR 12 --- [ntainer#0-0-C-1] o.a.k.c.c.internals.ConsumerCoordinator  : [Consumer clientId=consumer-2, groupId=newton] Offset commit failed on partition com.newton.forwarding.application.iot.measure.stage-0 at offset 1053164658: The coordinator is not aware of this member.
   2019-06-04T23:21:03.58+0530 [APP/PROC/WEB/0] OUT 2019-06-04 17:51:03.583  WARN 12 --- [ntainer#0-0-C-1] o.a.k.c.c.internals.ConsumerCoordinator  : [Consumer clientId=consumer-2, groupId=newton] Asynchronous auto-commit of offsets {com.newton.forwarding.application.iot.measure.stage-0=OffsetAndMetadata{offset=1053164658, leaderEpoch=null, metadata=''}} failed: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
   2019-06-04T23:21:03.58+0530 [APP/PROC/WEB/0] OUT 2019-06-04 17:51:03.583  WARN 12 --- [ntainer#0-0-C-1] o.a.k.c.c.internals.ConsumerCoordinator  : [Consumer clientId=consumer-2, groupId=newton] Synchronous auto-commit of offsets {com.newton.forwarding.application.iot.measure.stage-0=OffsetAndMetadata{offset=1053167516, leaderEpoch=null, metadata=''}} failed: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.

I have already tried multiple combinations of the max.poll.records and max.poll.interval.ms configurations. I even tried increasing request.timeout.ms, but these errors and warnings don't stop.

Note: I do not have control over the broker, so I cannot try changing session.timeout.ms, as it needs to be within the range of the broker's group.min.session.timeout.ms and group.max.session.timeout.ms configuration.

application.yml

spring:
  kafka:
    consumer:
      group-id: iot
      auto-offset-reset: earliest
      properties:
        fetch.max.wait.ms: 10000
        fetch.min.bytes: 30000000
        retry.backoff.ms: 1000
        max.poll.records: 4000000
        max.poll.interval.ms: 720000
        request.timeout.ms: 900000
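
The listener itself is essentially a Spring Kafka batch listener along the lines of the sketch below (simplified; the class name, topic and value types here are illustrative, not the exact application code):

    import java.util.List;

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    import org.springframework.kafka.annotation.KafkaListener;
    import org.springframework.stereotype.Service;

    @Service
    public class ConsumerServiceImpl {

        private static final Logger log = LoggerFactory.getLogger(ConsumerServiceImpl.class);

        // Assumes a container factory configured for batch listening (factory.setBatchListener(true)),
        // so each invocation receives everything returned by one poll().
        @KafkaListener(topics = "com.newton.forwarding.application.iot.measure.stage")
        public void onMeasures(List<ConsumerRecord<String, String>> records) {
            long start = System.currentTimeMillis();
            // ... transform and forward the measures to the sink ...
            log.info("Time taken(ms) {}. No Of measures: {}",
                    System.currentTimeMillis() - start, records.size());
        }
    }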

Currently the behaviour is erratic, as the logs below show.

logs

   2019-06-04T23:08:03.82+0530 [APP/PROC/WEB/0] OUT 2019-06-04 17:38:03.827 ERROR 12 --- [ntainer#0-0-C-1] o.a.k.c.c.internals.ConsumerCoordinator  : [Consumer clientId=consumer-2, groupId=newton] Offset commit failed on partition com.newton.forwarding.application.iot.measure.stage-0 at offset 1053064069: The coordinator is not aware of this member.
   2019-06-04T23:08:03.82+0530 [APP/PROC/WEB/0] OUT 2019-06-04 17:38:03.827  WARN 12 --- [ntainer#0-0-C-1] o.a.k.c.c.internals.ConsumerCoordinator  : [Consumer clientId=consumer-2, groupId=newton] Asynchronous auto-commit of offsets {com.newton.forwarding.application.iot.measure.stage-0=OffsetAndMetadata{offset=1053064069, leaderEpoch=null, metadata=''}} failed: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
   2019-06-04T23:08:03.82+0530 [APP/PROC/WEB/0] OUT 2019-06-04 17:38:03.827  WARN 12 --- [ntainer#0-0-C-1] o.a.k.c.c.internals.ConsumerCoordinator  : [Consumer clientId=consumer-2, groupId=newton] Synchronous auto-commit of offsets {com.newton.forwarding.application.iot.measure.stage-0=OffsetAndMetadata{offset=1053066926, leaderEpoch=null, metadata=''}} failed: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
   2019-06-04T23:09:43.04+0530 [APP/PROC/WEB/0] OUT 2019-06-04 17:39:43.044  WARN 12 --- [ntainer#0-0-C-1] o.a.k.c.c.internals.ConsumerCoordinator  : [Consumer clientId=consumer-2, groupId=newton] Synchronous auto-commit of offsets {com.newton.forwarding.application.iot.measure.stage-0=OffsetAndMetadata{offset=1053064069, leaderEpoch=null, metadata=''}} failed: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
   2019-06-04T23:10:00.13+0530 [APP/PROC/WEB/0] OUT 2019-06-04 17:40:00.130  INFO 12 --- [ntainer#0-0-C-1] c.s.n.f.a.s.impl.ConsumerServiceImpl     : Time taken(ms) 119. No Of measures: 2857
   2019-06-04T23:10:12.90+0530 [APP/PROC/WEB/0] OUT 2019-06-04 17:40:12.909  INFO 12 --- [ntainer#0-0-C-1] c.s.n.f.a.s.impl.ConsumerServiceImpl     : Time taken(ms) 125. No Of measures: 2893
   2019-06-04T23:10:22.94+0530 [APP/PROC/WEB/0] OUT 2019-06-04 17:40:22.948  INFO 12 --- [ntainer#0-0-C-1] c.s.n.f.a.s.impl.ConsumerServiceImpl     : Time taken(ms) 74. No Of measures: 2880
   2019-06-04T23:10:34.44+0530 [APP/PROC/WEB/0] OUT 2019-06-04 17:40:34.445  INFO 12 --- [ntainer#0-0-C-1] c.s.n.f.a.s.impl.ConsumerServiceImpl     : Time taken(ms) 73. No Of measures: 2862
   2019-06-04T23:10:50.50+0530 [APP/PROC/WEB/0] OUT 2019-06-04 17:40:50.501  INFO 12 --- [ntainer#0-0-C-1] c.s.n.f.a.s.impl.ConsumerServiceImpl     : Time taken(ms) 73. No Of measures: 2866
   2019-06-04T23:10:56.08+0530 [APP/PROC/WEB/0] OUT 2019-06-04 17:40:56.086 ERROR 12 --- [ntainer#0-0-C-1] o.a.k.c.c.internals.ConsumerCoordinator  : [Consumer clientId=consumer-2, groupId=newton] Offset commit failed on partition com.newton.forwarding.application.iot.measure.stage-0 at offset 1053075561: The coordinator is not aware of this member.
   2019-06-04T23:10:56.08+0530 [APP/PROC/WEB/0] OUT 2019-06-04 17:40:56.086  WARN 12 --- [ntainer#0-0-C-1] o.a.k.c.c.internals.ConsumerCoordinator  : [Consumer clientId=consumer-2, groupId=newton] Asynchronous auto-commit of offsets {com.newton.forwarding.application.iot.measure.stage-0=OffsetAndMetadata{offset=1053075561, leaderEpoch=null, metadata=''}} failed: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
   2019-06-04T23:10:56.08+0530 [APP/PROC/WEB/0] OUT 2019-06-04 17:40:56.086  WARN 12 --- [ntainer#0-0-C-1] o.a.k.c.c.internals.ConsumerCoordinator  : [Consumer clientId=consumer-2, groupId=newton] Synchronous auto-commit of offsets {com.newton.forwarding.application.iot.measure.stage-0=OffsetAndMetadata{offset=1053078427, leaderEpoch=null, metadata=''}} failed: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
   2019-06-04T23:11:03.86+0530 [APP/PROC/WEB/0] OUT 2019-06-04 17:41:03.867 ERROR 12 --- [ntainer#0-0-C-1] o.a.k.c.c.internals.ConsumerCoordinator  : [Consumer clientId=consumer-2, groupId=newton] Offset commit failed on partition com.newton.forwarding.application.iot.measure.stage-0 at offset 1053078427: The coordinator is not aware of this member.
   2019-06-04T23:11:05.50+0530 [APP/PROC/WEB/0] OUT 2019-06-04 17:41:05.506  WARN 12 --- [ntainer#0-0-C-1] o.a.k.c.c.internals.ConsumerCoordinator  : [Consumer clientId=consumer-2, groupId=newton] Asynchronous auto-commit of offsets {com.newton.forwarding.application.iot.measure.stage-0=OffsetAndMetadata{offset=1053078427, leaderEpoch=null, metadata=''}} failed: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
   2019-06-04T23:11:33.74+0530 [APP/PROC/WEB/0] OUT 2019-06-04 17:41:33.743  INFO 12 --- [ntainer#0-0-C-1] c.s.n.f.a.s.impl.ConsumerServiceImpl     : Time taken(ms) 79. No Of measures: 2862
   2019-06-04T23:11:45.66+0530 [APP/PROC/WEB/0] OUT 2019-06-04 17:41:45.664  INFO 12 --- [ntainer#0-0-C-1] c.s.n.f.a.s.impl.ConsumerServiceImpl     : Time taken(ms) 109. No Of measures: 2866
   2019-06-04T23:11:56.49+0530 [APP/PROC/WEB/0] OUT 2019-06-04 17:41:56.492  INFO 12 --- [ntainer#0-0-C-1] c.s.n.f.a.s.impl.ConsumerServiceImpl     : Time taken(ms) 75. No Of measures: 2880
   2019-06-04T23:12:08.39+0530 [APP/PROC/WEB/0] OUT 2019-06-04 17:42:08.390  INFO 12 --- [ntainer#0-0-C-1] c.s.n.f.a.s.impl.ConsumerServiceImpl     : Time taken(ms) 90. No Of measures: 2889
   2019-06-04T23:12:15.71+0530 [APP/PROC/WEB/0] OUT 2019-06-04 17:42:15.716 ERROR 12 --- [ntainer#0-0-C-1] o.a.k.c.c.internals.ConsumerCoordinator  : [Consumer clientId=consumer-2, groupId=newton] Offset commit failed on partition com.newton.forwarding.application.iot.measure.stage-0 at offset 1053078427: The request timed out.
   2019-06-04T23:12:25.00+0530 [APP/PROC/WEB/0] OUT 2019-06-04 17:42:25.001  INFO 12 --- [ntainer#0-0-C-1] c.s.n.f.a.s.impl.ConsumerServiceImpl     : Time taken(ms) 80. No Of measures: 2880
   2019-06-04T23:12:43.71+0530 [APP/PROC/WEB/0] OUT 2019-06-04 17:42:43.714  INFO 12 --- [ntainer#0-0-C-1] c.s.n.f.a.s.impl.ConsumerServiceImpl     : Time taken(ms) 97. No Of measures: 2870
   2019-06-04T23:13:02.37+0530 [APP/PROC/WEB/0] OUT 2019-06-04 17:43:02.374  INFO 12 --- [ntainer#0-0-C-1] c.s.n.f.a.s.impl.ConsumerServiceImpl     : Time taken(ms) 121. No Of measures: 2868
   2019-06-04T23:13:21.72+0530 [APP/PROC/WEB/0] OUT 2019-06-04 17:43:21.724  INFO 12 --- [ntainer#0-0-C-1] c.s.n.f.a.s.impl.ConsumerServiceImpl     : Time taken(ms) 99. No Of measures: 2867
   2019-06-04T23:13:42.36+0530 [APP/PROC/WEB/0] OUT 2019-06-04 17:43:42.368  INFO 12 --- [ntainer#0-0-C-1] c.s.n.f.a.s.impl.ConsumerServiceImpl     : Time taken(ms) 101. No Of measures: 2860
   2019-06-04T23:14:01.73+0530 [APP/PROC/WEB/0] OUT 2019-06-04 17:44:01.737  INFO 12 --- [ntainer#0-0-C-1] c.s.n.f.a.s.impl.ConsumerServiceImpl     : Time taken(ms) 145. No Of measures: 2862
   2019-06-04T23:14:19.28+0530 [APP/PROC/WEB/0] OUT 2019-06-04 17:44:19.287  INFO 12 --- [ntainer#0-0-C-1] c.s.n.f.a.s.impl.ConsumerServiceImpl     : Time taken(ms) 118. No Of measures: 2873
   2019-06-04T23:14:37.63+0530 [APP/PROC/WEB/0] OUT 2019-06-04 17:44:37.630  INFO 12 --- [ntainer#0-0-C-1] c.s.n.f.a.s.impl.ConsumerServiceImpl     : Time taken(ms) 104. No Of measures: 2866
   2019-06-04T23:14:55.88+0530 [APP/PROC/WEB/0] OUT 2019-06-04 17:44:55.889  INFO 12 --- [ntainer#0-0-C-1] c.s.n.f.a.s.impl.ConsumerServiceImpl     : Time taken(ms) 117. No Of measures: 2880
   2019-06-04T23:15:12.29+0530 [APP/PROC/WEB/0] OUT 2019-06-04 17:45:12.298  INFO 12 --- [ntainer#0-0-C-1] c.s.n.f.a.s.impl.ConsumerServiceImpl     : Time taken(ms) 203. No Of measures: 2880
   2019-06-04T23:15:31.48+0530 [APP/PROC/WEB/0] OUT 2019-06-04 17:45:31.480  INFO 12 --- [ntainer#0-0-C-1] c.s.n.f.a.s.impl.ConsumerServiceImpl     : Time taken(ms) 105. No Of measures: 2880
   2019-06-04T23:15:51.25+0530 [APP/PROC/WEB/0] OUT 2019-06-04 17:45:51.251  INFO 12 --- [ntainer#0-0-C-1] c.s.n.f.a.s.impl.ConsumerServiceImpl     : Time taken(ms) 176. No Of measures: 2880
   2019-06-04T23:16:06.69+0530 [APP/PROC/WEB/0] OUT 2019-06-04 17:46:06.692  INFO 12 --- [ntainer#0-0-C-1] c.s.n.f.a.s.impl.ConsumerServiceImpl     : Time taken(ms) 157. No Of measures: 2880
   2019-06-04T23:16:23.27+0530 [APP/PROC/WEB/0] OUT 2019-06-04 17:46:23.271  INFO 12 --- [ntainer#0-0-C-1] c.s.n.f.a.s.impl.ConsumerServiceImpl     : Time taken(ms) 110. No Of measures: 2880
   2019-06-04T23:16:39.18+0530 [APP/PROC/WEB/0] OUT 2019-06-04 17:46:39.184  INFO 12 --- [ntainer#0-0-C-1] c.s.n.f.a.s.impl.ConsumerServiceImpl     : Time taken(ms) 88. No Of measures: 2880
   2019-06-04T23:16:58.28+0530 [APP/PROC/WEB/0] OUT 2019-06-04 17:46:58.285  INFO 12 --- [ntainer#0-0-C-1] c.s.n.f.a.s.impl.ConsumerServiceImpl     : Time taken(ms) 108. No Of measures: 2880
   2019-06-04T23:17:17.67+0530 [APP/PROC/WEB/0] OUT 2019-06-04 17:47:17.676  INFO 12 --- [ntainer#0-0-C-1] c.s.n.f.a.s.impl.ConsumerServiceImpl     : Time taken(ms) 141. No Of measures: 2885
   2019-06-04T23:17:36.67+0530 [APP/PROC/WEB/0] OUT 2019-06-04 17:47:36.669  INFO 12 --- [ntainer#0-0-C-1] c.s.n.f.a.s.impl.ConsumerServiceImpl     : Time taken(ms) 107. No Of measures: 2880
   2019-06-04T23:17:53.78+0530 [APP/PROC/WEB/0] OUT 2019-06-04 17:47:53.783  INFO 12 --- [ntainer#0-0-C-1] c.s.n.f.a.s.impl.ConsumerServiceImpl     : Time taken(ms) 344. No Of measures: 2855
   2019-06-04T23:18:12.35+0530 [APP/PROC/WEB/0] OUT 2019-06-04 17:48:12.351  INFO 12 --- [ntainer#0-0-C-1] c.s.n.f.a.s.impl.ConsumerServiceImpl     : Time taken(ms) 67. No Of measures: 2880
   2019-06-04T23:18:29.12+0530 [APP/PROC/WEB/0] OUT 2019-06-04 17:48:29.129  INFO 12 --- [ntainer#0-0-C-1] c.s.n.f.a.s.impl.ConsumerServiceImpl     : Time taken(ms) 109. No Of measures: 2895
   2019-06-04T23:18:46.31+0530 [APP/PROC/WEB/0] OUT 2019-06-04 17:48:46.313  INFO 12 --- [ntainer#0-0-C-1] c.s.n.f.a.s.impl.ConsumerServiceImpl     : Time taken(ms) 131. No Of measures: 2861
   2019-06-04T23:19:03.72+0530 [APP/PROC/WEB/0] OUT 2019-06-04 17:49:03.729  INFO 12 --- [ntainer#0-0-C-1] c.s.n.f.a.s.impl.ConsumerServiceImpl     : Time taken(ms) 116. No Of measures: 2880
   2019-06-04T23:19:22.91+0530 [APP/PROC/WEB/0] OUT 2019-06-04 17:49:22.913  INFO 12 --- [ntainer#0-0-C-1] c.s.n.f.a.s.impl.ConsumerServiceImpl     : Time taken(ms) 113. No Of measures: 2867
   2019-06-04T23:19:40.83+0530 [APP/PROC/WEB/0] OUT 2019-06-04 17:49:40.832  INFO 12 --- [ntainer#0-0-C-1] c.s.n.f.a.s.impl.ConsumerServiceImpl     : Time taken(ms) 118. No Of measures: 2859
   2019-06-04T23:19:58.58+0530 [APP/PROC/WEB/0] OUT 2019-06-04 17:49:58.587  INFO 12 --- [ntainer#0-0-C-1] c.s.n.f.a.s.impl.ConsumerServiceImpl     : Time taken(ms) 106. No Of measures: 2880
   2019-06-04T23:20:16.08+0530 [APP/PROC/WEB/0] OUT 2019-06-04 17:50:16.086  INFO 12 --- [ntainer#0-0-C-1] c.s.n.f.a.s.impl.ConsumerServiceImpl     : Time taken(ms) 89. No Of measures: 2880
   2019-06-04T23:20:35.23+0530 [APP/PROC/WEB/0] OUT 2019-06-04 17:50:35.239  INFO 12 --- [ntainer#0-0-C-1] c.s.n.f.a.s.impl.ConsumerServiceImpl     : Time taken(ms) 163. No Of measures: 2854
   2019-06-04T23:20:55.44+0530 [APP/PROC/WEB/0] OUT 2019-06-04 17:50:55.446  INFO 12 --- [ntainer#0-0-C-1] c.s.n.f.a.s.impl.ConsumerServiceImpl     : Time taken(ms) 214. No Of measures: 2858
   2019-06-04T23:21:03.58+0530 [APP/PROC/WEB/0] OUT 2019-06-04 17:51:03.583 ERROR 12 --- [ntainer#0-0-C-1] o.a.k.c.c.internals.ConsumerCoordinator  : [Consumer clientId=consumer-2, groupId=newton] Offset commit failed on partition com.newton.forwarding.application.iot.measure.stage-0 at offset 1053164658: The coordinator is not aware of this member.

Any advice on resolving this problem is highly appreciated.

P.S.: Should I consider isolating the consumer listener thread from the processing of the data by handing records off to worker threads, as pointed out in this question?

Dishant Kamble
  • It's best not to do your own threading; process the records on the consumer thread. Are you running multiple instances of your app? If so, a rebalance will also occur when instances stop/start. You may want to look in the server logs to see why the rebalance is occurring. – Gary Russell Jun 04 '19 at 19:36
  • You are polling `max.poll.records: 4000000`, and if you don't commit in time it will cause a rebalance and the records will be reassigned. – Paizo Jun 05 '19 at 08:01
  • @GaryRussell, OK, in that case I will set aside my approach of processing records in a separate worker thread. Also, I am currently running only one instance of my consumer application, so rebalancing due to instance stop/start can be ruled out. I can access my application's logs, but those are the same as above and not helpful. The broker is a service managed by a different team. I can try to ask them for access to their logs, but I doubt I will be successful in doing so :( – Dishant Kamble Jun 06 '19 at 04:08
  • @Paizo, yes, max.poll.records is kept quite high for two reasons: 1. We expect a high volume of data published to our Kafka topic, and we need to process it as fast as we can so that our IoT application can give near-realtime insights. 2. Due to a design flaw of the publisher, which publishes data to the broker managed by the other team, data from multiple devices is pushed to the same topic, hence the high max.poll.records. For these reasons, it's becoming difficult for me to find a sweet spot for the consumer configuration. :( – Dishant Kamble Jun 06 '19 at 04:14
  • I don't think higher max.poll.records will help; it is more likely to get you into issues like this one. You can scale by adding multiple consumers, adding compression, and increasing linger time (see the producer-side sketch after these comments). – Paizo Jun 06 '19 at 07:44
  • I cannot add more consumers, as the Kafka topic has a single partition. With regard to adding compression and increasing linger time, can you point me towards any reference I can use to understand them? – Dishant Kamble Jun 06 '19 at 07:55
  • One partition for such a large number of records is a poor design. You will have to bump up the `max.poll.interval.ms`. – Gary Russell Jun 06 '19 at 13:31
  • @GaryRussell, I know. We have raised this design flaw with the respective team, but we don't expect a solution from them anytime soon. So, in the meantime, I am trying to figure out a way in which we can process these records without rebalancing or timeouts, and still get the desired output volume for the drain services so that they don't lag behind in analysing these records further. – Dishant Kamble Jun 07 '19 at 04:22
  • @GaryRussell, what would you suggest setting max.poll.interval.ms to? – Dishant Kamble Jun 07 '19 at 04:23
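
For reference, Paizo's suggestion about compression and linger time maps to producer-side settings; a minimal sketch, assuming a plain Java producer (the actual publisher is owned by another team, and the broker address, topic payload and batch values below are placeholders):

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class CompressedProducerSketch {

        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");            // placeholder address
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
            props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4"); // compress batches on the wire
            props.put(ProducerConfig.LINGER_MS_CONFIG, 50);           // wait up to 50 ms to fill larger batches
            props.put(ProducerConfig.BATCH_SIZE_CONFIG, 262144);      // 256 KB batches

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("com.newton.forwarding.application.iot.measure.stage",
                        "device-1", "{\"measure\":42}"));
            }
        }
    }

Larger, compressed batches reduce the bytes the consumer has to fetch per record, but they do not change the single-partition limit discussed in the answer below.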

1 Answer


You need to determine how long it takes to process 4000000 records and set max.poll.interval.ms to an appropriate value.
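
As a rough illustration of that sizing, using the per-poll figures visible in the question's logs (a back-of-envelope sketch, not a tuned configuration):

    public class MaxPollIntervalSizing {

        public static void main(String[] args) {
            // Figures taken from the "Time taken(ms) ... No Of measures: ..." log lines in the question.
            long observedRecordsPerPoll = 2_900;
            long observedMillisPerPoll = 130;
            long maxPollRecords = 4_000_000;   // the configured max.poll.records

            double millisPerRecord = (double) observedMillisPerPoll / observedRecordsPerPoll; // ~0.045 ms
            long worstCaseBatchMillis = (long) Math.ceil(maxPollRecords * millisPerRecord);   // ~180,000 ms (~3 min)

            // max.poll.interval.ms must exceed the worst-case time to process one poll's worth of records;
            // averages hide slow polls, so add generous headroom (3x here, chosen arbitrarily).
            long suggestedMaxPollIntervalMs = worstCaseBatchMillis * 3;

            System.out.println("worst-case batch processing time (ms): " + worstCaseBatchMillis);
            System.out.println("suggested max.poll.interval.ms       : " + suggestedMaxPollIntervalMs);
        }
    }

The same arithmetic can be run the other way: keep the existing max.poll.interval.ms and cap max.poll.records so that a full batch always fits comfortably inside it.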

It's a poor design to only have one partition for such a large number of records.
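
If the team that owns the topic is able to repartition it, adding partitions is a small operation and would let the consumer group scale out instead of relying on huge polls; a sketch using the Kafka AdminClient (assuming the necessary permissions, which are not available in the situation described above):

    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewPartitions;

    public class AddPartitionsSketch {

        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder

            try (AdminClient admin = AdminClient.create(props)) {
                // Grow the topic to e.g. 6 partitions so up to 6 consumers in the group can share the load.
                // Note: keys that previously hashed to partition 0 may map to other partitions afterwards.
                admin.createPartitions(Collections.singletonMap(
                        "com.newton.forwarding.application.iot.measure.stage",
                        NewPartitions.increaseTo(6)))
                     .all()
                     .get();
            }
        }
    }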

Gary Russell
  • As an observation from the logs, each poll retrieves approx. 2900 records and takes around 130ms to process them to the sink, although the number of records processed per poll will vary in different landscapes based on the number of devices connected. I tried reducing the config as follows, but still no luck: ` spring: kafka: consumer: group-id: newton auto-offset-reset: earliest properties: fetch.min.bytes: 5000000 retry.backoff.ms: 1000 max.poll.records: 450000 max.poll.interval.ms: 300000 request.timeout.ms: 900000 ` – Dishant Kamble Jun 07 '19 at 04:31
  • Additionally, if I bump up max.poll.interval.ms, would that imply that I need to bump up request.timeout.ms as well, such that request.timeout.ms > max.poll.interval.ms? – Dishant Kamble Jun 07 '19 at 04:33
  • The algorithm is quite simple; you must call `poll()` within `max.poll.interval.ms` of the last `poll()`. Spring calls `poll()` immediately after the last message from the previous `poll()` is processed. If it only takes 130ms to process 2900 records, then something else is causing the rebalance. If each one takes 130ms then `377000` should be enough, but you can't use averages. You must consider the worst-case scenario, or limit `max.poll.records` such that you can always process them within `max.poll.interval.ms`. `request.timeout.ms` is unrelated - read the documentation. – Gary Russell Jun 07 '19 at 12:41
  • I guess I know what was causing the rebalancing. I talked to the team managing the broker and found that they use the default configuration, which implies that `group.min.session.timeout.ms` and `group.max.session.timeout.ms` are set to 6 sec and 5 min respectively. In my consumer I never tried a session.timeout.ms of less than 5 min; I tried higher values or let the default take over (the default for session.timeout.ms is 10 sec). I suppose this was causing the frequent rebalancing. – Dishant Kamble Jun 08 '19 at 15:52
  • That being said, the new configurations are `fetch.min.bytes: 1000000 retry.backoff.ms: 2000 max.poll.records: 300000 max.poll.interval.ms: 270000 request.timeout.ms: 300000 session.timeout.ms: 290000 heartbeat.interval.ms: 30000`, which resolved the rebalancing issue, but I still have the request-timed-out issue. Also, with this configuration my poll results are significantly smaller, e.g. `Time taken(ms) 92. No Of measures: 45` :( Any suggestions? – Dishant Kamble Jun 08 '19 at 15:55
  • Is there a way we can update this config without requiring a restart of the Spring application? – nikhilvora Feb 14 '21 at 14:13
  • Don't ask new questions in comments; it's hard to provide a complete answer with code etc. Yes, you can stop the container, update the `kafkaConsumerProperties` in the `containerProperties` and restart the container; if you need more details, ask a new question. – Gary Russell Feb 16 '21 at 14:38
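
A minimal sketch of what that last comment describes (stop the container, update the overridden consumer properties, restart it), assuming a Spring Kafka version where the container exposes its ContainerProperties and the listener was registered with a known id:

    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.springframework.kafka.config.KafkaListenerEndpointRegistry;
    import org.springframework.kafka.listener.MessageListenerContainer;
    import org.springframework.stereotype.Component;

    @Component
    public class ConsumerReconfigurer {

        private final KafkaListenerEndpointRegistry registry;

        public ConsumerReconfigurer(KafkaListenerEndpointRegistry registry) {
            this.registry = registry;
        }

        public void updateMaxPollInterval(String listenerId, long newMaxPollIntervalMs) {
            MessageListenerContainer container = registry.getListenerContainer(listenerId);
            container.stop();                                 // the consumer leaves the group
            Properties overrides = container.getContainerProperties().getKafkaConsumerProperties();
            overrides.setProperty(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG,
                    String.valueOf(newMaxPollIntervalMs));
            container.start();                                // a fresh consumer is created with the new value
        }
    }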