I am using Karafka to read from a topic and call an external service. Each call to the external service takes roughly 300 ms, so with 3 consumers (3 pods in Kubernetes) running in the consumer group, I expect to process about 10 events per second (3 consumers × 1/0.3 s ≈ 10). The log lines below also confirm the ~300 ms per-event processing time. However, the overall throughput doesn't add up: each Karafka process seems to be stuck for a long time between processing two batches of events.
Instrumentation around the consume method implies that the consumer code itself is not where the time is going:
https://github.com/karafka/karafka/blob/master/lib/karafka/backends/inline.rb#L12
INFO Inline processing of topic production.events with 8 messages took 2571 ms
INFO 8 messages on production.events topic delegated to xyz
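For reference, the timing I am talking about is essentially this (a minimal sketch assuming the Karafka 1.x batch API with params_batch; EventsConsumer and ExternalService are placeholders for my actual code):

```ruby
class EventsConsumer < Karafka::BaseConsumer
  def consume
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)

    params_batch.each do |params|
      # each call to the external service takes ~300 ms
      ExternalService.call(params)
    end

    elapsed_ms = ((Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000).round
    Karafka.logger.info "Processed #{params_batch.size} messages in #{elapsed_ms} ms"
  end
end
```

The elapsed time logged here matches the "Inline processing ... took N ms" lines, so the time is being lost outside of consume.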
However, I notice two things:
When I tail the logs on the 3 pods, only one of the 3 pods seems to emit logs at a time. This does not make sense to me: all partitions have enough events, and each consumer should be able to consume in parallel.
Although the message above works out to roughly 321 ms (2571 / 8) per event, in reality the logs stall for a long duration between the processing of two batches. I am curious where that time is going.
====== Edit:
There is some skew in the distribution of data across brokers, as we recently expanded from 3 brokers to a total of 6. However, none of the brokers is under CPU or disk pressure. This is a new cluster, and barely 4-5% CPU is used at peak times.
Our data is evenly distributed across the 3 partitions; I say this because the last offset is roughly the same for each partition.
| Partition | First Offset | Last Offset | Size | Leader Node | Replica Nodes | In-sync Replica Nodes | Offline Replica Nodes | Preferred Leader | Under-replicated |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2174152 | 3567554 | 1393402 | 5 | 5,4,3 | 3,4,5 | | Yes | No |
| 1 | 2172222 | 3566886 | 1394664 | 4 | 4,5,6 | 4,5,6 | | Yes | No |
| 2 | 2172110 | 3564992 | 1392882 | 1 | 1,6,4 | 1,4,6 | | Yes | No |
However, I do see that one consumer perpetually lags behind the other two. The following table shows the lag (last offset minus consumer offset) for my consumers; there is one consumer process for each partition:
| Partition | First Offset | Last Offset | Consumer Offset | Lag |
|---|---|---|---|---|
| 0 | 2174152 | 3566320 | 2676120 | 890200 |
| 1 | 2172222 | 3565605 | 3124649 | 440956 |
| 2 | 2172110 | 3563762 | 3185587 | 378175 |

Combined lag: 1709331
Here is a screenshot of the logs from all 3 consumers. Notice the big difference between the time spent in each invocation of the consume function and the interval between two adjacent invocations. Basically, I want to explain and/or reduce that waiting time. There are 100k+ events in this topic, and my dummy Karafka applications (sketched below) can retrieve them quickly, so the Kafka brokers are not the issue.
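By "dummy" application I mean a consumer that does no real work, roughly like this (a sketch assuming the Karafka 1.x params_batch API; the class name is a placeholder):

```ruby
class DummyEventsConsumer < Karafka::BaseConsumer
  def consume
    # No external calls; just log how many messages were fetched so I can
    # see how fast the brokers hand data over.
    Karafka.logger.info "Fetched #{params_batch.size} messages"
  end
end
```

This consumer churns through the backlog quickly, which is why I believe the problem is on the consumer side rather than the brokers.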
Update after setting max_wait_time to 1 second (previously 5 seconds)
It seems the issue is resolved after reducing this wait config. Now the difference between two consecutive log lines is roughly equal to the time spent in consume (the config change itself is sketched after the logs below):
2021-06-24 13:43:23.425 Inline processing of topic x with 7 messages took 2047 ms
2021-06-24 13:43:27.787 Inline processing of topic x with 11 messages took 3347 ms
2021-06-24 13:43:31.144 Inline processing of topic x with 11 messages took 3344 ms
2021-06-24 13:43:34.207 Inline processing of topic x with 10 messages took 3049 ms
2021-06-24 13:43:37.606 Inline processing of topic x with 11 messages took 3388 ms
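For completeness, the change was just the max_wait_time setting in the Karafka setup (a sketch of a Karafka 1.x style config; the client id and broker addresses are placeholders):

```ruby
# karafka.rb
class KarafkaApp < Karafka::App
  setup do |config|
    config.client_id = 'my_app'                           # placeholder
    config.kafka.seed_brokers = %w[kafka://broker1:9092]  # placeholder
    # Maximum time (in seconds) the fetcher may block waiting for new data
    # before returning whatever it already has. Was 5, which matched the
    # long idle gaps between batches; reduced to 1.
    config.kafka.max_wait_time = 1
  end
end
```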