I am using Karafka to read from a topic and call an external service. Each call to the external service takes roughly 300 ms, so with 3 consumers (3 pods in Kubernetes) running in the consumer group, I expect to process about 10 events per second (3 consumers × 1/0.3 s ≈ 10). The log lines below also confirm the ~300 ms per-event processing time. However, the overall throughput doesn't add up: each Karafka process seems to be stuck for a long time between processing two batches of events.
Instrumentation around the consume method implies that the consumer code itself is not where the time is going:
https://github.com/karafka/karafka/blob/master/lib/karafka/backends/inline.rb#L12
INFO Inline processing of topic production.events with 8 messages took 2571 ms
INFO 8 messages on production.events topic delegated to xyz
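For reference, the timing I am talking about is essentially this (a minimal sketch assuming the Karafka 1.x batch API with params_batch; EventsConsumer and ExternalService are placeholders for my actual code):

```ruby
class EventsConsumer < Karafka::BaseConsumer
  def consume
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)

    params_batch.each do |params|
      # each call to the external service takes ~300 ms
      ExternalService.call(params)
    end

    elapsed_ms = ((Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000).round
    Karafka.logger.info "Processed #{params_batch.size} messages in #{elapsed_ms} ms"
  end
end
```

The elapsed time logged here matches the "Inline processing ... took N ms" lines, so the time is being lost outside of consume.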
However, I notice two things:
When I tail the logs on the 3 pods, only one of the 3 pods seems to emit logs at a time. This does not make sense to me: all partitions have enough events, and each consumer should be able to consume in parallel.
Although the message above works out to roughly 321 ms (2571 / 8) per event, in reality the logs stall for a long duration between the processing of two batches. I am curious where that time is going.
====== Edit:
There is some skew in the distribution of data across brokers, as we recently expanded from 3 brokers to a total of 6. However, none of the brokers is under CPU or disk pressure. This is a new cluster, and barely 4-5% CPU is used at peak times.
Our data is evenly distributed across the 3 partitions; I say this because the last offset is roughly the same for each partition.
| Partition | First Offset | Last Offset | Size | Leader Node | Replica Nodes | In-sync Replica Nodes | Offline Replica Nodes | Preferred Leader | Under-replicated |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2174152 | 3567554 | 1393402 | 5 | 5,4,3 | 3,4,5 | | Yes | No |
| 1 | 2172222 | 3566886 | 1394664 | 4 | 4,5,6 | 4,5,6 | | Yes | No |
| 2 | 2172110 | 3564992 | 1392882 | 1 | 1,6,4 | 1,4,6 | | Yes | No |
However, I do see that one consumer perpetually lags behind the other two. The following table shows the lag (last offset minus consumer offset) for my consumers; there is one consumer process for each partition:
| Partition | First Offset | Last Offset | Consumer Offset | Lag |
|---|---|---|---|---|
| 0 | 2174152 | 3566320 | 2676120 | 890200 |
| 1 | 2172222 | 3565605 | 3124649 | 440956 |
| 2 | 2172110 | 3563762 | 3185587 | 378175 |

Combined lag: 1709331
Here is a screenshot of the logs from all 3 consumers. Notice the big difference between the time spent in each invocation of the consume function and the interval between two adjacent invocations. Basically, I want to explain and/or reduce that waiting time. There are 100k+ events in this topic, and my dummy Karafka applications (sketched below) can retrieve them quickly, so the Kafka brokers are not the issue.
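By "dummy" application I mean a consumer that does no real work, roughly like this (a sketch assuming the Karafka 1.x params_batch API; the class name is a placeholder):

```ruby
class DummyEventsConsumer < Karafka::BaseConsumer
  def consume
    # No external calls; just log how many messages were fetched so I can
    # see how fast the brokers hand data over.
    Karafka.logger.info "Fetched #{params_batch.size} messages"
  end
end
```

This consumer churns through the backlog quickly, which is why I believe the problem is on the consumer side rather than the brokers.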
Update after setting max_wait_time to 1 second (previously 5 seconds)
It seems the issue is resolved after reducing this wait config. Now the difference between two consecutive log lines is roughly equal to the time spent in consume (the config change itself is sketched after the logs below):
2021-06-24 13:43:23.425 Inline processing of topic x with 7 messages took 2047 ms
2021-06-24 13:43:27.787 Inline processing of topic x with 11 messages took 3347 ms
2021-06-24 13:43:31.144 Inline processing of topic x with 11 messages took 3344 ms
2021-06-24 13:43:34.207 Inline processing of topic x with 10 messages took 3049 ms
2021-06-24 13:43:37.606 Inline processing of topic x with 11 messages took 3388 ms
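For completeness, the change was just the max_wait_time setting in the Karafka setup (a sketch of a Karafka 1.x style config; the client id and broker addresses are placeholders):

```ruby
# karafka.rb
class KarafkaApp < Karafka::App
  setup do |config|
    config.client_id = 'my_app'                           # placeholder
    config.kafka.seed_brokers = %w[kafka://broker1:9092]  # placeholder
    # Maximum time (in seconds) the fetcher may block waiting for new data
    # before returning whatever it already has. Was 5, which matched the
    # long idle gaps between batches; reduced to 1.
    config.kafka.max_wait_time = 1
  end
end
```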