
I have written a single Kafka consumer (using Spring Kafka) that reads from a single topic and is part of a consumer group. Once a message is consumed, the consumer performs all downstream operations and moves on to the next message offset. I have packaged this as a WAR file, and my deployment pipeline pushes it out to a single instance. Using the same pipeline, I could potentially deploy this artifact to multiple instances in my deployment pool.
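
For context, the consumer is essentially the following (a minimal sketch; the topic, group id, and class names are placeholders, not my actual code):

```java
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

@Component
public class MessageConsumer {

    // One listener thread per instance; instances sharing the group id
    // divide the topic's partitions among themselves.
    @KafkaListener(topics = "my-topic", groupId = "my-consumer-group")
    public void onMessage(String payload) {
        // Perform all downstream operations; the container then
        // commits the offset and moves on to the next record.
        process(payload);
    }

    private void process(String payload) {
        // downstream logic
    }
}
```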

However, when I want multiple consumers as part of my infrastructure, I am unsure about the following:

  • I can define multiple instances in my deployment pool and have this WAR running on all of them. All of them would then listen to the same topic as part of the same consumer group and divide the partitions among themselves, and the downstream logic would work as-is. This works perfectly fine for my use case, but I am not sure whether it is the optimal approach to follow.

  • Reading online, I came across resources here and here where people define a single consumer thread but internally create multiple worker threads (a sketch of this pattern follows this list). There are also examples where we could define multiple consumer threads that do the downstream logic. Mapping these approaches to deployment environments, we could achieve the same result as my theoretical solution above, but with fewer machines.
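
For reference, this is roughly the worker-pool pattern those resources describe (a sketch assuming a plain `KafkaConsumer` and an `ExecutorService`; names and settings are illustrative):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class WorkerPoolConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "my-consumer-group");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        ExecutorService workers = Executors.newFixedThreadPool(8);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            while (true) {
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Hand off to the pool: polling continues immediately,
                    // but offsets may be committed before a worker finishes
                    // (this is the risk raised in the answers below).
                    workers.submit(() -> process(record.value()));
                }
            }
        }
    }

    private static void process(String payload) {
        // downstream logic
    }
}
```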

Personally, I think my solution is simple and scalable but might not be optimal, while the second approach might be optimal. I wanted to know your experiences, suggestions, or any other metrics/constraints I should consider. I am also thinking that, with my theoretical solution, I could employ bare-bones machines as Kafka consumers.

While I know I haven't posted any code, please let me know if I need to move this question to another forum. If you need specific code examples, I can provide them, but I didn't think they were important in the context of my question.

user3842182

2 Answers


Your existing solution is best. Handing off to another thread will cause problems with offset management. Spring Kafka allows you to run multiple threads in each instance, as long as you have enough partitions.
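
A minimal sketch of what that looks like with the concurrent container (the bean wiring and concurrency value are illustrative):

```java
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ConsumerFactory;

@Configuration
public class KafkaConsumerConfig {

    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory(
            ConsumerFactory<String, String> consumerFactory) {
        ConcurrentKafkaListenerContainerFactory<String, String> factory =
                new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(consumerFactory);
        // Three child containers, each with its own Consumer and thread;
        // needs at least three partitions to be fully utilized.
        factory.setConcurrency(3);
        return factory;
    }
}
```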

Gary Russell
  • Your suggestion actually hints toward the second solution. In my existing solution, I have a single-threaded consumer deployed to multiple instances. However, if, using Spring Kafka, I can easily define multiple threads within a single WAR and deploy that WAR to multiple instances, then I am optimizing my existing solution. I believe you are referring to using `ConcurrentKafkaListenerContainerFactory` and setting the concurrency based on my topic partitions. Also, since Spring is managing the consumer threads, will there be a cleaner lifecycle for consumer thread management? – user3842182 Apr 30 '18 at 18:47
  • No; that is the wrong interpretation; your second bullet says `where people are defining a single consumer thread, but internally, creating multiple worker threads`. With the concurrent container there is a separate `Consumer` instance (and thread) for each consumer. There is no notion of handing off to a "worker" thread. It is the equivalent of having `n` containers in each WAR (which is actually what happens inside the concurrent container). – Gary Russell Apr 30 '18 at 19:46
  • @GaryRussell I am also implementing a multi-threaded consumer with a thread pool via an executor service, and I am seeing some strange issues. When some of the parallel threads fail while other messages are consumed successfully, all of the unsuccessfully processed messages are lost (data loss), since Kafka only tracks the offset of the last commit. If a newer commit comes in, it resets the lag and treats everything up to that offset as processed (even though some records were killed by another thread). Is there any other way around this, or is the situation still the same? – SOURAV KUMAR Apr 03 '23 at 10:38
  • That is exactly why you should not do it. – Gary Russell Apr 03 '23 at 12:43

If your current approach works, just stick to it. It's the simple and elegant way to go.

You would only go to approach 2 if you cannot, for some reason, increase the number of partitions but need a higher level of parallelism. But then you have ordering and race conditions to worry about. If you ever need to go that route, I'd recommend the akka-stream-kafka library, which provides facilities to handle offset commits correctly, to do what you need in parallel, and then to merge back into a single stream preserving the original ordering, etc. Otherwise, these things are error-prone to do yourself.
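
A rough sketch of that pattern using the library's Java DSL (API names as I recall them from the Alpakka Kafka docs; topic, group, and parallelism values are illustrative):

```java
import java.util.concurrent.CompletableFuture;

import akka.actor.ActorSystem;
import akka.kafka.CommitterSettings;
import akka.kafka.ConsumerSettings;
import akka.kafka.Subscriptions;
import akka.kafka.javadsl.Committer;
import akka.kafka.javadsl.Consumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class StreamConsumer {

    public static void main(String[] args) {
        ActorSystem system = ActorSystem.create("stream-consumer");

        ConsumerSettings<String, String> settings =
                ConsumerSettings.create(system, new StringDeserializer(), new StringDeserializer())
                        .withBootstrapServers("localhost:9092")
                        .withGroupId("my-consumer-group");

        Consumer.committableSource(settings, Subscriptions.topics("my-topic"))
                // Process up to 8 records in parallel; mapAsync re-emits
                // results in the original order, so commits stay ordered.
                .mapAsync(8, msg ->
                        CompletableFuture.supplyAsync(() -> {
                            process(msg.record().value());
                            return msg.committableOffset();
                        }))
                // Commit an offset only after its record has been processed.
                .toMat(Committer.sink(CommitterSettings.create(system)),
                        Consumer::createDrainingControl)
                .run(system);
    }

    private static void process(String payload) {
        // downstream logic
    }
}
```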

Michal Borowiecki
  • Thanks for the information! Right now, since I am also able to define topic partitions, I can be careful and plan ahead with regard to capacity. But yes, I will keep the tool in mind in case I need it in the future. – user3842182 Apr 30 '18 at 18:50