How to run hundreds of Kafka consumers on the same machine?

Question

In Kafka docs, it is mentioned that the consumers are not Thread-Safe. To avoid this problem, I read that it is a good idea to run a consumer for every Java process. How can this be achieved?

The number of consumers is not defined, but can change according to need.

Thanks, Alessio

score 1 · Answer 1 · edited Jun 20 '20 at 09:12

You're right that the documentation specifies that Kafka consumers are not thread-safe. However, it also says that you should run consumers on separate threads, not processes. That's quite different. See here for an answer with more specifics, geared towards Java/JVM: https://stackoverflow.com/a/15795159/236528

In general, you can have as many consumers as you want on a Kafka topic. Some of these might share a group id, in which case the partitions for that topic are distributed across the consumers of that group that are active at any point in time, with each partition going to exactly one of them.
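To make the partition distribution concrete, here is a minimal sketch in plain Java. It does not use the Kafka client at all: `PartitionAssignmentSketch` and `assign` are hypothetical names, and the round-robin rule is an assumption chosen for simplicity. The real assignment depends on the assignor configured through `partition.assignment.strategy` and may look different.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PartitionAssignmentSketch {

    // Hypothetical helper: spreads partitions 0..partitionCount-1 over
    // consumers 0..consumerCount-1 in round-robin order.
    static Map<Integer, List<Integer>> assign(int partitionCount, int consumerCount) {
        if (consumerCount <= 0) {
            throw new IllegalArgumentException("need at least one consumer");
        }
        Map<Integer, List<Integer>> assignment = new TreeMap<>();
        for (int c = 0; c < consumerCount; c++) {
            assignment.put(c, new ArrayList<>());
        }
        for (int p = 0; p < partitionCount; p++) {
            // Each partition goes to exactly one consumer of the group.
            assignment.get(p % consumerCount).add(p);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // 6 partitions shared by 4 consumers in one group.
        System.out.println(assign(6, 4)); // {0=[0, 4], 1=[1, 5], 2=[2], 3=[3]}
        // 3 partitions, 5 consumers: two consumers get nothing to read.
        System.out.println(assign(3, 5)); // {0=[0], 1=[1], 2=[2], 3=[], 4=[]}
    }
}
```

The second call shows why adding consumers to a group beyond the number of partitions does not add throughput: the extra consumers simply sit idle.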

There's much more detail in the Javadoc for the Kafka Consumer, linked at the bottom of this answer, but I have copied below the two thread/consumer models suggested by the documentation.

1. One Consumer Per Thread

A simple option is to give each thread its own consumer instance. Here are the pros and cons of this approach:

PRO: It is the easiest to implement

PRO: It is often the fastest as no inter-thread co-ordination is needed

PRO: It makes in-order processing on a per-partition basis very easy to implement (each thread just processes messages in the order it receives them).

CON: More consumers means more TCP connections to the cluster (one per thread). In general Kafka handles connections very efficiently so this is generally a small cost.

CON: Multiple consumers means more requests being sent to the server and slightly less batching of data which can cause some drop in I/O throughput.

CON: The number of total threads across all processes will be limited by the total number of partitions.
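Here is a rough sketch of the one-consumer-per-thread shape, using only the JDK so it runs without a broker. `StubConsumer` is a hypothetical stand-in, not part of any Kafka library; in a real application each worker would instead build its own `KafkaConsumer`, call `subscribe`, and loop on `poll`. The point is only that the consumer object is created and used inside a single thread and never shared.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class ConsumerPerThreadSketch {

    // Hypothetical stand-in for a consumer. Like the real one, it is NOT thread-safe.
    static final class StubConsumer {
        private final List<String> records;
        private int position = 0; // plain field: fine because only one thread touches it

        StubConsumer(List<String> records) {
            this.records = records;
        }

        // Returns up to maxRecords records, or an empty list once everything was read.
        List<String> poll(int maxRecords) {
            int end = Math.min(position + maxRecords, records.size());
            List<String> batch = new ArrayList<>(records.subList(position, end));
            position = end;
            return batch;
        }
    }

    // Starts one worker thread per input list (at least one list is required).
    // Every worker owns its own consumer and processes records in arrival order.
    static Map<Integer, List<String>> runWorkers(List<List<String>> recordsPerWorker) {
        Map<Integer, List<String>> processed = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(recordsPerWorker.size());
        for (int i = 0; i < recordsPerWorker.size(); i++) {
            final int workerId = i;
            final List<String> source = recordsPerWorker.get(i);
            pool.execute(() -> {
                // Created inside the worker, so it never crosses a thread boundary.
                StubConsumer consumer = new StubConsumer(source);
                List<String> seen = new ArrayList<>();
                List<String> batch;
                while (!(batch = consumer.poll(2)).isEmpty()) {
                    seen.addAll(batch); // "processing" here just records the value
                }
                processed.put(workerId, seen);
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return processed;
    }

    public static void main(String[] args) {
        System.out.println(runWorkers(List.of(
                List.of("p0-a", "p0-b", "p0-c"),
                List.of("p1-a", "p1-b"))));
    }
}
```

Because no state is shared between workers except the final result map, there is no locking on the hot path, which is where the "easiest" and "often the fastest" points above come from.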

2. Decouple Consumption and Processing

Another alternative is to have one or more consumer threads that do all data consumption and hand off ConsumerRecords instances to a blocking queue consumed by a pool of processor threads that actually handle the record processing. This option likewise has pros and cons:

PRO: This option allows independently scaling the number of consumers and processors. This makes it possible to have a single consumer that feeds many processor threads, avoiding any limitation on partitions.

CON: Guaranteeing order across the processors requires particular care: because the threads execute independently, an earlier chunk of data may actually be processed after a later chunk of data just due to the luck of thread execution timing. For processing that has no ordering requirements this is not a problem.

CON: Manually committing the position becomes harder as it requires that all threads co-ordinate to ensure that processing is complete for that partition.

There are many possible variations on this approach. For example, each processor thread can have its own queue, and the consumer threads can hash into these queues using the TopicPartition to ensure in-order consumption and simplify commit.
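The per-processor-queue variation can be sketched with JDK classes only. Everything here is an assumption made for illustration: `Rec` is a hypothetical stand-in for a consumed record, the calling thread plays the role of the single consumer thread, and the "hash" is just the partition number modulo the processor count rather than a hash of a real TopicPartition.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class DecoupledProcessingSketch {

    // Hypothetical stand-in for a consumed record.
    static final class Rec {
        final int partition;
        final long offset;

        Rec(int partition, long offset) {
            this.partition = partition;
            this.offset = offset;
        }

        @Override
        public String toString() {
            return "p" + partition + "@" + offset;
        }
    }

    // Sentinel that tells a processor thread to stop.
    private static final Rec POISON = new Rec(-1, -1);

    // Routes every record to the queue chosen by its partition, so all records
    // of one partition are handled by the same processor, in order.
    static List<List<Rec>> dispatch(List<Rec> polled, int processorCount) {
        List<BlockingQueue<Rec>> queues = new ArrayList<>();
        List<List<Rec>> handled = new ArrayList<>();
        List<Thread> processors = new ArrayList<>();
        for (int i = 0; i < processorCount; i++) {
            BlockingQueue<Rec> queue = new ArrayBlockingQueue<>(16);
            List<Rec> sink = new ArrayList<>(); // written by one processor, read after join()
            queues.add(queue);
            handled.add(sink);
            Thread t = new Thread(() -> {
                try {
                    for (Rec r = queue.take(); r != POISON; r = queue.take()) {
                        sink.add(r); // "processing" here just records the value
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            processors.add(t);
            t.start();
        }
        try {
            // Consumer side: this thread alone hands records to the processors.
            for (Rec r : polled) {
                queues.get(Math.floorMod(r.partition, processorCount)).put(r);
            }
            for (BlockingQueue<Rec> q : queues) {
                q.put(POISON);
            }
            for (Thread t : processors) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return handled;
    }

    public static void main(String[] args) {
        List<Rec> polled = new ArrayList<>();
        for (long offset = 0; offset < 3; offset++) {
            for (int partition = 0; partition < 3; partition++) {
                polled.add(new Rec(partition, offset));
            }
        }
        System.out.println(dispatch(polled, 2));
    }
}
```

Since one processor sees all records of a given partition in offset order, the last offset it finished for that partition is a safe commit point, which is what makes committing simpler in this variation than with a single shared queue.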

In my experience, option #1 is the best for starting out, and you can upgrade to option #2 only if you really need it. Option #2 is the only way to extract the maximum performance from the Kafka consumer, but its implementation is more complex. So, give option #1 a try first, and see if it's good enough for your specific use case.

The full Javadoc is available at this link: https://kafka.apache.org/23/javadoc/index.html?org/apache/kafka/clients/consumer/KafkaConsumer.html

I have several, but I'd have to dig them out. Regarding the second CON above: depending on the use case, committing is the most difficult part, but it can be vastly simplified if you have per-processor-thread queues. The rest is pretty straightforward. I can edit my post later today to add an example. - mjuarez, Aug 01 '19 at 17:00
