
I have some basic Kafka Streaming code that reads records from one topic, does some processing, and outputs records to another topic.

How does Kafka streaming handle concurrency? Is everything run in a single thread? I don't see this mentioned in the documentation.

If it's single threaded, I would like options for multi-threaded processing to handle high volumes of data.

If it's multi-threaded, I need to understand how this works and how to handle resources, like SQL database connections should be shared in different processing threads.

Is Kafka's built-in streaming API not recommended for high volume scenarios relative to other options (Spark, Akka, Samza, Storm, etc)?

clay

2 Answers


Update Oct 2020: I wrote a four-part blog series on Kafka fundamentals that I'd recommend reading for questions like these. For this question in particular, take a look at part 3 on processing fundamentals.

To your question:

How does Kafka streaming handle concurrency? Is everything run in a single thread? I don't see this mentioned in the documentation.

This is documented in detail at http://docs.confluent.io/current/streams/architecture.html#parallelism-model. Rather than copy-pasting that here verbatim, I want to highlight that IMHO the key concept to understand is the partition (cf. Kafka's topic partitions, which Kafka Streams generalizes to "stream partitions", since not all data streams being processed go through Kafka). A partition is currently what determines the parallelism of both Kafka itself (the broker/server side) and of stream processing applications that use the Kafka Streams API (the client side).

If it's single threaded, I would like options for multi-threaded processing to handle high volumes of data.

Processing a partition will always be done by a single "thread", which ensures you do not run into concurrency issues. But, fortunately, ...

If it's multi-threaded, I need to understand how this works and how to handle resources, like SQL database connections should be shared in different processing threads.

...because Kafka allows a topic to have many partitions, you still get parallel processing. For example, if a topic has 100 partitions, then up to 100 stream tasks (or, somewhat over-simplified: up to 100 different machines each running an instance of your application) may process that topic in parallel. Again, every stream task would get exclusive access to 1 partition, which it would then process.
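To make that model concrete, here is a minimal sketch of such an application (the topic names, application ID, and broker address are hypothetical, and the `StreamsBuilder` API shown assumes Kafka 1.0+). Every instance of this program started with the same `application.id` joins the same group, and Kafka Streams spreads the input topic's partitions across all of them:

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class ExampleStreamsApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        // All instances sharing this application.id form one group;
        // Kafka Streams assigns each input partition to exactly one
        // stream task across those instances.
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "example-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Each stream task processes exactly one partition of "input-topic",
        // so this mapper is never invoked concurrently within a task.
        builder.<String, String>stream("input-topic")
               .mapValues(value -> value.toUpperCase())
               .to("output-topic");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

So if "input-topic" has 100 partitions, you could run this same program on up to 100 machines and each would process its share of partitions.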

Is Kafka's built-in streaming API not recommended for high volume scenarios relative to other options (Spark, Akka, Samza, Storm, etc)?

Kafka's stream processing engine is definitely recommended, and is actually used in practice, for high-volume scenarios. Work on comparative benchmarking is still being done, but in many cases a Kafka Streams-based application turns out to be faster. See the LINE engineering blog post Applying Kafka Streams for internal message delivery pipeline by LINE Corp, one of the largest social platforms in Asia (220M+ users), which describes how they use Kafka and the Kafka Streams API in production to process millions of events per second.

miguno
    Link to LINE engineer's blog is broken in the meantime. You can find it here: https://engineering.linecorp.com/en/blog/detail/80 – Tim Van Laer Jul 10 '17 at 13:01
  • @MichaelG.Noll What about sharing resource among multiple threads of a single instance of streams application. If my ValueMapper is not thread safe then is it okay to run an app instance with multiple threads? – mrnakumar Sep 26 '17 at 16:08
  • Yes. The unit of work in Kafka's Streams API is a "stream task", and a stream task is run exclusively by one thread. This means your ValueMapper does not need to be thread-safe. And yes, it is ok to run an app instance with multiple threads. – miguno Sep 27 '17 at 07:36
  • I am a bit confused @miguno. Does concurrency happen at the broker level, isn't it dependent on the number of partitions and the consumer configuration for these partitions? Assume I have three consumers, three partitions on a single broker. I define the num_stream_threads_config to FIVE. What is supposed to happen? – SriniMurthy Apr 14 '22 at 06:24
  • Updating the link again for LINE blog -> https://engineering.linecorp.com/en/blog/applying-kafka-streams-for-internal-message-delivery-pipeline/ – robcxyz Jul 21 '22 at 10:19
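Regarding the original question about sharing resources such as SQL database connections: rather than sharing one connection across threads, one pattern is to give each stream task its own connection via `transformValues` with a supplier, which Kafka Streams invokes once per task. A hedged sketch under that assumption (class name, JDBC URL, and the elided query are all hypothetical; the `init`/`transform`/`close` interface shown is from recent Kafka Streams versions):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

import org.apache.kafka.streams.kstream.ValueTransformer;
import org.apache.kafka.streams.processor.ProcessorContext;

// A transformer that owns its own, non-shared DB connection. Because the
// supplier creates one instance per stream task, and a task is executed
// by a single thread, this connection is never used concurrently.
public class DbEnrichTransformer implements ValueTransformer<String, String> {
    private final String jdbcUrl;
    private Connection connection;

    public DbEnrichTransformer(String jdbcUrl) {
        this.jdbcUrl = jdbcUrl;
    }

    @Override
    public void init(ProcessorContext context) {
        try {
            connection = DriverManager.getConnection(jdbcUrl);
        } catch (SQLException e) {
            throw new RuntimeException("could not open connection", e);
        }
    }

    @Override
    public String transform(String value) {
        // ... use 'connection' to enrich 'value' (query elided) ...
        return value;
    }

    @Override
    public void close() {
        try {
            connection.close();
        } catch (SQLException e) {
            // best-effort cleanup on task shutdown
        }
    }
}
```

Wired into a topology as `stream.transformValues(() -> new DbEnrichTransformer(jdbcUrl))`, so each task constructs a fresh transformer and hence a fresh connection.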

The Kafka Streams config num.stream.threads lets you raise the number of processing threads per application instance above the default of 1. However, it may be preferable to simply run multiple instances of your streaming app, all of them in the same consumer group. That way you can spin up as many instances as you need to get optimal partitioning.
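For example (values are illustrative; threads per instance times number of instances should not exceed the input topic's partition count, or the extra threads will sit idle):

```properties
# Same application.id on every instance -> one consumer group
application.id=my-streaming-app
# Run 4 processing threads inside this single instance; each thread
# executes one or more stream tasks (one task per input partition)
num.stream.threads=4
```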

Nicholas
    My case is that stream tasks are HTTP calls and not CPU intensive - but wait intensive. I'd like to run a configurable amount of threads per consumer group, e.g. for 100 partitions I'd run 5 apps, 20 threads per so that each app handles 20 partitions. I can't seem to figure out how to do this. I'm assuming I'd set num.stream.threads to 20 in this case? – Lo-Tan Jan 09 '20 at 02:31
  • Did you find the right config for this ? – Catalin Apr 12 '23 at 13:14