10

I am new to Kafka Streams, I am currently confused with the maximum parallelism of Kafka Streams application. I went through following link and did not get the answer what I am trying to find. https://docs.confluent.io/current/streams/faq.html#streams-faq-scalability-maximum-parallelism

If I have 2 input topics, one have 10 partitions and the other have 5 partitions, and only one Kafka Streams application instance is running to process these two input topics, what is the maximum thread number I can have in this case? 10 or 15?

OneCricketeer
  • 179,855
  • 19
  • 132
  • 245
Arvin.Z
  • 113
  • 1
  • 1
  • 7
  • when you app startup, you can see task names are of the form _. The number of kafka stream tasks is being determined by number of sub-topologies which is determined by your overall stream topology. and the number of partition is given by the number of partitions you have at each source /topic. And I recommend you read this https://medium.com/@andy.bryant/kafka-streams-work-allocation-4f31c24753cc – RoundPi May 04 '20 at 11:34

2 Answers2

21

If I have 2 input topics, one have 10 partitions and the other have 5 partitions

Sounds good. So you have 15 total partitions. Let's assume you have a simple processor topology, without joins and aggregations, so that all 15 partitions are just being statelessly transformed.

Then, each of the 15 input partitions will map to a single a Kafka Streams "task". If you have 1 thread, input from these 15 tasks will be processed by that 1 thread. If you have 15 threads, each task will have a dedicated thread to handle its input. So you can run 1 application with 15 threads or 15 applications with 1 thread and it's logically similar: you process 15 tasks in 15 threads. The only difference is that 15 applications with 1 thread allows you to spread your load over across JVMs.

Likewise, if you start 15 instances of the application, each instance with 1 thread, then each application will be assigned 1 task, and each 1 thread in each application will handle its given 1 task.

what is the maximum thread number I can have in this case? 10 or 15?

You can set your maximum thread count to anything. If your thread count across all tasks exceeds the total number of tasks, then some of the threads will remain idle.


I recommend reading https://docs.confluent.io/current/streams/architecture.html#parallelism-model, if you haven't yet. Also, study the logs your application produces when it starts up. Each thread logs the tasks it gets assigned, like this:

[2018-01-04 16:45:26,859] INFO (org.apache.kafka.streams.processor.internals.StreamThread:351) stream-thread [entities-eb9c0a9b-ecad-48c1-b4e8-715dcf2afef3-StreamThread-3] partition assignment took 110 ms.
current active tasks: [0_0, 0_2, 1_2, 2_2, 3_2, 4_2, 5_2, 6_2, 7_2, 8_2, 9_2, 10_2, 11_2, 12_2, 13_2, 14_2]
current standby tasks: []
previous active tasks: []
Dmitry Minkovsky
  • 36,185
  • 26
  • 116
  • 160
11

Dmitry's answer does not seems to be completely correct.

Then, each of the 15 input partitions will map to a single a Kafka Streams "task"

Not in general. It depends on the "structure" of your topology. It could also be only 10 tasks.

Otherwise, excellent answer from Dmitry!

Matthias J. Sax
  • 59,682
  • 7
  • 117
  • 137
  • Hmmm...thanks for the reply, can you describe more detail in which condition it will be 10 tasks? – Arvin.Z Jan 09 '18 at 04:51
  • For example, if you do `stream1 = builder.stream("topic10"); stream2 = builder.stream("topic15"); stream1.merge(stream2).map().to("output)" -- in v1.0 you can get detailed runtime information: https://docs.confluent.io/current/streams/monitoring.html – Matthias J. Sax Jan 09 '18 at 22:00
  • get it, thanks. I won't perform merge or any other actions between those two input topics, they are separate data flow, so in this case I can have up to 15 non-idle threads, am i right? – Arvin.Z Jan 11 '18 at 02:02
  • I guess so -- the best way to check is locking into the logs -- the created tasks as logged -- you can have one thread per task max without ending up with idle threads. For example, if you have repartitioning step and thus get two subtopologies, the number of task can double, too. – Matthias J. Sax Jan 11 '18 at 08:09
  • 1
    Thanks for the correction! Yes, obviously that statement is incorrect generally. But is it correct under the assumption I introduce in my answer: "Let's assume you have a simple processor topology, without joins and aggregations"? Should I have added merges, too? – Dmitry Minkovsky Jan 24 '18 at 14:25
  • Merging two KStreams does not create subtopologies. – Matthias J. Sax Jan 24 '18 at 21:28