I know this question was asked before here: Kafka Streaming Concurrency?

But this still seems very strange to me. According to the documentation (unless I am missing something), each partition has a task, meaning a separate processor instance, and each task is executed by a different thread. But when I tested it, I saw that a single thread can be handed different processor instances, and the same processor instance can be called from different threads. So if you want to keep any in-memory state (the old-fashioned way) in your processor, you must lock?

Example code:

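import java.util.UUID;
import org.apache.kafka.streams.processor.AbstractProcessor;
// JsonObject comes from whichever JSON library the application uses.

public class SomeProcessor extends AbstractProcessor<String, JsonObject> {

   // Unique per processor instance, so the output shows which instance handled each record.
   private final String ID = UUID.randomUUID().toString();

   @Override
   public void process(String key, JsonObject value) {
     System.out.println("Thread id: " + Thread.currentThread().getId() + " ID: " + ID);
   }
}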

OUTPUT:

Thread id: 88 ID: 26b11094-a094-404b-b610-88b38cc9d1ef

Thread id: 88 ID: c667e669-9023-494b-9345-236777e9dfda

Thread id: 88 ID: c667e669-9023-494b-9345-236777e9dfda

Thread id: 90 ID: 0a43ecb0-26f2-440d-88e2-87e0c9cc4927

Thread id: 90 ID: c667e669-9023-494b-9345-236777e9dfda

Thread id: 90 ID: c667e669-9023-494b-9345-236777e9dfda

Is there a way to enforce one thread per processor instance?

Ehud Lev

1 Answer

The number of threads per instance is a configuration parameter (num.stream.threads with default value of 1). Thus, if you start a single KafkaStreams instance you get num.stream.threads threads.
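As a minimal sketch of where that parameter goes: the configuration below uses only `java.util.Properties` and the plain string keys. `num.stream.threads` is the setting discussed above; the `application.id` and `bootstrap.servers` values are placeholders, and the class and method names are made up for illustration.

```java
import java.util.Properties;

public class StreamThreadsConfig {

    // Builds a Kafka Streams style configuration with the given thread count.
    static Properties buildConfig(int numThreads) {
        Properties props = new Properties();
        // Placeholder values; substitute your own application id and brokers.
        props.setProperty("application.id", "some-streams-app");
        props.setProperty("bootstrap.servers", "localhost:9092");
        // Number of stream threads started by a single KafkaStreams instance.
        props.setProperty("num.stream.threads", String.valueOf(numThreads));
        return props;
    }

    public static void main(String[] args) {
        Properties props = buildConfig(4);
        System.out.println(props.getProperty("num.stream.threads")); // prints 4
    }
}
```

The resulting `Properties` object would then be passed to the `KafkaStreams` constructor together with the topology.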

Tasks split the work into parallel units (based on your input topic partitions) and are assigned to threads. Thus, if you have multiple tasks and a single thread, all tasks are assigned to that thread. If you have two threads (summed over all KafkaStreams instances), each thread executes about 50% of the tasks.

Note: because a Kafka Streams application is distributed in nature, it makes no difference whether you run a single KafkaStreams instance with multiple threads or multiple KafkaStreams instances with one thread each. Tasks are distributed over all available threads of your application.
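To make the task-to-thread relationship concrete, here is a toy model in plain Java. It is only an illustration: the round-robin rule, the `task-N` and `thread-N` names, and the class itself are assumptions for the sketch, not the assignment logic Kafka Streams actually uses.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TaskAssignmentSketch {

    // Toy round-robin assignment; NOT the assignor Kafka Streams really uses.
    static Map<String, List<String>> assign(int partitions, int threads) {
        Map<String, List<String>> byThread = new LinkedHashMap<>();
        for (int t = 0; t < threads; t++) {
            byThread.put("thread-" + t, new ArrayList<>());
        }
        for (int p = 0; p < partitions; p++) {
            // One task per input partition, handed to the next thread in turn.
            byThread.get("thread-" + (p % threads)).add("task-" + p);
        }
        return byThread;
    }

    public static void main(String[] args) {
        // One thread: it runs every task.
        System.out.println(assign(4, 1)); // {thread-0=[task-0, task-1, task-2, task-3]}
        // Two threads: each runs about half of the tasks.
        System.out.println(assign(4, 2)); // {thread-0=[task-0, task-2], thread-1=[task-1, task-3]}
    }
}
```

In this model it does not matter whether the two threads live in one KafkaStreams instance or in two; only the total number of threads changes the split.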

If you want to share a data structure between tasks and you have more than one thread, it is your responsibility to synchronize access to it. Note that the task-to-thread assignment can change at runtime, so all access must be synchronized. However, this pattern is not recommended because it limits scalability. You should design your program with no shared data structures! The main reason is that your program is, in general, distributed over multiple machines, so different KafkaStreams instances cannot access a shared data structure anyway. Sharing a data structure only works within a single JVM, and using a single JVM prevents horizontal scale-out of your application.
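If a structure shared within one JVM really is unavoidable, a rough sketch of synchronized access could look like the following. It uses plain Java threads as a stand-in for stream threads; `SHARED_COUNTS`, `recordKey`, and `simulate` are hypothetical names, and `recordKey` only represents what the body of `process()` might do.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

public class SharedStateSketch {

    // Hypothetical JVM-wide structure shared by all processor instances.
    // ConcurrentHashMap plus LongAdder makes each update thread-safe.
    static final Map<String, LongAdder> SHARED_COUNTS = new ConcurrentHashMap<>();

    // Stand-in for the body of process(key, value); safe to call from any thread.
    static void recordKey(String key) {
        SHARED_COUNTS.computeIfAbsent(key, k -> new LongAdder()).increment();
    }

    // Simulates several stream threads, each handling recordsPerThread records.
    static long simulate(int threads, int recordsPerThread) {
        SHARED_COUNTS.clear();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                for (int i = 0; i < recordsPerThread; i++) {
                    recordKey("some-key");
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("simulation did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return SHARED_COUNTS.get("some-key").sum();
    }

    public static void main(String[] args) {
        System.out.println(simulate(4, 10_000)); // prints 40000
    }
}
```

The counts are correct only inside this one JVM; a second KafkaStreams instance on another machine would keep its own, separate map, which is the scalability limit described above.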

Matthias J. Sax
  • Thank you @matthias-j-sax for the reply. I was actually aware of those options. The thing is that I will not write my stream as "not thread safe" and force the use of one thread, because that is bad practice. I was just assuming that Kafka Streams would be "thread safe" on the process / punctuate API, similar to the Akka actor pattern. But I guess I was wrong. Anyway, now I understand that I have no option other than old-fashioned locking. Thanks again – Ehud Lev Nov 07 '17 at 20:02
  • Hi @mathhias-j-sax - we are up against this issue too. Is it possible for an instance of a processor to be running process() and punctuate() simultaneously on different threads? – Jon Bates Oct 31 '18 at 15:44
  • That is not possible. The `process()` method and registered `Punctuator`s are executed in a single thread. – Matthias J. Sax Nov 01 '18 at 21:31