I have multiple Kafka Streams instances with the same application.id running on the same machine.

Can I use the same state.dir for every instance, or does each instance need a different state.dir?

I read somewhere that sharing a state.dir can lock the global state directory when all instances run on the same machine.

Nandish Kotadia
  • It's not recommended, and if you use global state, it won't work. However, why do you want to have multiple instances? Why not increase the number of used threads of a single instance? – Matthias J. Sax Oct 17 '19 at 08:20
  • Is there any advantage to using more threads versus more instances? We are performing a join operation, and when I stop and restart the process it takes a long time to rebuild the local state store if a different instance picks up a different partition. – Nandish Kotadia Oct 17 '19 at 08:21
  • On the same machine, there is no difference during regular operations. Reusing existing state would be easier though, because during rebalance we actually know that state is on the same host -- if you use multiple instances, it's unclear during rebalance if two instances are on the same host or not and hence, "sticky" assignment is negatively impacted. – Matthias J. Sax Oct 17 '19 at 08:28
  • Okay, thanks a lot. I will try that. We are facing scaling issues with this real-time join, since we receive around 10 million events an hour and the state is maintained locally. Can we maintain centralized state in something like Redis instead of RocksDB? @MatthiasJ.Sax – Nandish Kotadia Oct 17 '19 at 09:00
  • You could implement a `StateStoreSupplier` and `TimestampedKeyValueStore` that does the communication to an external store -- but it's hard to get right and there are strings attached. It's not recommended. If you have scaling issue, try to deploy more instances on different hosts to scale out the application. – Matthias J. Sax Oct 17 '19 at 18:48
  • I want to avoid separate instances for the reason mentioned above: on restart they might start rebuilding the state store, which is very costly. Is there anything I need to be cautious about before using StateStoreSupplier and TimestampedKeyValueStore? Any advice before jumping in? – Nandish Kotadia Oct 18 '19 at 05:07
  • `it might again start building the state store on restart and it is very costly` -- if you scale out dynamically yes. Note that we are working on the problem already via https://cwiki.apache.org/confluence/display/KAFKA/KIP-441%3A+Smooth+Scaling+Out+for+Kafka+Streams -- About building a custom central remote store, the biggest challenge would be fencing off zombies and to make exactly-once processing work. I would still not recommend to build a custom remote store. – Matthias J. Sax Oct 18 '19 at 14:28
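
The two setups discussed in the comments can be sketched as plain configuration. This is only an illustration, not a definitive setup: the class and method names, the `join-app` id, the paths, and the broker address are hypothetical, and only the config keys (`application.id`, `bootstrap.servers`, `state.dir`, `num.stream.threads`) are actual Kafka Streams settings. The first method follows the suggestion to run one instance with more threads; the second assumes you still want several instances on one host and gives each its own `state.dir` so they do not contend for the same directory.

```java
import java.util.Properties;

public class StreamsConfigSketch {

    // One instance per machine, scaled by adding stream threads.
    // All threads share the single state.dir of that instance.
    static Properties singleInstance(String applicationId, String stateDir, int threads) {
        Properties props = new Properties();
        props.setProperty("application.id", applicationId);
        props.setProperty("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.setProperty("state.dir", stateDir);
        props.setProperty("num.stream.threads", Integer.toString(threads));
        return props;
    }

    // Several instances on the same machine: same application.id,
    // but a distinct state.dir per instance (hypothetical naming scheme).
    static Properties perInstance(String applicationId, String baseStateDir, int instanceIndex) {
        String stateDir = baseStateDir + "/instance-" + instanceIndex;
        return singleInstance(applicationId, stateDir, 1);
    }

    public static void main(String[] args) {
        // Preferred setup from the comments: one process, four threads.
        System.out.println(singleInstance("join-app", "/var/lib/kafka-streams", 4));

        // Alternative: two processes on one host with separate state directories.
        System.out.println(perInstance("join-app", "/var/lib/kafka-streams", 0));
        System.out.println(perInstance("join-app", "/var/lib/kafka-streams", 1));
    }
}
```

The resulting `Properties` object would be passed to the `KafkaStreams` constructor together with the topology. Keeping the same `state.dir` across restarts of the same instance is what lets it reuse the existing local state instead of rebuilding it.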
