
I am trying to better understand Kafka Streams performance, specifically what happens when a KStream application starts up and uses a KTable with millions of records. Is there a GET/query operation that materializes the KTable topic's data into the local RocksDB store? Any additional information about startup performance issues (mainly after a forced shutdown) would be appreciated.

I have tried courses and articles on the internet but found little of value; most only cover how to use it.

1 Answer


A KTable in Kafka Streams is generally a partitioned store backed by:

  • A changelog topic on the Kafka broker (for resilience)
  • A RocksDB instance for each input partition (i.e., one per Kafka Streams task)

When you query a KTable, you (generally) need to do two things:

  1. Determine the Streams Instance that hosts the partition for your key, via KafkaStreams#queryMetadataForKey().
  2. Use Interactive Queries to contact that host, which then does (to simplify things) a store.get(key).

The store.get() is a lookup on RocksDB.
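The two steps above can be sketched in plain Java. This is a hypothetical simulation, not the real Kafka Streams API: each `Map` stands in for one task's RocksDB store, and a simple hash-based `partitionFor` stands in for `queryMetadataForKey()` (Kafka's default partitioner actually uses murmur2 on the serialized key).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the two-step KTable lookup. In real Kafka Streams
// you would call KafkaStreams#queryMetadataForKey() to find the owning host,
// then query its ReadOnlyKeyValueStore; here each "host" is just a Map
// standing in for one task's RocksDB instance.
public class KTableLookupSketch {
    // One store per input partition; partition i lives on "instance" i here.
    private final List<Map<String, String>> storesByPartition;

    public KTableLookupSketch(int numPartitions) {
        storesByPartition = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            storesByPartition.add(new HashMap<>());
        }
    }

    // Step 1: determine which partition (and thus which instance) owns a key.
    int partitionFor(String key) {
        return Math.floorMod(key.hashCode(), storesByPartition.size());
    }

    public void put(String key, String value) {
        storesByPartition.get(partitionFor(key)).put(key, value);
    }

    // Step 2: "contact" the owning store and do a get(key), which in
    // Kafka Streams is ultimately a RocksDB lookup.
    public String lookup(String key) {
        return storesByPartition.get(partitionFor(key)).get(key);
    }
}
```

In a real multi-instance deployment, step 2 is where Interactive Queries come in: if the metadata says another host owns the key, you forward the request (e.g., over HTTP) to that host instead of reading locally.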

Every time your app writes a record to RocksDB, it is also written to your changelog topic. That way, if one of your streams app instances crashes, another instance can "replay" the changelog topic to build the state that used to live on the now-dead instance.
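Restoration itself is conceptually simple: replay every changelog record from the beginning, where each record is the latest value for its key and a null value is a tombstone (delete). A minimal sketch, with the record list standing in for the changelog topic:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of changelog restoration: replaying the changelog
// from offset 0 rebuilds the exact store contents of the dead instance.
public class ChangelogRestore {
    public static Map<String, String> restore(
            List<SimpleEntry<String, String>> changelog) {
        Map<String, String> store = new HashMap<>();
        for (SimpleEntry<String, String> record : changelog) {
            if (record.getValue() == null) {
                store.remove(record.getKey()); // null value = tombstone
            } else {
                store.put(record.getKey(), record.getValue());
            }
        }
        return store;
    }
}
```

With millions of records, this replay is exactly the slow startup cost the question is about: the time is roughly proportional to the size of the changelog that must be read from the broker.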

That process is called "restoration" and it can take a long time. So, you can enable Standby Tasks for faster failover. A Standby Task is just a task on a streams instance that reads the changelog in real-time so that if another streams instance crashes, the data is already "hot" and ready to go on a currently-live streams instance.
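Enabling standby tasks is a one-line config change. A sketch of the relevant Streams property (the value 1 means one warm replica of each store on another instance):

```properties
# Keep one standby replica of each task's state store on another
# instance, so failover skips (most of) the changelog replay.
num.standby.replicas=1
```

Note that standbys trade broker traffic and disk on the standby instances for faster failover; they do not speed up a cold start where no instance has the state.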

A note on RocksDB performance: it tends to degrade once you get past 30GB of data in one store. So use that as a guideline for how many input partitions you need to specify.
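As a back-of-the-envelope check on that guideline, with hypothetical numbers: if you expect the table to hold ~300 GB of state in total and want each store under ~30 GB, you need at least 10 input partitions.

```java
public class PartitionSizing {
    // Rough sizing sketch using the ~30 GB-per-store guideline above.
    // Inputs are estimates; partitions cannot be added to a topic's
    // key-distribution retroactively without repartitioning, so size up front.
    public static int minPartitions(long totalStateGb, long maxPerStoreGb) {
        return (int) Math.ceil((double) totalStateGb / maxPerStoreGb);
    }
}
```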

cmcnealy