Why does Apache Kafka Streams use RocksDB, and how is it possible to change it?

Question

While investigating the new features in Apache Kafka 0.9 and 0.10, we used KStreams and KTables. Interestingly, Kafka Streams uses RocksDB internally; see "Introducing Kafka Streams: Stream Processing Made Simple". RocksDB is not written in JVM compatible language, so it needs careful handling of the deployment, as it needs extra shared library (OS dependent).

So here are two simple questions:

Why does Apache Kafka Streams use RocksDB?
How is it possible to change it?

I tried to search for an answer, but I found only an implicit reason: RocksDB is very fast, handling on the order of millions of operations per second.

On the other hand, I see some databases written in Java that could perhaps perform just as well end to end, since they do not go through JNI.

@miguno: you are right, as long as there are no bugs :-). But when bugs occur or debugging sessions are needed, any non-JVM code makes things quite complicated, doesn't it? The second matter is that I do not see anything in the documentation that specifies which platforms Kafka Streams will run on, since that is limited by the RocksDB shared libraries. It is a matter of transparency. — Seweryn Habdank-Wojewódzki, Oct 19 '16 at 08:12

score 36 · Accepted Answer · edited Feb 03 '22 at 14:49

RocksDB is used for several (internal) reasons (as you already mentioned, for example, its performance). Conceptually, Kafka Streams does not need RocksDB -- it is used as an internal key-value cache, and any other store offering similar functionality would work, too.

Comment from @miguno below (rephrased):

One important advantage of RocksDB in contrast to pure in-memory key-value stores is its ability to write to disk. Thus, Kafka Streams can support state that is larger than the available main memory.

Comment from @miguno above:

FYI: "RocksDB is not written in JVM compatible language, so it needs careful handling of the deployment, as it needs extra shared library (OS dependent)." As a user of Kafka Streams you don't need to install anything.

Using the Kafka Streams DSL, as of the 0.10.2 release (KAFKA-3825), it is possible to plug in custom state stores and to use a different key-value store.

Using the Kafka Streams Processor API, you can implement your own store via the StateStore interface and connect it to a processor node in your topology.
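
To illustrate why the backing store is interchangeable, here is a minimal sketch with no Kafka dependency. The names SimpleKeyValueStore, InMemoryStore and WordCounter are hypothetical and invented for this illustration only; this is not the real StateStore interface, which also has lifecycle methods such as init, flush and close. The point is only that the processing logic depends on key-value semantics, not on RocksDB.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified store contract (NOT the real Kafka Streams StateStore interface).
interface SimpleKeyValueStore<K, V> {
    V get(K key);
    void put(K key, V value);
}

// One possible backend: a plain in-memory map. A disk-backed class could
// implement the same interface without the processing code changing.
class InMemoryStore<K, V> implements SimpleKeyValueStore<K, V> {
    private final Map<K, V> data = new HashMap<>();

    @Override
    public V get(K key) {
        return data.get(key);
    }

    @Override
    public void put(K key, V value) {
        data.put(key, value);
    }
}

// Processing logic that depends only on the interface, not on the backend.
class WordCounter {
    private final SimpleKeyValueStore<String, Long> store;

    WordCounter(SimpleKeyValueStore<String, Long> store) {
        this.store = store;
    }

    // Increment the count for a word, starting at 1 if it was never seen.
    void process(String word) {
        Long current = store.get(word);
        store.put(word, current == null ? 1L : current + 1L);
    }

    Long countOf(String word) {
        return store.get(word);
    }
}

class PluggableStoreDemo {
    public static void main(String[] args) {
        // Swap InMemoryStore for any other implementation here.
        WordCounter counter = new WordCounter(new InMemoryStore<String, Long>());
        for (String w : new String[] {"kafka", "streams", "kafka"}) {
            counter.process(w);
        }
        System.out.println("kafka -> " + counter.countOf("kafka"));
    }
}
```

In a real topology the store would be registered with Kafka Streams rather than passed to a constructor; the sketch only shows the separation between processing logic and store implementation.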

The Kafka introduction page states that it "uses Kafka for stateful storage". Is it lying? https://kafka.apache.org/intro — petertc, Nov 06 '18 at 07:39
Not sure why you think this would not be correct? Note, that local stores are caches only -- all data that is in the stores, is also stored in a Kafka topic. — Matthias J. Sax, Nov 06 '18 at 17:31
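
The "local stores are caches only" point in the comment above can be sketched without any Kafka dependency. In this rough illustration, ChangelogSketch is a hypothetical stand-in for a Kafka topic (just an append-only list) and CachedStoreSketch is a hypothetical local store; neither is a Kafka Streams class. Every write goes to both, so a lost local store can be rebuilt by replaying the log.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a changelog topic: an append-only list of records.
class ChangelogSketch {
    static final class Record {
        final String key;
        final Long value;

        Record(String key, Long value) {
            this.key = key;
            this.value = value;
        }
    }

    private final List<Record> log = new ArrayList<>();

    void append(String key, Long value) {
        log.add(new Record(key, value));
    }

    List<Record> records() {
        return log;
    }
}

// Local store acting as a cache: every write is also appended to the changelog.
class CachedStoreSketch {
    private final Map<String, Long> local = new HashMap<>();
    private final ChangelogSketch changelog;

    CachedStoreSketch(ChangelogSketch changelog) {
        this.changelog = changelog;
    }

    void put(String key, Long value) {
        changelog.append(key, value); // durable copy
        local.put(key, value);        // fast local copy
    }

    Long get(String key) {
        return local.get(key);
    }

    // Rebuild a fresh local store by replaying the changelog in order;
    // later records for the same key overwrite earlier ones.
    static CachedStoreSketch restore(ChangelogSketch changelog) {
        CachedStoreSketch restored = new CachedStoreSketch(changelog);
        for (ChangelogSketch.Record r : changelog.records()) {
            restored.local.put(r.key, r.value);
        }
        return restored;
    }
}

class ChangelogDemo {
    public static void main(String[] args) {
        ChangelogSketch log = new ChangelogSketch();
        CachedStoreSketch store = new CachedStoreSketch(log);
        store.put("clicks", 1L);
        store.put("clicks", 2L);

        // Simulate losing the local store and rebuilding it from the log.
        CachedStoreSketch rebuilt = CachedStoreSketch.restore(log);
        System.out.println("clicks -> " + rebuilt.get("clicks"));
    }
}
```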
