In my company we use Kafka extensively, but we have been using a relational database to store the results of several intermediate transformations and aggregations, for fault-tolerance reasons. Now we are exploring Kafka Streams as a more natural way to do this. Often our needs are quite simple - one such case is:
- Listen to an input topic of `<K1,V1>, <K2,V2>, <K1,V2>, <K1,V3>...`
- For each record, perform some high-latency operation (call a remote service)
- If, by the time `<K1,V1>` is processed, both `<K1,V2>` and `<K1,V3>` have been produced, then I should process only V3, as V2 has already become stale
To achieve this, I am reading the topic as a KTable. The code looks like this:
```java
KStreamBuilder builder = new KStreamBuilder();
KTable<String, String> kTable = builder.table("input-topic");
kTable.toStream().foreach((k, v) -> client.post(v));
return builder;
```
This works as expected, but it is not clear to me how Kafka achieves this automagically. I assumed that Kafka creates internal topics for this, but I do not see any internal topics created, and the Javadoc for the method seems to confirm that observation. But then I came across this official page, which seems to suggest that Kafka uses a separate data store (RocksDB) along with a changelog topic.
Now I am confused about the circumstances under which a changelog topic is created. My questions are:
- If the default behaviour of the state store is fault-tolerant, as the official page suggests, then where is that state stored? In RocksDB, in the changelog topic, or both?
- What are the implications of relying on RocksDB in production? (EDITED)
  - As I understand it, the dependency on RocksDB is transparent (just a jar file), and RocksDB stores its data on the local file system. But this also means that, in our case, the application will maintain a copy of the sharded data on the storage where the application is running. When we replace a remote database with a KTable, this has storage implications, and that is my point.
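For context, the location of that local RocksDB state is controlled by the `state.dir` setting in the streams configuration. Below is a minimal sketch of how I set it up; the application id, broker address, and directory path are illustrative assumptions, not values from my real deployment:

```java
import java.util.Properties;

public class StreamsPropsSketch {
    public static void main(String[] args) {
        // Kafka Streams reads these keys via StreamsConfig.
        // "state.dir" is where RocksDB keeps its local files.
        Properties props = new Properties();
        props.put("application.id", "ktable-example");     // assumed name
        props.put("bootstrap.servers", "localhost:9092");  // assumed broker
        props.put("state.dir", "/var/lib/kafka-streams");  // assumed path
        // The KafkaStreams instance would be built from these properties.
        System.out.println(props.getProperty("state.dir"));
    }
}
```

Pointing `state.dir` at a dedicated volume is one way to plan for the extra storage the local copy requires.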
- Will future Kafka releases ensure that RocksDB continues to work on various platforms? (It seems to be platform-dependent and not written in Java.)
- Does it make sense to make `input-topic` log compacted?
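If compaction does make sense, my understanding is that it can be enabled on the existing topic with `kafka-configs.sh`; the ZooKeeper address below is an illustrative assumption:

```shell
# Enable log compaction on the input topic (0.11.x-era syntax).
# localhost:2181 is an assumed ZooKeeper address.
kafka-configs.sh --zookeeper localhost:2181 \
  --entity-type topics --entity-name input-topic \
  --alter --add-config cleanup.policy=compact
```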
I am using Kafka 0.11.0.