
We're trying to build a deduplication service using Kafka Streams. The big picture is that it will use its RocksDB state store to check for existing keys during processing.
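Here's roughly what we have in mind (a minimal sketch, assuming String keys and values; the store and topic names are placeholders):

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.Stores;

public class DeduplicationTransformer
        implements Transformer<String, String, KeyValue<String, String>> {

    private KeyValueStore<String, String> store;

    @Override
    @SuppressWarnings("unchecked")
    public void init(final ProcessorContext context) {
        // "dedup-store" is our RocksDB-backed store (name is a placeholder)
        store = (KeyValueStore<String, String>) context.getStateStore("dedup-store");
    }

    @Override
    public KeyValue<String, String> transform(final String key, final String value) {
        if (store.get(key) != null) {
            return null;                   // key seen before: drop the duplicate
        }
        store.put(key, value);             // remember the key for later lookups
        return KeyValue.pair(key, value);  // first occurrence: forward it
    }

    @Override
    public void close() {}

    // Wiring: attach the store to the topology and run records through the transformer.
    static Topology buildTopology() {
        final StreamsBuilder builder = new StreamsBuilder();
        builder.addStateStore(Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore("dedup-store"),  // RocksDB on disk
                Serdes.String(), Serdes.String()));
        builder.stream("input-topic", Consumed.with(Serdes.String(), Serdes.String()))
               .transform(DeduplicationTransformer::new, "dedup-store")
               .to("deduped-topic", Produced.with(Serdes.String(), Serdes.String()));
        return builder.build();
    }
}
```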

Please correct me if I'm wrong, but to make those state stores fault tolerant too, the Kafka Streams API transparently copies the values in the state store into a Kafka topic (called the changelog). That way, if our service goes down, another instance will be able to rebuild its state store from the changelog found in Kafka.
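If I'm reading the docs right, this changelog backing is enabled by default for persistent stores; declaring it explicitly on the store builder from the sketch above would look like this (the config map holds optional overrides for the changelog topic):

```java
import java.util.Collections;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.StoreBuilder;
import org.apache.kafka.streams.state.Stores;

StoreBuilder<KeyValueStore<String, String>> dedupStoreBuilder =
        Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore("dedup-store"),
                Serdes.String(), Serdes.String())
            // logging is on by default; an empty map means broker-default
            // settings for the backing changelog topic
            .withLoggingEnabled(Collections.emptyMap());
```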

But this raises a question: is this "state store --> changelog" step itself exactly-once? I mean, when the service updates its state store, will it update the changelog in an exactly-once fashion too? If the service crashes, another instance will take over the load, but can we be sure it won't miss a state store update from the crashed service?

Regards,

Yannick

  • I talked about this at Kafka Summit 2018. You can find the slides and recording on the Kafka Summit web page: https://kafka-summit.org/sessions/dont-repeat-introducing-exactly-semantics-apache-kafka/ – Matthias J. Sax Feb 08 '19 at 17:01

2 Answers


The short answer is yes.

Using transactions (atomic multi-partition writes), Kafka Streams ensures that when the offset commit is performed, the state store has also been flushed to the changelog topic on the brokers. These operations are atomic, so if one of them fails, the application will reprocess messages from the previous offset position.
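To illustrate the mechanism, here is a rough sketch of what Streams does internally with the producer transactions API when exactly-once is enabled (you never write this yourself; the topic names, offset, and ids below are placeholders):

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.KafkaException;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringSerializer;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "dedup-service-task-0");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
KafkaProducer<String, String> producer = new KafkaProducer<>(props);

producer.initTransactions();
try {
    producer.beginTransaction();
    // the processing result and the state store update travel together:
    producer.send(new ProducerRecord<>("output-topic", "key", "value"));
    producer.send(new ProducerRecord<>("dedup-store-changelog", "key", "value"));
    // the input offset is committed as part of the same transaction:
    producer.sendOffsetsToTransaction(
            Collections.singletonMap(
                new TopicPartition("input-topic", 0), new OffsetAndMetadata(42L)),
            "dedup-service");             // consumer group id
    producer.commitTransaction();         // everything above becomes visible atomically
} catch (KafkaException e) {
    producer.abortTransaction();          // or none of it does
}
```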

You can read more about exactly-once semantics in the following blog post: https://www.confluent.io/blog/enabling-exactly-kafka-streams/. There is a section "How Kafka Streams Guarantees Exactly-Once Processing".

Bartosz Wardziński

> But this raises a question: is this "state store --> changelog" step itself exactly-once?

Yes -- as others have already said here. You must of course configure your application to use exactly-once semantics via the configuration parameter processing.guarantee, see https://kafka.apache.org/21/documentation/streams/developer-guide/config-streams.html#processing-guarantee (this link is for Apache Kafka 2.1).
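For example (the application id and bootstrap servers are placeholders):

```java
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "dedup-service");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
// the default is "at_least_once"; this switches the application to exactly-once
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
```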

> We're trying to build a deduplication service using Kafka Streams. The big picture is that it will use its RocksDB state store to check for existing keys during processing.

There's also an event de-duplication example application available at https://github.com/confluentinc/kafka-streams-examples/blob/5.1.0-post/src/test/java/io/confluent/examples/streams/EventDeduplicationLambdaIntegrationTest.java. This link points to the repo branch for Confluent Platform 5.1.0, which uses Apache Kafka 2.1.0, the latest version of Kafka available right now.

miguno