Apache Flink offers a fault tolerance mechanism to consistently recover the state of data streaming applications. The mechanism ensures that even in the presence of failures, the program’s state will eventually reflect every record from the data stream exactly once.
I am trying to understand the answer to the following question: Flink exactly-once message processing
Does this mean that the Flink sink will produce duplicate events to an external system such as Cassandra?
For example:
1 - I have the following flow: source -> flatMap with state -> sink, with the snapshot (checkpoint) interval configured to 20 seconds.
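To make the setup concrete, here is a minimal sketch of what such a job could look like. The socket source, the per-key counting state, and print() standing in for a Cassandra sink are all assumptions for illustration; only the shape (source -> stateful flatMap -> sink, 20-second checkpoint interval) matches my actual job.

    import org.apache.flink.api.common.functions.RichFlatMapFunction;
    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.CheckpointingMode;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.util.Collector;

    public class CheckpointingExample {

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Snapshot all operator state every 20 seconds
            env.enableCheckpointing(20_000, CheckpointingMode.EXACTLY_ONCE);

            env.socketTextStream("localhost", 9999)   // source (assumed host/port)
               .keyBy(line -> line)                   // keyed state requires a keyBy
               .flatMap(new CountingFlatMap())        // flatMap with state
               .print();                              // stand-in sink; the real job writes to Cassandra

            env.execute("checkpointing example");
        }

        /** Stateful flatMap: emits "value -> number of times seen so far" per key. */
        public static class CountingFlatMap extends RichFlatMapFunction<String, String> {
            private transient ValueState<Long> count;

            @Override
            public void open(Configuration parameters) {
                count = getRuntimeContext().getState(
                        new ValueStateDescriptor<>("count", Long.class));
            }

            @Override
            public void flatMap(String value, Collector<String> out) throws Exception {
                long c = count.value() == null ? 1L : count.value() + 1;
                count.update(c);
                out.collect(value + " -> " + c);
            }
        }
    }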
What will happen if the task manager goes down (is killed) between two snapshots, i.e. 10 seconds after the last snapshot and 10 seconds before the next one?
What I know is that Flink will restart the job from the last snapshot.
In this case, will the sink reprocess all the records that were already processed between the last snapshot and the failure?