
From the docs:

By default, foreachBatch provides only at-least-once write guarantees. However, you can use the batchId provided to the function as a way to deduplicate the output and get an exactly-once guarantee.

Does this mean that unless we go the extra mile, we won't achieve an exactly-once write guarantee even if we use checkpointing in the main writeStream operation?

If yes, what should be done to achieve an exactly-once write guarantee? And what is meant in the docs by:

using the batchId?

PS: The original question was specific to Kafka, but I generalized it, as the solution would apply to anything within the foreachBatch block.


1 Answer


Does this mean that unless we go the extra mile, we won't achieve an exactly-once write guarantee even if we use checkpointing in the main writeStream operation?

Structured Streaming guarantees at-least-once semantics, meaning each record will be written at least once. Even with checkpointing in place, there is no guarantee that duplicate records will not appear in the sink.

If yes, what should be done to achieve an exactly-once write guarantee? And what is meant in the docs by using the batchId?

The way to achieve exactly-once semantics varies depending on the data sink one chooses to use.

For the sake of explanation, let's take Elasticsearch as the data sink. Elasticsearch is a document store in which each document is identified by a unique doc_id. Let's say we have a DataFrame with the following schema:

|-- count: long (nullable = false)
|-- department: string (nullable = false)
|-- start_time: string (nullable = false)
|-- end_time: string (nullable = false)

In this case (department, start_time, end_time) are our key columns, and their combination is unique. We can compute a hash of these columns and use it as the doc_id when indexing into Elasticsearch via the index API.

With this approach, duplicate records hash to the same doc_id. Since Elasticsearch does not allow duplicate doc_ids within an index, it simply re-indexes the document with the same content and increments its version, so no duplicates end up in the sink. A minimal sketch of such a batch function follows.
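As a rough sketch (assuming the elasticsearch-hadoop Spark connector is on the classpath; streamingDf, the index name my_index, and the checkpoint path are hypothetical placeholders), the foreachBatch function could look like this:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{concat_ws, sha2}

// Batch function for foreachBatch: derives a deterministic doc_id from the
// key columns so that a replayed batch re-indexes the same documents
// instead of creating duplicates.
def writeToEs(batchDf: DataFrame, batchId: Long): Unit = {
  val withDocId = batchDf.withColumn(
    "doc_id",
    sha2(concat_ws("|", batchDf("department"), batchDf("start_time"), batchDf("end_time")), 256)
  )

  withDocId.write
    .format("org.elasticsearch.spark.sql")
    .option("es.mapping.id", "doc_id") // use the hash as the Elasticsearch _id
    .mode("append")
    .save("my_index")                  // hypothetical index name
}

streamingDf.writeStream
  .foreachBatch(writeToEs _)
  .option("checkpointLocation", "/tmp/checkpoints/es-sink") // hypothetical path
  .start()

Because the doc_id is derived only from the record's key columns, re-processing the same records after a failure overwrites the existing documents rather than adding new ones.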

A similar approach can be followed with other data sinks as well.
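Regarding "using the batchId": the batchId is stable across retries of the same micro-batch, so one common pattern (a sketch only, assuming a file sink and a hypothetical output path) is to key the written data by batchId, so that a retried batch overwrites its own earlier, possibly partial, output rather than appending duplicates:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

// On recovery Spark re-runs a failed micro-batch with the SAME batchId,
// so overwriting only that batch's partition replaces any partial earlier
// output instead of appending a second copy.
def writeBatch(batchDf: DataFrame, batchId: Long): Unit = {
  batchDf
    .withColumn("batch_id", lit(batchId))
    .write
    .mode("overwrite")
    .option("partitionOverwriteMode", "dynamic") // replace only the batch_id partitions being written
    .partitionBy("batch_id")
    .parquet("/tmp/output") // hypothetical output path
}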