Does this mean that unless we go the extra mile, we won't achieve an exactly-once write guarantee even if we use checkpointing in the main writeStream operation?
Structured Streaming guarantees at-least-once semantics, meaning each record will be written at least once. Even with checkpointing in place, there is no guarantee that the sink will be free of duplicate records.
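To illustrate why, here is a minimal sketch of a checkpointed query, assuming a Kafka source and sink (the servers, topics, and checkpoint path are illustrative, not from the question). The checkpoint lets Spark restart from the last committed offsets, but a micro-batch that fails mid-write is replayed in full, so the sink can still see the same record twice:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("at-least-once-demo").getOrCreate()

// Replayable source: Kafka offsets are tracked in the checkpoint.
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events-in")
  .load()

// Checkpointing prevents data loss on restart, but a micro-batch that
// crashed after partially writing is replayed in full, so the same record
// can reach the sink more than once (at-least-once semantics).
val query = events.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "events-out")
  .option("checkpointLocation", "/tmp/checkpoints/at-least-once-demo")
  .start()
```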
If yes, what should be done to achieve an exactly-once write guarantee? And what is meant in the docs by
The way to achieve exactly-once semantics will vary depending upon the data sink one chooses to use.
For the sake of explanation, let's take Elasticsearch as the data sink.
ES, as we know, is a document store in which each document is identified by a unique doc_id.
Let's say we have a DataFrame with the following schema:
|-- count: long (nullable = false)
|-- department: string (nullable = false)
|-- start_time: string (nullable = false)
|-- end_time: string (nullable = false)
In this case (department, start_time, end_time) are our key columns, whose combination uniquely identifies a record. We can compute a hash of these columns and use it as the doc_id column when indexing into Elasticsearch through the index API.
This way, duplicate records hash to the same value, and since ES does not allow duplicate doc_ids within an index, it simply updates the existing document with the same content and increments its version. The write thus becomes idempotent, which is what turns at-least-once delivery into exactly-once results.
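A sketch of what this could look like with the elasticsearch-hadoop (elasticsearch-spark) connector, whose es.mapping.id option tells it which column to use as the document _id. The DataFrame name, index name, and checkpoint path here are illustrative assumptions:

```scala
import org.apache.spark.sql.functions.{col, concat_ws, sha2}

// Deterministic doc_id: identical key columns always hash to the same value,
// so a replayed record targets the same Elasticsearch document.
val withId = countsDF.withColumn( // countsDF: a streaming DataFrame with the schema above
  "doc_id",
  sha2(concat_ws("|", col("department"), col("start_time"), col("end_time")), 256)
)

// es.mapping.id makes the connector index each row with doc_id as its _id;
// a duplicate then overwrites the existing document and bumps its _version.
val query = withId.writeStream
  .format("org.elasticsearch.spark.sql")
  .option("es.mapping.id", "doc_id")
  .option("checkpointLocation", "/tmp/checkpoints/es-demo")
  .start("department_counts") // target index; add "/type" on older ES versions
```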
A similar approach, deriving an idempotent write from the record's natural key, can be followed for other data sinks as well; a sketch for a relational sink follows.
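For instance, for a JDBC sink one possible pattern (a sketch, not from the original answer) is foreachBatch combined with an upsert keyed on the same columns. The PostgreSQL URL, table name, and ON CONFLICT syntax are assumptions; the PostgreSQL JDBC driver is assumed to be on the classpath:

```scala
import java.sql.DriverManager
import org.apache.spark.sql.DataFrame

// PostgreSQL-style upsert keyed on the same columns we hashed for ES; the
// connection URL and table name are assumptions for this sketch.
val upsertSql =
  """INSERT INTO department_counts (department, start_time, end_time, count)
    |VALUES (?, ?, ?, ?)
    |ON CONFLICT (department, start_time, end_time)
    |DO UPDATE SET count = EXCLUDED.count""".stripMargin

// Replaying a failed micro-batch re-runs the same upserts, which rewrite the
// same rows instead of duplicating them, so the write is idempotent.
def upsertBatch(batch: DataFrame, batchId: Long): Unit =
  batch.rdd.foreachPartition { rows =>
    val conn = DriverManager.getConnection("jdbc:postgresql://localhost/metrics")
    val stmt = conn.prepareStatement(upsertSql)
    try {
      rows.foreach { r =>
        stmt.setString(1, r.getAs[String]("department"))
        stmt.setString(2, r.getAs[String]("start_time"))
        stmt.setString(3, r.getAs[String]("end_time"))
        stmt.setLong(4, r.getAs[Long]("count"))
        stmt.addBatch()
      }
      stmt.executeBatch()
    } finally conn.close()
  }

val query = countsDF.writeStream
  .foreachBatch(upsertBatch _)
  .option("checkpointLocation", "/tmp/checkpoints/jdbc-demo")
  .start()
```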