I'd like to better understand the consistency model of Spark 2.2 Structured Streaming in the following case:
- one source (Kinesis)
- 2 queries reading from this source and writing to 2 different sinks: one file sink for archiving (S3), and another sink for processed data (DB or file, not yet decided)
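
The setup above could be sketched roughly as follows. This is only an illustration under assumptions, not working production code: Apache Spark 2.2 does not ship a Structured Streaming Kinesis source, so the `kinesis` format, its `streamName` and `region` options, and the binary `data` column are assumed to come from a vendor or third-party connector (the names here follow the Databricks one). The stream name, bucket, paths, and the `upper(...)` transformation are hypothetical placeholders.

```scala
import org.apache.spark.sql.SparkSession

object TwoSinksSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("two-sinks-sketch").getOrCreate()

    // ASSUMPTION: a Kinesis source registered under the "kinesis" format
    // (vendor / third-party connector, not part of Apache Spark 2.2).
    // Stream name and region are placeholders.
    val source = spark.readStream
      .format("kinesis")
      .option("streamName", "my-stream")
      .option("region", "us-east-1")
      .load()

    // Query 1: archive the raw records to S3 with the file sink.
    // ASSUMPTION: the connector exposes the payload as a binary "data" column.
    val archiveQuery = source
      .selectExpr("CAST(data AS STRING) AS value")
      .writeStream
      .format("parquet")
      .option("path", "s3a://my-bucket/archive/")
      .option("checkpointLocation", "s3a://my-bucket/checkpoints/archive/")
      .start()

    // Query 2: write processed data to a second sink (a file sink here,
    // since the final sink is not decided yet). The transformation is a
    // stand-in for the real processing.
    val processedQuery = source
      .selectExpr("upper(CAST(data AS STRING)) AS value")
      .writeStream
      .format("parquet")
      .option("path", "s3a://my-bucket/processed/")
      .option("checkpointLocation", "s3a://my-bucket/checkpoints/processed/")
      .start()

    // Block until either of the two queries terminates.
    spark.streams.awaitAnyTermination()
  }
}
```

Note that in this sketch each query is started separately with its own checkpoint location, which is what prompts the questions below.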
I'd like to understand whether there is any consistency guarantee across sinks, at least under certain circumstances:
- Can one of the sinks be way ahead of the other? Or do they consume data from the source at the same speed (since it's the same source)? Can they be made to run synchronously?
- If I (gracefully) stop the streaming application, will the data on the 2 sinks be consistent?
The reason is that I'd like to build a Kappa-like processing app, with the ability to suspend or shut down the streaming part when I want to reprocess some history. When I resume the streaming, I want to avoid reprocessing data that has already been processed (because it is already in the history), and to avoid missing data (e.g. data that had not been committed to the archive, and is then skipped as already processed when the streaming resumes).