For example, I'm using Kafka as a source with Structured Streaming.
Update: as per the documentation, one cannot set the `auto.offset.reset` param: https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#kafka-specific-configurations

Is there any specific reason why Structured Streaming does not let you set `auto.offset.reset` to `none`? I had hoped that setting would prevent data loss by default.
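For reference, here is a minimal sketch of the kind of source I mean (the broker address and the topic name `events` are made up). Per the linked page, passing the Kafka param directly is rejected when the query starts, and the source option `startingOffsets` is the sanctioned replacement:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-source").getOrCreate()

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // assumed local broker
  .option("subscribe", "events")                       // hypothetical topic
  // .option("kafka.auto.offset.reset", "none")  // not allowed: the source throws at start
  .option("startingOffsets", "earliest")         // Spark's stand-in for auto.offset.reset
  .load()
```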
Where, then, does Structured Streaming store the consumed offsets?
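My current understanding (happy to be corrected) is that they go into the query's checkpoint location rather than into Kafka's consumer-group offsets, roughly like this (paths are illustrative):

```scala
import org.apache.spark.sql.streaming.Trigger

val query = df.writeStream
  .format("parquet")
  .option("path", "/data/out")                       // hypothetical sink path
  .option("checkpointLocation", "/chk/events-query") // offsets are tracked here, not in Kafka
  .trigger(Trigger.ProcessingTime("30 seconds"))
  .start()
// After a few batches, /chk/events-query/offsets/ contains one file per batch
// recording the Kafka offsets that batch read, and /chk/events-query/commits/
// records which batches finished.
```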
Since the default value of `startingOffsets` is `latest`, what happens if I have to deploy new code or restart the application after a failure? If the value is `latest`, won't there be data loss?
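For completeness, the docs do let you pin exact per-partition starting offsets as JSON instead of accepting the `latest` default on a fresh start; the topic name and offset values below are invented:

```scala
val dfPinned = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  // Per-partition JSON: -2 means earliest, -1 means latest
  .option("startingOffsets", """{"events":{"0":23,"1":-2}}""")
  .load()
```

As far as I can tell, `startingOffsets` only matters the very first time a query starts; once a checkpoint exists, a restart resumes from the checkpointed offsets regardless of this option.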
Do I still have to set up something like checkpointing and write-ahead logs to handle failures?
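If it helps clarify what I'm asking: my reading is that the offsets write-ahead log is built into the checkpoint itself, and you can peek at what a batch recorded. The file path follows the sketch above, and the exact file contents are an assumption on my part:

```scala
import scala.io.Source

// Batch 0's write-ahead entry under the checkpoint from the earlier sketch
val offsetsFile = "/chk/events-query/offsets/0"
Source.fromFile(offsetsFile).getLines().foreach(println)
// Expected (assumption): a version header, some batch metadata, then the
// source offsets as JSON, e.g. {"events":{"0":1042,"1":998}}
```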
As per this link, even with checkpointing and write-ahead logs there can still be problems with recovery semantics after changes in a streaming query: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovery-semantics-after-changes-in-a-streaming-query
Can we deploy new changes with the same checkpoint location? When we tried this with the old Direct approach, it did not allow deploying new changes against the same checkpoint location.
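To make the last question concrete, this is the sort of change I mean. Per the recovery-semantics page above, changes to filters and projections are listed as allowed against the same checkpoint, while changes to the aggregation or output schema are not (the code is illustrative, reusing `df` from the first sketch):

```scala
import spark.implicits._

// Deployment 1
val v1 = df.selectExpr("CAST(value AS STRING) AS value")
  .filter($"value".isNotNull)

// Deployment 2, restarted against the same checkpointLocation:
// tightening a filter like this is an "allowed change" per the docs,
// whereas changing the aggregation or the output schema is not.
val v2 = df.selectExpr("CAST(value AS STRING) AS value")
  .filter($"value".isNotNull && !$"value".startsWith("debug"))
```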