
As an example, I'm using Kafka as a source with Structured Streaming.

Update: As per the documentation, one cannot set the auto.offset.reset parameter: https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#kafka-specific-configurations. Is there a specific reason why Structured Streaming does not leverage auto.offset.reset with the value none? I would hope that could prevent data loss by default.
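A minimal sketch of my setup (broker addresses and topic name are placeholders): per the documentation linked above, the `kafka.auto.offset.reset` option is not supported on the Kafka source, and `startingOffsets` is the option Spark provides instead:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("kafka-structured-streaming")
  .getOrCreate()

// Kafka source; startingOffsets replaces Kafka's auto.offset.reset,
// which the Kafka source does not accept (per the docs linked above).
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092") // placeholder brokers
  .option("subscribe", "my-topic")                                // placeholder topic
  .option("startingOffsets", "earliest")                          // or "latest" (the default)
  .load()
```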

Then where does Structured Streaming store the consumed offsets?

Since the default value of startingOffsets is latest, what happens if I have to deploy new code or restart the application after a failure? If the value is latest, won't there be data loss in that case?
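My understanding so far, which is what I'm trying to confirm: the consumed offsets are persisted under the query's checkpoint location (in its offsets/ and commits/ subdirectories), and on restart those take precedence over startingOffsets, so latest would only apply the very first time the query starts with an empty checkpoint. A sketch with a placeholder checkpoint path and a console sink:

```scala
// Placeholder checkpoint path and console sink, for illustration only.
val query = df.writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoints/my-query") // offsets/ and commits/ end up here
  .start()

query.awaitTermination()
```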

Do I still have to do something like checkpointing and write-ahead logs to handle failures?
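Related to the data-loss concern: as far as I can tell, the Kafka source also exposes a failOnDataLoss option (default true for streaming queries), which makes the query fail rather than silently skip records whose offsets are no longer available (e.g. after retention). A sketch with placeholder values:

```scala
val dfStrict = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092") // placeholder broker
  .option("subscribe", "my-topic")                   // placeholder topic
  .option("failOnDataLoss", "true")                  // fail the query instead of skipping missing offsets
  .load()
```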

As per this link, even with checkpointing and write-ahead logs there will be problems with recovery semantics: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovery-semantics-after-changes-in-a-streaming-query

Can we deploy new changes with the same checkpoint location? When we tried this with the old Direct approach, it did not allow us to deploy new changes with the same checkpoint location.

Possible duplicate of [How does Structured Streaming ensure exactly-once writing semantics for file sinks?](https://stackoverflow.com/questions/57704664/how-does-structured-streaming-ensure-exactly-once-writing-semantics-for-file-sin) – Jacek Laskowski Sep 09 '19 at 10:01
