Why does spark need both write ahead log and checkpoint?
Why can’t we only use checkpoint? What is the benefit of additionally using write ahead log?
What is the difference between the data stored in WAL and in checkpoint?
If you read around you will get the gist, but it is not that easy. Here goes, with a focus on Spark Structured Streaming.
Quoting from https://learn.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-structured-streaming-overview, for Spark Streaming (legacy) and Spark Structured Streaming:
Checkpointing and write-ahead logs
- To deliver resiliency and fault tolerance, Structured Streaming relies on checkpointing to ensure that stream processing can continue uninterrupted, even with node failures. In HDInsight, Spark creates checkpoints to durable storage, either Azure Storage or Data Lake Storage. These checkpoints store the progress information about the streaming query. A checkpoint thus helps build fault-tolerant and resilient Spark applications, surviving both driver and worker failures. Writes to the sink must be idempotent, however, because after a restart the same micro-batch may be processed more than once (see the configuration sketch after this list).
- In addition, Structured Streaming uses the Write-Ahead Log (WAL). The WAL captures ingested data that has been received, but not yet processed by a query. If a failure occurs and processing is restarted from the WAL, any events received from the source aren't lost.
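As a minimal sketch of where the checkpoint fits in, here is a self-contained query using the built-in rate source; the paths, app name, and source are illustrative assumptions, not anything from the question itself:

```scala
import org.apache.spark.sql.SparkSession

object CheckpointedStream {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("checkpoint-demo") // hypothetical app name
      .getOrCreate()

    // Built-in rate source, used here only to have a stream to run.
    val stream = spark.readStream
      .format("rate")
      .option("rowsPerSecond", 10)
      .load()

    // The checkpoint directory stores the query's progress: offsets,
    // state, and the commit log. Restarting with the same
    // checkpointLocation resumes the query from where it left off
    // instead of reprocessing from scratch.
    val query = stream.writeStream
      .format("parquet")
      .option("path", "/tmp/demo/output")                   // hypothetical path
      .option("checkpointLocation", "/tmp/demo/checkpoint") // hypothetical path
      .start()

    query.awaitTermination()
  }
}
```

Note that in Structured Streaming the WAL is internal to that checkpoint directory: before a micro-batch runs, the planned offsets are recorded under offsets/, and only after the batch completes is a matching entry written under commits/. A gap between the two on restart tells Spark which batch to re-run, which is exactly why the sink has to tolerate replays.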
Note that if you are using, say, Kafka, and recreate the topic (thus working with offsets outside of Spark Structured Streaming itself), the offsets stored in the checkpoint may no longer match what the broker holds, so there are extra considerations. Hence this link, which states it better than I can: https://dev.to/kevinwallimann/how-to-recover-from-a-kafka-topic-reset-in-spark-structured-streaming-3phd.
In addition, see this question: How to use kafka.group.id and checkpoints in spark 3.0 structured streaming to continue to read from Kafka where it left off after restart? A sketch of how the relevant source options interact follows.
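As a hedged sketch of those options (broker address, topic, group id, and paths below are illustrative assumptions):

```scala
import org.apache.spark.sql.SparkSession

object KafkaResume {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-resume-demo") // hypothetical app name
      .getOrCreate()

    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // hypothetical broker
      .option("subscribe", "events")                    // hypothetical topic
      // Only consulted when no checkpoint exists yet; afterwards the
      // offsets recorded in the checkpoint take precedence.
      .option("startingOffsets", "earliest")
      // Spark 3.0+: pass a consumer group id through to Kafka (e.g. for
      // ACLs). Spark still tracks offsets itself in the checkpoint,
      // not in Kafka's consumer-group offsets.
      .option("kafka.group.id", "my-group")             // hypothetical group id
      // After a topic reset, checkpointed offsets may no longer exist
      // on the broker; "false" logs and continues rather than failing
      // the query.
      .option("failOnDataLoss", "false")
      .load()

    val query = df.writeStream
      .format("console")
      .option("checkpointLocation", "/tmp/kafka-demo/checkpoint") // hypothetical path
      .start()

    query.awaitTermination()
  }
}
```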
In summary: it is not that simple; most guides are either too simplistic or do not treat the topic in its entirety. Hope this helps.