I am trying to understand the content of a checkpoint and corresponding recovery; understanding the process of checkpointing is obviously the natural way of going about it and so I went over the following list:
I still am struggling to understand what goes and sits on the disk at the end of a checkpoint.
My understanding of Spark Checkpointing:
If you have really long DAGs and your spark cluster fails, checkpointing helps by persisting intermediate state e.g. to HDFS. So, a DAG of 50 transformations can be reduced to 4-5 transformations with the help of checkpointing. It breaks the DAG though.
Checkpointing in Streaming
My Spark Streaming job has a microbatch of 5 seconds. As I understand, a new job is submitted by the JobScheduler every 5 secs that invokes the JobGenerator to generate the RDD DAG for the new microbatch from the DStreamGraph, while the receiver in the meantime keeps collecting the data for the next new microbatch for the next cycle. If I enable checkpointing, as I understand, it will periodically keep checkpointing the "current state".
Question:
What is this "state"? Is this the combination of the base RDD and the state of the operators/transformations of the DAG for the present microbatch only? So I have the following:
ubatch 0 at T=0 ----> SUCCESS ubatch 1 at T=5 ----> SUCCESS ubatch 2 at T=10 ---> SUCCESS --------------------> Checkpointing kicks in now at T=12 ubatch 3 at T=15 ---> SUCCESS ubatch 4 at T=20 --------------------> Spark Cluster DOWN at T=23 => ubatch 4 FAILS!!! ... --------------------> Spark Cluster is restarted at *T=100*
What specifically goes and sits on the disk as a result of checkpointing at T=12? Will it just store the present state of operators of the DAG for ubatch 2?
a. If yes, then during recovery at T=100, the last checkpoint available is at T=12. What happens to the ubatch 3 at T=15 which was already processed successfully. Does the application reprocess ubatch 3 and handle idempotency here? If yes, do we go to the streaming source e.g. Kafka and rewind the offset to be able to replay the contents starting from the ubatch 3?
b. If no, then what exactly goes into the checkpoint directory at T=12?