There are a few postings on this topic, but none of them answer the question as I understand it.
The following code was run on Databricks:
spark.sparkContext.setCheckpointDir("/dbfs/FileStore/checkpoint/cp1/loc7")
val checkpointDir = spark.sparkContext.getCheckpointDir.get
val ds = spark.range(10).repartition(2)
ds.cache()
ds.checkpoint()
ds.count()
ds.rdd.isCheckpointed
I then added an improvement of sorts:
...
val ds2 = ds.checkpoint(eager=true)
println(ds2.queryExecution.toRdd.toDebugString)
...
which returns:
(2) MapPartitionsRDD[307] at toRdd at command-1835760423149753:13 []
| MapPartitionsRDD[305] at checkpoint at command-1835760423149753:12 []
| ReliableCheckpointRDD[306] at checkpoint at command-1835760423149753:12 []
checkpointDir: String = dbfs:/dbfs/FileStore/checkpoint/cp1/loc10/86cc77b5-27c3-4049-9136-503ddcab0fa9
ds: org.apache.spark.sql.Dataset[Long] = [id: bigint]
ds2: org.apache.spark.sql.Dataset[Long] = [id: bigint]
res53: Boolean = false
Question 1:
ds.rdd.isCheckpointed and ds2.rdd.isCheckpointed both return false, even though the count() means evaluation is no longer lazy. Why, in particular when the .../loc7 and .../loc10 directories are written with (part) files? And we can even see the ReliableCheckpointRDD in the debug string!
The whole concept does not seem to be well explained anywhere; I am trying to sort it out.
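To make the observation reproducible outside Databricks, here is a minimal sketch of the same experiment; the local-mode session and the /tmp checkpoint path are my stand-ins for the cluster setup above, not the original environment:

```scala
import org.apache.spark.sql.SparkSession

// Local-mode session standing in for the Databricks cluster (my assumption).
val spark = SparkSession.builder().master("local[2]").appName("cp-repro").getOrCreate()

// Hypothetical local checkpoint directory, not the DBFS path above.
spark.sparkContext.setCheckpointDir("/tmp/checkpoint/repro")

val ds  = spark.range(10).repartition(2)
val ds2 = ds.checkpoint(eager = true)  // note: checkpoint returns a NEW Dataset

// Both report false, even though part files now exist under the checkpoint dir.
println(ds.rdd.isCheckpointed)   // false
println(ds2.rdd.isCheckpointed)  // false

// Yet the ReliableCheckpointRDD shows up in the physical plan's lineage.
println(ds2.queryExecution.toRdd.toDebugString)
```

This reproduces the same false/false result and the same ReliableCheckpointRDD line in the debug string as the Databricks run.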
Question 2 - secondary question:
Is the cache really necessary with the latest versions of Spark (2.4)? If ds is not cached, will a new branch on it cause re-computation, or is that handled better now? It seems odd that the checkpoint data would not be used, or should we say that Spark does not really know which is better?
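Concretely, the branching scenario I mean looks like the sketch below; the local session, paths, and variable names are illustrative assumptions, not the cluster setup above:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Local-mode session; paths and names here are illustrative only.
val spark = SparkSession.builder().master("local[2]").appName("branch-sketch").getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/checkpoint/branches")  // hypothetical path

val ds = spark.range(10).repartition(2)
ds.cache()                            // the variant in question: is this needed?
val cp = ds.checkpoint(eager = true)  // part files are written eagerly

// Two separate actions branching off the same checkpointed Dataset.
// The question is whether each branch re-reads the checkpoint files,
// hits the cache, or recomputes the full lineage.
val evens = cp.filter(col("id") % 2 === 0)
val big   = cp.filter(col("id") > 5)

println(evens.count())  // 5
println(big.count())    // 4
```

The counts themselves are trivial; the interesting part is which source (cache, checkpoint files, or recomputed lineage) each branch reads from.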
From High Performance Spark I get the mixed impression that checkpointing is not really recommended, but then again it is.