In my Spark application, I read a few Hive tables into RDDs and then perform several transformations on those RDDs. To avoid recomputation, I persisted those RDDs using the rdd.cache() or rdd.persist() methods and the rdd.checkpoint() method.
Based on the Spark documentation and online references, I was under the impression that checkpointing is costlier than caching: caching keeps the RDD lineage and checkpointing breaks it, but checkpointing also has to write to and read from HDFS.
The strange thing I observed in my case is that the checkpointing stage is faster (nearly 2 times) than caching/persisting (memory only). I ran the job multiple times and the results were similar.
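For reference, here is a minimal sketch of the kind of comparison described above. It is not my exact job: the table name my_db.my_table, the checkpoint directory hdfs:///tmp/spark-checkpoints, the map step, and the time helper are all placeholders chosen for illustration. It assumes a Spark build with Hive support.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CacheVsCheckpoint {
  // Hypothetical helper: prints the wall-clock time of a block.
  def time[T](label: String)(body: => T): T = {
    val start = System.nanoTime()
    val result = body
    println(f"$label: ${(System.nanoTime() - start) / 1e6}%.1f ms")
    result
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cache-vs-checkpoint")
      .enableHiveSupport()
      .getOrCreate()
    val sc = spark.sparkContext

    // Placeholder location; must be set before checkpoint() is called.
    sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

    // Placeholder Hive table and transformation.
    val base = spark.table("my_db.my_table").rdd

    // Variant 1: memory-only persistence. The lineage is kept.
    val cached = base.map(_.mkString(",")).persist(StorageLevel.MEMORY_ONLY)
    time("persist(MEMORY_ONLY) + count") { cached.count() }
    println(cached.toDebugString)

    // Variant 2: checkpointing. checkpoint() only marks the RDD; the data is
    // written to the checkpoint directory when the first action runs, and the
    // lineage is truncated afterwards.
    val checkpointed = base.map(_.mkString(","))
    checkpointed.checkpoint()
    time("checkpoint + count") { checkpointed.count() }
    println(s"isCheckpointed = ${checkpointed.isCheckpointed}")
    println(checkpointed.toDebugString)
  }
}
```

Both cache()/persist() and checkpoint() are lazy, so each variant is timed around the first action, and the toDebugString output shows whether the lineage was kept or cut.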
I am not able to understand why this is happening. Any help would be appreciated.