4

In my Spark application, I read a few Hive tables into Spark RDDs and then perform some transformations on those RDDs. To avoid recomputation, I cached those RDDs using the rdd.cache() / rdd.persist() and rdd.checkpoint() methods.

Per the Spark documentation and online references, I was under the impression that checkpointing is costlier than caching: caching keeps the RDD lineage while checkpointing breaks it, but checkpointing also writes to and reads from HDFS.

The strange thing I observed in my case is that the checkpointing stage is faster (nearly 2x) than caching/persisting (memory only). I ran it multiple times and the results were similar.

I am not able to understand why this is happening. Any help would be appreciated.

fdelafuente
Nachiket Kate

1 Answer

1

I have run similar benchmarks lately and am seeing the same thing: checkpoint() is faster despite the extra I/O. My explanation is that maintaining the whole lineage is a costly operation.
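To make that intuition concrete, here is a toy, pure-Python model of the trade-off (not Spark itself): a "lineage" is a chain of transformations replayed from the source on every access, while a "checkpoint" materializes the current result to disk and drops the chain, so later reads skip the recomputation:

```python
import os
import pickle
import tempfile

class ToyRDD:
    """Toy illustration only: lineage replay vs. checkpointed reads."""
    def __init__(self, source, lineage=None):
        self.source = source
        self.lineage = lineage or []   # transformations applied in order
        self.checkpoint_file = None

    def map(self, f):
        # Each transformation extends the lineage instead of computing eagerly.
        return ToyRDD(self.source, self.lineage + [lambda data: [f(x) for x in data]])

    def checkpoint(self, path):
        # Materialize the current result to disk and truncate the lineage,
        # mirroring what Spark's checkpoint() does after the next action.
        with open(path, "wb") as fh:
            pickle.dump(self.collect(), fh)
        self.checkpoint_file = path
        self.lineage = []

    def collect(self):
        if self.checkpoint_file:
            with open(self.checkpoint_file, "rb") as fh:
                data = pickle.load(fh)
        else:
            data = self.source
        for step in self.lineage:
            data = step(data)
        return data

rdd = ToyRDD(list(range(5))).map(lambda x: x + 1).map(lambda x: x * 10)
before = rdd.collect()   # replays the whole lineage: [10, 20, 30, 40, 50]
path = os.path.join(tempfile.mkdtemp(), "ckpt.pkl")
rdd.checkpoint(path)
after = rdd.collect()    # read back from disk, no recomputation
```

With a long or expensive lineage, each access before the checkpoint pays the full replay cost, while each access after it is a single read, which is consistent with checkpoint() winning in the benchmarks above.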

I ran benchmarks on 1, 10, 100, 1,000, 10,000, 100,000, 1,000,000, 2,000,000, 10 million, and more records, and checkpoint was always faster. The lineage was fairly simple (record filtering, then a couple of aggregations). Storage was local NVMe drives (not block storage over the network). I suspect it really depends on a lot of criteria.

jgp