I found a similar topic: Understanding Spark's caching
but it still doesn't answer my question exactly. Consider the snippets of code below.
Option A:
rdd1 = sc.textFile()
rdd1.cache()
rdd2 = rdd1.map().partitionBy()
rdd3 = rdd1.reduceByKey().map()
rdd2.cache()
rdd1.unpersist()  # unpersist before the action on rdd2
data = rdd2.collect()
Option B:
rdd1 = sc.textFile()
rdd1.cache()
rdd2 = rdd1.map().partitionBy()
rdd3 = rdd1.reduceByKey().map()
rdd2.cache()
data = rdd2.collect()
rdd1.unpersist()  # unpersist after the action on rdd2
Which option should I choose to prevent rdd1 from being recomputed? At first glance, Option A looks fine, but keeping in mind that operations in Spark are lazy, I think that calling unpersist before running an action on rdd2 may force rdd1 to be recomputed. On the other hand, delaying unpersist until after the action, as in Option B, may leave no free space to cache rdd2. Please help me choose which option I should use.
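To make the comparison concrete, here is a minimal runnable sketch of both orderings. It assumes a local-mode SparkContext with made-up pair data; the lambdas are placeholders of my own for the empty arguments in the snippets above, not the real workload.

from pyspark import SparkContext

sc = SparkContext("local[*]", "unpersist-ordering")

def build():
    # Made-up pair data standing in for sc.textFile(); cached like rdd1 above.
    rdd1 = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
    rdd1.cache()
    rdd2 = rdd1.map(lambda kv: (kv[0], kv[1] * 10)).partitionBy(2)
    rdd3 = rdd1.reduceByKey(lambda x, y: x + y).map(lambda kv: kv[1])
    rdd2.cache()
    return rdd1, rdd2, rdd3

# Option A ordering: unpersist rdd1 before the first action on rdd2.
rdd1, rdd2, rdd3 = build()
rdd1.unpersist()
data_a = rdd2.collect()  # rdd2 was never materialized, so its lineage reads rdd1 again

# Option B ordering: run the action first, then unpersist rdd1.
rdd1, rdd2, rdd3 = build()
data_b = rdd2.collect()  # rdd1 is still cached while rdd2 is being computed
rdd1.unpersist()

print(sorted(data_a) == sorted(data_b))  # same result either way; only the caching differs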