In a method, I create a new RDD and cache it. Will Spark unpersist the RDD automatically once the RDD goes out of scope?
I think it will, but what actually happens?
No, it won't be unpersisted automatically.
Why not? It may look to you as if the RDD is not needed anymore, but Spark's model is to not materialize an RDD until it is actually needed (i.e., until an action forces evaluation), so it is very hard to say "I won't need this RDD anymore". Even for you it can be tricky, because of situations like the following:
JavaRDD<T> rddUnion = sc.parallelize(new ArrayList<T>()); // create an empty RDD to merge into
for (int i = 0; i < 10; i++)
{
    JavaRDD<T2> rdd = sc.textFile(inputFileNames[i]);
    rdd.cache(); // Since it will be used twice, cache it.
    rdd.map(...).filter(...).saveAsTextFile(outputFileNames[i]); // Transform and save; rdd materializes here.
    rddUnion = rddUnion.union(rdd.map(...).filter(...)); // Another transform to T, merged by union.
    rdd.unpersist(); // Now it seems no longer needed. (But it actually is.)
}
// Here rddUnion actually materializes and needs all 10 rdds that were already unpersisted,
// so all 10 rdds will be rebuilt.
rddUnion.saveAsTextFile(mergedFileName);
Credit for the code sample goes to the spark-user mailing list.
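If you do want to unpersist explicitly, one way to avoid the rebuild is to delay the unpersist() calls until after rddUnion has been written out. The following is only a minimal sketch of that idea, not code from the thread: the input/output paths and the trim/toUpperCase/startsWith transformations are made-up placeholders standing in for the "..." above.

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class UnpersistAfterAction {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[*]", "unpersist-example");

        // Hypothetical input/output paths, for illustration only.
        String[] inputFileNames = new String[10];
        String[] outputFileNames = new String[10];
        for (int i = 0; i < 10; i++) {
            inputFileNames[i] = "input-" + i + ".txt";
            outputFileNames[i] = "output-" + i;
        }

        List<JavaRDD<String>> cached = new ArrayList<>();
        JavaRDD<String> rddUnion = sc.parallelize(new ArrayList<String>());

        for (int i = 0; i < 10; i++) {
            JavaRDD<String> rdd = sc.textFile(inputFileNames[i]);
            rdd.cache();       // used twice below, so cache it
            cached.add(rdd);   // remember it so it can be unpersisted later

            // Placeholder transformations standing in for the "..." in the sample above.
            rdd.map(String::trim)
               .filter(s -> !s.isEmpty())
               .saveAsTextFile(outputFileNames[i]);

            rddUnion = rddUnion.union(rdd.map(String::toUpperCase)
                                         .filter(s -> s.startsWith("A")));
        }

        // Materialize rddUnion first; it still depends on the cached rdds.
        rddUnion.saveAsTextFile("merged-output");

        // Only now is it safe to drop the cached blocks.
        for (JavaRDD<String> rdd : cached) {
            rdd.unpersist();
        }

        sc.stop();
    }
}

The trade-off is that all 10 cached rdds stay in memory until the union has been saved, but none of them has to be recomputed.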