
In a method, I create a new RDD and cache it. Will Spark unpersist the RDD automatically once the RDD goes out of scope?

I think it will, but what actually happens?

om-nom-nom
jeffery.yuan
  • Similar points discussed here: [Spark RDD - is partition(s) always in RAM?](https://stackoverflow.com/a/40733821/1592191) – mrsrinivas Aug 25 '18 at 04:05

1 Answer


No, it won't be unpersisted automatically.

Why? Because even if it looks to you as though the RDD is no longer needed, Spark's model is to not materialize an RDD until it is needed for a transformation, so it is actually very hard to say "I won't need this RDD anymore". Even for you it can be very tricky, because of situations like the following:

JavaRDD<T> rddUnion = sc.parallelize(new ArrayList<T>()); // Create an empty RDD for merging.
for (int i = 0; i < 10; i++)
{
  JavaRDD<T2> rdd = sc.textFile(inputFileNames[i]);
  rdd.cache(); // Since it will be used twice, cache it.
  rdd.map(...).filter(...).saveAsTextFile(outputFileNames[i]); // Transform and save; rdd materializes here.
  rddUnion = rddUnion.union(rdd.map(...).filter(...)); // Do another transform to T and merge by union.
  rdd.unpersist(); // Now it seems no longer needed. (But it is actually still needed.)
}

// Here rddUnion actually materializes and needs all 10 RDDs that were already unpersisted,
// so all 10 RDDs will be rebuilt.
rddUnion.saveAsTextFile(mergedFileName);

Credit for the code sample goes to the spark-user mailing list.
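
One way to avoid the rebuild in the example above (a sketch only, reusing the same hypothetical inputFileNames, outputFileNames and mergedFileName variables and the elided map/filter functions) is to keep a handle on each cached RDD and unpersist only after the union has actually been written out:

List<JavaRDD<T2>> cachedRdds = new ArrayList<>(); // Keep handles so we can unpersist later.
JavaRDD<T> rddUnion = sc.parallelize(new ArrayList<T>()); // Create an empty RDD for merging.
for (int i = 0; i < 10; i++)
{
  JavaRDD<T2> rdd = sc.textFile(inputFileNames[i]);
  rdd.cache(); // Still cached, because it is used twice.
  rdd.map(...).filter(...).saveAsTextFile(outputFileNames[i]);
  rddUnion = rddUnion.union(rdd.map(...).filter(...));
  cachedRdds.add(rdd); // Do not unpersist yet.
}

rddUnion.saveAsTextFile(mergedFileName); // All 10 RDDs are still cached here, so nothing is rebuilt.

for (JavaRDD<T2> rdd : cachedRdds)
{
  rdd.unpersist(); // Safe now: every action that depends on the cached data has already run.
}

The trade-off is that all 10 RDDs stay cached until the merged output is written; if memory is tight, another option is to cache rddUnion itself, force it with an action, and only then unpersist the individual inputs.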

C4stor
  • Hi @C4stor, thanks for your answer, but I checked https://github.com/apache/spark/pull/126 and ContextCleaner.scala, and it seems Spark does perform some automatic RDD cleanup. So I'm not sure how and when Spark decides it is safe to unpersist an RDD. – jeffery.yuan Apr 25 '15 at 00:18
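
For context on the automatic cleanup mentioned in the comment above: Spark's ContextCleaner only unpersists an RDD after the driver-side RDD object has been garbage collected, and it does so asynchronously, so cleanup is tied to JVM garbage collection rather than to lexical scope. Below is a minimal sketch of how one might observe this, assuming an existing JavaSparkContext named sc, a hypothetical input path, and Spark 1.3+ where JavaSparkContext.getPersistentRDDs() is available:

JavaRDD<String> lines = sc.textFile("hdfs:///tmp/input.txt"); // Hypothetical path.
lines.cache();
lines.count(); // Materialize the cached RDD.
System.out.println(sc.getPersistentRDDs().size()); // 1: the RDD is tracked as persistent.

lines = null;  // Drop the only strong driver-side reference to the RDD.
System.gc();   // Suggest a GC; the ContextCleaner tracks RDDs via weak references.

// Some time later, the ContextCleaner may unpersist the RDD asynchronously, after which
// sc.getPersistentRDDs() would no longer contain it. Because the timing depends on garbage
// collection, an explicit unpersist() remains the predictable way to free cached data.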