In a method, I create a new RDD and cache it. Will Spark unpersist the RDD automatically once the RDD goes out of scope?
I think it will, but what actually happens?
No, it won't be unpersisted automatically.
Why not? It may look to you as if the RDD is not needed anymore, but Spark's model is to not materialize an RDD until it is actually needed (i.e., until an action forces evaluation), so it is very hard to say "I won't need this RDD anymore". Even for you it can be tricky, because of situations like the following:
JavaRDD<T> rddUnion = sc.parallelize(new ArrayList<T>()); // create an empty RDD to merge into
for (int i = 0; i < 10; i++)
{
    JavaRDD<T2> rdd = sc.textFile(inputFileNames[i]);
    rdd.cache(); // Since it will be used twice, cache it.
    rdd.map(...).filter(...).saveAsTextFile(outputFileNames[i]); // Transform and save; rdd materializes here.
    rddUnion = rddUnion.union(rdd.map(...).filter(...)); // Another transform to T, merged by union.
    rdd.unpersist(); // Now it seems no longer needed. (But it actually is.)
}
// Here rddUnion actually materializes and needs all 10 rdds that were already unpersisted,
// so all 10 rdds will be rebuilt.
rddUnion.saveAsTextFile(mergedFileName);
Credit for the code sample goes to the spark-user mailing list.
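If you do want to unpersist explicitly, one way to avoid the rebuild is to delay the unpersist() calls until after rddUnion has been written out. The following is only a minimal sketch of that idea, not code from the thread: the input/output paths and the trim/toUpperCase/startsWith transformations are made-up placeholders standing in for the "..." above.

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class UnpersistAfterAction {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[*]", "unpersist-example");

        // Hypothetical input/output paths, for illustration only.
        String[] inputFileNames = new String[10];
        String[] outputFileNames = new String[10];
        for (int i = 0; i < 10; i++) {
            inputFileNames[i] = "input-" + i + ".txt";
            outputFileNames[i] = "output-" + i;
        }

        List<JavaRDD<String>> cached = new ArrayList<>();
        JavaRDD<String> rddUnion = sc.parallelize(new ArrayList<String>());

        for (int i = 0; i < 10; i++) {
            JavaRDD<String> rdd = sc.textFile(inputFileNames[i]);
            rdd.cache();       // used twice below, so cache it
            cached.add(rdd);   // remember it so it can be unpersisted later

            // Placeholder transformations standing in for the "..." in the sample above.
            rdd.map(String::trim)
               .filter(s -> !s.isEmpty())
               .saveAsTextFile(outputFileNames[i]);

            rddUnion = rddUnion.union(rdd.map(String::toUpperCase)
                                         .filter(s -> s.startsWith("A")));
        }

        // Materialize rddUnion first; it still depends on the cached rdds.
        rddUnion.saveAsTextFile("merged-output");

        // Only now is it safe to drop the cached blocks.
        for (JavaRDD<String> rdd : cached) {
            rdd.unpersist();
        }

        sc.stop();
    }
}

The trade-off is that all 10 cached rdds stay in memory until the union has been saved, but none of them has to be recomputed.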