Spark non-cached RDD Eviction from memory

Question

From this post How long does RDD remain in memory?, I Would like to know based on the below:

An RDD is an object just like any other. If you don't persist/cache it, it will act as any other object under a managed language would and be collected once there are no alive root objects pointing to it?

What is meant exactly by once there are no alive root objects pointing to it?

E.g. when the Action has been completed?
Or if the transforms have been executed successfully?

I read as much as I could find, but find there is always an open issue in my mind. The well-known expert's response leave a lingering doubt in my mind that I am unable to evict.

The When does a RDD lineage is created? How to find lineage graph? example is great, re-presented here:

val nums = sc.parallelize(0 to 9)
scala> nums.toDebugString
res0: String = (8) ParallelCollectionRDD[0] at parallelize at <console>:24 []

val doubles = nums.map(_ * 2)
scala> doubles.toDebugString
res1: String =
(8) MapPartitionsRDD[1] at map at <console>:25 []
 |  ParallelCollectionRDD[0] at parallelize at <console>:24 []

val groups = doubles.groupBy(_ < 10)
scala> groups.toDebugString
res2: String =
(8) ShuffledRDD[3] at groupBy at <console>:25 []
 +-(8) MapPartitionsRDD[2] at groupBy at <console>:25 []
    |  MapPartitionsRDD[1] at map at <console>:25 []
    |  ParallelCollectionRDD[0] at parallelize at <console>:24 []

Assuming that each transform takes a lengthy period for actual execution, then when can ... RDD[0] be evicted? The earliest point in time, that is. The point is that ...RDD[0] is a parent to ...RDD[1..N] or a parent to all such objects? I state this as I found such a statement elsewhere.

I do not think it is a duplicate it is seeking a clarification on the statement indicated.

My interpretation is that the term root object implies that RDD[0] cannot be subject to garbage collection until an Action has occurred or a cache or checkpoint in the Action DAG path has taken place. Seeking verification on this. The sentence for me on what the root object is, is now unclear. I would have thought the root objects are the earlier RDDs in the chain.

See also https://stackoverflow.com/questions/69597745/when-was-automatic-spark-rdd-partition-cache-eviction-implemented — samthebest, Jul 28 '22 at 14:09

score 4 · Accepted Answer · answered Apr 28 '19 at 12:26

4

An RDD have memory footprints of different kinds:

1) it consumes memory on a driver (as a regular object)

2) information about this RDD is allocated on workers

3) if an RDD is cached it may allocate additional space on workers

When an RDD becomes unreachable in terms of (1) it cleanup of (2) and (3) is triggered via ContextCleaner. So we are talking only about (1).

It doesn't matter at all whether the RDD is cached or not. Performing actions like count/collect doesn't matter as well. RDD just dies as a regular java object when you leave the scope where this RDD is visible.

In your particular example RDD1 depends on RDD0 so the latter will not be evicted unless the former is evicted. And RDD1 will be evicted only after RDD2 which will be evicted only after RDD3. And to unlock RDD3 for the garbage collector you must (vaguely speaking) leave the method where you use it.

answered Apr 28 '19 at 12:26

simpadjo

3,947
1
13
38

Thanks. Well, I meant the partitions of the RDD which implies the RDD and also non-cached. ContextCleaner was clear as well. BUT: How do you leave it then? VAGUE as you state: Exit the Spark App? Or, rather Action is completed? The term root object is thus not the object RDD0, but in the/my example it is RDD3. This means that if this was a job of 3 x 10 minutes, then at the 30 min mark the RDD0 can be considered for release - the end of the app which is the end of the Action, unless those other conditions I stated. That's how I understood and I think you are stating the same as well. – thebluephantom Apr 28 '19 at 12:41
Regarding object roots: https://www.dynatrace.com/resources/ebooks/javabook/how-garbage-collection-works/ . – simpadjo Apr 28 '19 at 12:58
Yes, in your case RDD0 is needed unless the action is completed (think about what happens if a stage failed and retry is required). In worse case it will be collected in the end of your application. But you can always help GC by extracting pieces of code into smaller methods. – simpadjo Apr 28 '19 at 13:01
That was exactly how I viewed it, so you confirm that, otherwise resilience fails of course. But your latter point is quite interesting as I tend to write inline. Means I am still up with the play, but the term threw me. – thebluephantom Apr 28 '19 at 13:04
The term root object is a little misleading imho. – thebluephantom Apr 28 '19 at 14:11
Hm, as for me it's quite accurate – simpadjo Apr 28 '19 at 14:35
Root cause is from ghe beginning where I come from. But that is downunder. – thebluephantom Apr 28 '19 at 14:51

Spark non-cached RDD Eviction from memory

1 Answers1