Following on from the post "How long does RDD remain in memory?", I would like a clarification of the following statement:
An RDD is an object just like any other. If you don't persist/cache it, it will act as any other object under a managed language would and be collected once there are no alive root objects pointing to it.
What exactly is meant by "once there are no alive root objects pointing to it"?
- E.g. when the Action has been completed?
- Or if the transforms have been executed successfully?
I have read as much as I could find, but one issue remains open in my mind. The well-known expert's response leaves a lingering doubt that I am unable to evict.
The example from "When does a RDD lineage is created? How to find lineage graph?" is great; it is reproduced here:
val nums = sc.parallelize(0 to 9)
scala> nums.toDebugString
res0: String = (8) ParallelCollectionRDD[0] at parallelize at <console>:24 []
val doubles = nums.map(_ * 2)
scala> doubles.toDebugString
res1: String =
(8) MapPartitionsRDD[1] at map at <console>:25 []
 |  ParallelCollectionRDD[0] at parallelize at <console>:24 []
val groups = doubles.groupBy(_ < 10)
scala> groups.toDebugString
res2: String =
(8) ShuffledRDD[3] at groupBy at <console>:25 []
 +-(8) MapPartitionsRDD[2] at groupBy at <console>:25 []
    |  MapPartitionsRDD[1] at map at <console>:25 []
    |  ParallelCollectionRDD[0] at parallelize at <console>:24 []
Assuming that each transformation takes a long time to actually execute, what is the earliest point in time at which ...RDD[0] can be evicted? The point is: is ...RDD[0] a parent of ...RDD[1..N], that is, of all such downstream objects? I ask because I found a statement to that effect elsewhere.
I do not think this is a duplicate; it seeks clarification of the statement quoted above.
My interpretation is that the term "root object" implies that RDD[0] cannot be garbage collected until an Action has occurred, or until a cache or checkpoint on the Action's DAG path has taken place. I am seeking verification of this. What the sentence means by "root object" is now unclear to me: I would have thought the root objects are the earlier RDDs in the chain.
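To make the doubt concrete, here is a minimal sketch of my own (not taken from either post) of the reference I have in mind. It assumes a local Spark session; the object name LineageRefs and the app name are made up for illustration. As far as I understand it, each RDD exposes its parents through dependencies, so doubles itself would keep nums reachable for as long as doubles is reachable:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical stand-alone driver; in spark-shell only the middle part is needed.
object LineageRefs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")          // assumption: local mode, just for the sketch
      .appName("lineage-refs")     // made-up application name
      .getOrCreate()
    val sc = spark.sparkContext

    val nums = sc.parallelize(0 to 9)
    val doubles = nums.map(_ * 2)

    // Each RDD keeps a list of dependencies, and each dependency
    // holds a reference to the parent RDD it was derived from.
    val parent = doubles.dependencies.head.rdd

    // Reference equality: is the parent the very same object as nums?
    println(parent eq nums)

    spark.stop()
  }
}
```

If that reading is right, the child RDDs point back at the earlier ones rather than the other way round, which is exactly why the phrase "root objects" confuses me.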