Referring to https://spark.apache.org/docs/1.6.2/programming-guide.html#performance-impact:

> Shuffle also generates a large number of intermediate files on disk. As of Spark 1.3, these files are preserved until the corresponding RDDs are no longer used and are garbage collected. This is done so the shuffle files don't need to be re-created if the lineage is re-computed.
I understand why these files are retained. However, I can't figure out whether these intermediate files are shared between jobs.

My experiments show that these shuffle files are NOT shared between jobs. Can anyone confirm?
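For reference, this is roughly how I looked for the shuffle files on disk. It's only a sketch: it assumes the default `spark.local.dir` of `/tmp`, where the executors' `blockmgr-*` directories hold the `shuffle_*.data`/`shuffle_*.index` files:

```scala
import java.io.File

// Recursively collect shuffle_* files under a block manager directory.
def shuffleFiles(dir: File): Seq[File] = {
  val children = Option(dir.listFiles()).map(_.toSeq).getOrElse(Seq.empty[File])
  children.filter(_.isDirectory).flatMap(shuffleFiles) ++
    children.filter(f => f.isFile && f.getName.startsWith("shuffle_"))
}

// Assumes spark.local.dir = /tmp (the default); adjust if you override it.
new File("/tmp").listFiles()
  .filter(d => d.isDirectory && d.getName.startsWith("blockmgr-"))
  .flatMap(shuffleFiles)
  .foreach(f => println(f.getPath))
```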
The scenario I am talking about:

```scala
val rdd1 = sc.textFile(...).map(...)  // mapped to a (key, value) pair RDD so join is defined
val rdd2 = sc.textFile(...).map(...)
val rdd3 = rdd1.join(rdd2)
// at this point a shuffle takes place

// Now, if I do the same join again:
val rdd4 = rdd1.join(rdd2)
// will the shuffle files be reused? I think I've got the answer,
// which is no, since the RDDs do not share lineage.
```
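To make the scenario concrete, here is a self-contained sketch of how I drive it. The input paths and the keying of lines by their first comma-separated field are placeholders I made up so the join compiles; I then compare the two jobs in the Spark UI, where reused shuffle output would show the map stages as "skipped":

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ShuffleReuseCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("shuffle-reuse-check"))

    // Placeholder inputs; keying by the first field is an assumption,
    // only there so join (which needs (K, V) pairs) compiles.
    val rdd1 = sc.textFile("/path/to/input1").map(l => (l.split(",")(0), l))
    val rdd2 = sc.textFile("/path/to/input2").map(l => (l.split(",")(0), l))

    val rdd3 = rdd1.join(rdd2)
    rdd3.count() // job 1: the join shuffles rdd1 and rdd2, writing shuffle files

    val rdd4 = rdd1.join(rdd2) // same inputs, but a brand-new join
    rdd4.count() // job 2: in the Spark UI, are its map stages "skipped"
                 // (shuffle files reused) or executed again?

    sc.stop()
  }
}
```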