Shuffle intermediate files shared between jobs?

Question

Referring to https://spark.apache.org/docs/1.6.2/programming-guide.html#performance-impact

Shuffle also generates a large number of intermediate files on disk. As of Spark 1.3, these files are preserved until the corresponding RDDs are no longer used and are garbage collected. This is done so the shuffle files don’t need to be re-created if the lineage is re-computed.

I understand why these files are retained. However, I can't seem to figure out whether these intermediate files are shared between jobs.

My experiments show that these shuffle files are NOT shared between jobs. Can anyone confirm?

The scenario I am talking about:

```

val rdd1 = sc.text...
val rdd2 = sc.text...

val rdd3 = rdd1.join(rdd2)
// at this point shuffle takes place

// Now, if I do this again:
val rdd4 = rdd1.join(rdd2)
// will the shuffle files be reused? I think I've got the answer, which is no, since the RDDs do not share the lineage

```

score 2 · Answer 1 · answered Jun 08 '18 at 09:44

Between jobs - yes. That's the whole purpose of preserving shuffle files (see What does "Stage Skipped" mean in Apache Spark web UI?). Consider the following session transcript:

```
scala> val rdd1 = sc.parallelize(Seq((1, None), (2, None)), 4)
rdd1: org.apache.spark.rdd.RDD[(Int, None.type)] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> val rdd2 = sc.parallelize(Seq((1, None), (2, None)), 4)
rdd2: org.apache.spark.rdd.RDD[(Int, None.type)] = ParallelCollectionRDD[1] at parallelize at <console>:24

scala> val rdd3 = rdd1.join(rdd2)
rdd3: org.apache.spark.rdd.RDD[(Int, (None.type, None.type))] = MapPartitionsRDD[4] at join at <console>:27

scala> rdd3.count  // First job
res0: Long = 2

scala> rdd3.foreach(_ => ())  // Second job
```

and the corresponding state of the Spark UI (screenshot not reproduced here), where the second job reports the shuffle stages as skipped.

Between applications - no. Shuffle files are discarded when the SparkContext is closed.
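For the scenario in the question, where `rdd1.join(rdd2)` is called twice, one way to check whether the two results could share shuffle output is to compare the shuffle IDs in their lineages. The following is a minimal sketch, assuming it is pasted into spark-shell (so `sc` already exists); `shuffleIds` is a hypothetical helper written for this illustration, not part of Spark. Shuffle output is tracked per shuffle ID, so if the two sets are disjoint, the second join has to write its own shuffle files.

```scala
// Run in spark-shell (use :paste), where `sc` is already defined.
import org.apache.spark.ShuffleDependency
import org.apache.spark.rdd.RDD

// Hypothetical helper (not a Spark API): walk the lineage of an RDD
// and collect the ID of every shuffle dependency found along the way.
def shuffleIds(rdd: RDD[_]): Set[Int] =
  rdd.dependencies.flatMap {
    case s: ShuffleDependency[_, _, _] => shuffleIds(s.rdd) + s.shuffleId
    case d                             => shuffleIds(d.rdd)
  }.toSet

val rdd1 = sc.parallelize(Seq((1, None), (2, None)), 4)
val rdd2 = sc.parallelize(Seq((1, None), (2, None)), 4)

// Two separate calls to join build two separate lineages.
val rdd3 = rdd1.join(rdd2)
val rdd4 = rdd1.join(rdd2)

val ids3 = shuffleIds(rdd3)
val ids4 = shuffleIds(rdd4)
println(s"rdd3 shuffles: $ids3, rdd4 shuffles: $ids4, shared: ${ids3 intersect ids4}")
```

If the shared set is empty, as expected here, nothing written for `rdd3` can be picked up by `rdd4`. Reuse only happens when later jobs run on the same RDD object (or on RDDs derived from it), as in the `rdd3.count` / `rdd3.foreach` transcript above.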

Hi, thanks for the answer. I do agree with you for this scenario. However, I've updated my question with the particular scenario I am talking about. — human, Jun 11 '18 at 23:37

score -3 · Answer 2 · answered Jun 08 '18 at 07:07

The shuffle files are meant for the stages within a job. Other jobs won't be able to use these shuffle files. So, as far as I know, no: shuffle files cannot be shared between jobs.

Vaibhav Kulkarni
