Referring to https://spark.apache.org/docs/1.6.2/programming-guide.html#performance-impact

> Shuffle also generates a large number of intermediate files on disk. As of Spark 1.3, these files are preserved until the corresponding RDDs are no longer used and are garbage collected. This is done so the shuffle files don’t need to be re-created if the lineage is re-computed.

I understand why these files are retained. However, I can't figure out whether these intermediate files are shared between jobs.

My experiments suggest that these shuffle files are NOT shared between jobs. Can anyone confirm?

The scenario I am talking about:

```
val rdd1 = sc.text...
val rdd2 = sc.text...

val rdd3 = rdd1.join(rdd2)
// at this point shuffle takes place

// Now, if I do this again:
val rdd4 = rdd1.join(rdd2)
// Will the shuffle files be reused? I think I've got the answer, which is no,
// since the two RDDs do not share lineage.

```
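One way to check this without the UI is to compare the shuffle IDs in the two lineages. The following is only a sketch under assumptions: it uses small `parallelize`d inputs in place of the elided `sc.text...` sources, runs in local mode, and `shuffleIds` is a hypothetical helper written for this illustration, not part of Spark. It relies on `RDD.dependencies` and `ShuffleDependency.shuffleId`.

```scala
import org.apache.spark.{ShuffleDependency, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ShuffleIdSketch {
  // Hypothetical helper: collect every shuffle ID found in an RDD's lineage.
  def shuffleIds(rdd: RDD[_]): Set[Int] =
    rdd.dependencies.flatMap {
      case s: ShuffleDependency[_, _, _] => shuffleIds(s.rdd) + s.shuffleId
      case d                             => shuffleIds(d.rdd)
    }.toSet

  def run(): (Set[Int], Set[Int]) = {
    // Assumption: local mode and tiny in-memory inputs, for illustration only.
    val conf = new SparkConf().setMaster("local[2]").setAppName("shuffle-id-sketch")
    val sc = new SparkContext(conf)
    try {
      val rdd1 = sc.parallelize(Seq((1, "a"), (2, "b")), 4)
      val rdd2 = sc.parallelize(Seq((1, "x"), (2, "y")), 4)

      val rdd3 = rdd1.join(rdd2) // first join
      val rdd4 = rdd1.join(rdd2) // second, independent join

      (shuffleIds(rdd3), shuffleIds(rdd4))
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = {
    val (first, second) = run()
    println(s"rdd3 shuffle IDs:   $first")
    println(s"rdd4 shuffle IDs:   $second")
    println(s"shared shuffle IDs: ${first intersect second}")
  }
}
```

If the two sets have no ID in common, the second join registers its own shuffles, which would be consistent with the shuffle files not being reused for `rdd4`.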

human

2 Answers

2
  • Between jobs: yes. That's the whole purpose of preserving shuffle files (see What does "Stage Skipped" mean in Apache Spark web UI?). Consider the following session transcript:

    scala> val rdd1 = sc.parallelize(Seq((1, None), (2, None)), 4)
    rdd1: org.apache.spark.rdd.RDD[(Int, None.type)] = ParallelCollectionRDD[0] at parallelize at <console>:24
    
    scala> val rdd2 = sc.parallelize(Seq((1, None), (2, None)), 4)
    rdd2: org.apache.spark.rdd.RDD[(Int, None.type)] = ParallelCollectionRDD[1] at parallelize at <console>:24
    
    scala> val rdd3 = rdd1.join(rdd2)
    rdd3: org.apache.spark.rdd.RDD[(Int, (None.type, None.type))] = MapPartitionsRDD[4] at join at <console>:27
    
    scala> rdd3.count  // First job
    res0: Long = 2
    
    scala> rdd3.foreach(_ => ())  // Second job
    

    and the corresponding state of the Spark UI:

    [Image: Spark UI screenshot]

  • Between applications: no. Shuffle files are discarded when the SparkContext is stopped.
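The reuse between jobs on the same RDD can also be observed without the UI. The sketch below is an illustration under assumptions, not a definitive test: it assumes Spark 2.0 or later (for `longAccumulator`), local mode, and no task retries, since retries can inflate accumulator values updated inside transformations. The accumulator name `mapSideRecords` and the object name are made up for this example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SkippedStageSketch {
  // Returns the accumulator value after the first and after the second job.
  def run(): (Long, Long) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("skipped-stage-sketch")
    val sc = new SparkContext(conf)
    try {
      // Counts records that pass through the map side of the shuffle.
      val mapSideRecords = sc.longAccumulator("mapSideRecords")

      val left = sc.parallelize(Seq((1, "a"), (2, "b")), 4)
        .map { kv => mapSideRecords.add(1L); kv }
      val right = sc.parallelize(Seq((1, "x"), (2, "y")), 4)
      val joined = left.join(right)

      joined.count() // first job: the shuffle map stages run
      val afterFirstJob = mapSideRecords.sum

      joined.foreach(_ => ()) // second job on the same RDD
      val afterSecondJob = mapSideRecords.sum

      (afterFirstJob, afterSecondJob)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = {
    val (first, second) = run()
    println(s"map-side records after job 1: $first")
    println(s"map-side records after job 2: $second")
  }
}
```

If both values are equal, the map side did not run again for the second job, which matches the skipped stages reported by the Spark UI.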

Alper t. Turker
  • Hi, thanks for the answer. I agree with you for this scenario. However, I've updated my question with the particular scenario I am talking about. – human Jun 11 '18 at 23:37
-3

The shuffle files are meant for the stages within a job. Other jobs won't be able to use these shuffle files. So, AFAIK, no: shuffle files cannot be shared between jobs.