
I'd like to understand the Spark DAG model as described in the official Spark docs, which say that all transformations in Spark are lazy and that, by default, each transformed RDD may be recomputed each time you run an action on it. So I wrote a small program like the one below:

scala> val lines = sc.textFile("C:\\Spark\\README.md")
lines: org.apache.spark.rdd.RDD[String] = C:\Spark\README.md MapPartitionsRDD[1] at textFile at <console>:24

scala> val breakLInes = lines.flatMap(line=>line.split(" "))
breakLInes: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at flatMap at <console>:26

scala> val createTuple = breakLInes.map(line=>(line,1))
createTuple: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[3] at map at <console>:28

scala> val wordCount = createTuple.reduceByKey(_+_)
wordCount: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[4] at reduceByKey at <console>:30

scala> wordCount.first
res0: (String, Int) = (package,1)
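
For reference (this call was not part of my original session), the lineage that Spark turns into the DAG can be printed with toDebugString; the shuffle introduced by reduceByKey is where Spark cuts the lineage into two stages:

// Prints the RDD lineage: ShuffledRDD[4] on top of the MapPartitionsRDDs.
// The indentation change in the output marks the shuffle (stage) boundary.
wordCount.toDebugString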

Moving to the Spark UI, below is my DAG visualization for the first action:

[DAG visualization for the first action]

I then executed the same action again:

scala> wordCount.first
res0: (String, Int) = (package,1)

Moving to the Spark UI again, below is my DAG visualization for the second action:

[DAG visualization for the second action, with Stage 2 marked as skipped]

The docs say that, by default, each transformed RDD may be recomputed each time you run an action on it. Why, then, is Stage 2 skipped for the second action? Since no caching has been done, shouldn't Stage 2 have been computed again?
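
For comparison, here is a minimal sketch (not from my session) of what explicit caching would look like; with persist/cache the reuse is something I request myself rather than something Spark does with shuffle output:

import org.apache.spark.storage.StorageLevel

// Explicitly persist the shuffled RDD so repeated actions reuse it.
wordCount.persist(StorageLevel.MEMORY_ONLY)   // or simply wordCount.cache()
wordCount.first()   // computes and caches the partitions this action touches
wordCount.first()   // reuses the cached partitions instead of recomputing them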

    Spark automatically stores shuffle files and can load them as if you had called persist explicitly. See http://stackoverflow.com/questions/34580662/what-does-stage-skipped-mean-in-apache-spark-web-ui – puhlen May 03 '17 at 17:24
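
To illustrate the comment above (a hedged sketch, not from the original session): because the map-side output of the reduceByKey shuffle is kept on disk, any job that depends on the same shuffle can skip the pre-shuffle stage, even for a new RDD:

// A new RDD built on top of the same ShuffledRDD.
val frequentWords = wordCount.filter { case (_, count) => count > 1 }

// This job only runs the post-shuffle stage; the earlier map stage shows
// up as "skipped" in the UI because its shuffle files already exist.
frequentWords.count()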
