I'd like to understand the Spark DAG model. According to the official Spark docs, all transformations in Spark are lazy, and by default each transformed RDD may be recomputed each time you run an action on it. So I wrote a small program like the one below:
scala> val lines = sc.textFile("C:\\Spark\\README.md")
lines: org.apache.spark.rdd.RDD[String] = C:\Spark\README.md MapPartitionsRDD[1] at textFile at <console>:24
scala> val breakLines = lines.flatMap(line=>line.split(" "))
breakLines: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at flatMap at <console>:26
scala> val createTuple = breakLines.map(line=>(line,1))
createTuple: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[3] at map at <console>:28
scala> val wordCount = createTuple.reduceByKey(_+_)
wordCount: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[4] at reduceByKey at <console>:30
scala> wordCount.first
res0: (String, Int) = (package,1)
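For reference, the lineage can also be inspected with the RDD's toDebugString method (I am only showing the command here, not my actual output; it prints the chain of MapPartitionsRDDs that feed the final ShuffledRDD):

scala> wordCount.toDebugString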
Moving to the Spark UI, below is the DAG visualization for the first action.

Then I executed the same action again:
scala> wordCount.first
res0: (String, Int) = (package,1)
Moving to the Spark UI again, below is the DAG visualization for the second action.
By default, each transformed RDD may be recomputed each time you run an action on it. So why is stage 2 being skipped? Since no caching has been done, shouldn't it have been recalculated for the second action?
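For comparison, here is a minimal sketch (not part of the session above) of how I would cache the result explicitly and rerun the action. cache() is the standard RDD method, which uses the default MEMORY_ONLY storage level:

scala> wordCount.cache()   // marks the RDD for caching; nothing is computed yet
scala> wordCount.first     // this run computes the needed partition and stores it in memory
scala> wordCount.first     // this run can read the cached partition instead of recomputing it

But the stage is skipped even without any such cache() call, which is the part I don't understand.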