In my Spark UI, some stages are marked as "skipped". What does "skipped" mean here?
2 Answers
Typically it means that the data has been fetched from cache, so there was no need to re-execute the given stage. This is consistent with your DAG, which shows that the next stage requires shuffling (reduceByKey). Whenever shuffling is involved, Spark automatically keeps the generated shuffle data:
Shuffle also generates a large number of intermediate files on disk. As of Spark 1.3, these files are preserved until the corresponding RDDs are no longer used and are garbage collected. This is done so the shuffle files don’t need to be re-created if the lineage is re-computed.
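To make the idea concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark's scheduler: the `ToyScheduler` class, the `map-0` stage name, and the `map_side` function are all hypothetical names made up for illustration. It only shows the principle that a stage whose shuffle output is still available does not need to run again.

```python
# Toy model only: this is NOT Spark's scheduler, just an illustration
# of why a stage whose shuffle output already exists can be skipped.

class ToyScheduler:
    def __init__(self):
        self.shuffle_outputs = {}   # stage name -> preserved shuffle output
        self.log = []               # (stage name, "ran" or "skipped")

    def run_map_stage(self, name, compute):
        """Run a shuffle-map stage unless its output is already available."""
        if name in self.shuffle_outputs:
            self.log.append((name, "skipped"))
        else:
            self.shuffle_outputs[name] = compute()
            self.log.append((name, "ran"))
        return self.shuffle_outputs[name]


def map_side():
    # Hypothetical map stage: group (word, 1) pairs by key, as a shuffle would.
    words = ["a", "b", "a", "c", "b", "a"]
    grouped = {}
    for w in words:
        grouped.setdefault(w, []).append(1)
    return grouped


scheduler = ToyScheduler()

# Job 1: the map stage runs, then the reduce side sums the counts per key.
counts = {k: sum(v) for k, v in scheduler.run_map_stage("map-0", map_side).items()}

# Job 2: same lineage, different action; the map stage is skipped.
distinct = len(scheduler.run_map_stage("map-0", map_side))

print(counts)
print(distinct)
print(scheduler.log)
```

The second job gets the same grouped data as the first one without recomputing it, which is roughly what a "skipped" stage in the UI is telling you.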

- Great answer. If you want to find out _way_ more about the semantics of "skipped" and "pending" stages on the web UI, check out https://github.com/apache/spark/pull/3009, the pull request which first introduced these concepts. That PR is also an interesting read if you're curious about how skipped / pending stages interact with job-level progress bars. – Josh Rosen Jan 04 '16 at 07:58
- If I am following correctly, does Spark skipping these stages mean they don't happen and can be removed from the code altogether? Or is the code already very efficient with the cache, so I should leave it? @zero323 – SparkleGoat Jan 24 '19 at 15:26
- @SparkleGoat No. It means that these stages have been evaluated before, and the result is available without re-execution. – 10465355 Jan 30 '19 at 21:36
- Another question: can caching and skipping stages make the output data different? – SparkleGoat Feb 15 '19 at 19:36
- @SparkleGoat, no. Caching (and the skipping that results from it) is an internal Spark optimization and doesn't change the output data in any way. – Ravi Sanwal Oct 23 '19 at 23:30
Suppose you have an initial data frame with some data. You apply a couple of transformations to it and then run multiple actions on the final data frame. If you have cached a data frame, Spark materializes it the first time you call an action and keeps the materialized result in memory. When the next action is called, Spark walks the whole DAG again, sees that the data frame is cached, and skips the stages that produced it by reusing the materialized data.
When Spark skips a stage, it shows up as skipped in the Spark UI. This speeds up your operation, because Spark does not have to compute the DAG from the root and can start from the cached data frame instead.
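Here is a rough sketch of that behaviour in plain Python. It is a toy model and not Spark's API: the `Node` class and the node names `load`, `double`, `sum`, and `max` are hypothetical, chosen only to show that a cached node is computed by the first action and reused by the second.

```python
# Toy model only, not Spark's API: a lineage node recomputes its parents
# on every action unless it has been marked as cached and materialized.

class Node:
    def __init__(self, name, fn, parent=None):
        self.name, self.fn, self.parent = name, fn, parent
        self.cached = False
        self._value = None
        self._materialized = False

    def cache(self):
        self.cached = True          # lazy: nothing is computed yet
        return self

    def evaluate(self, trace):
        # A cached, already materialized node short-circuits the walk.
        if self.cached and self._materialized:
            trace.append(f"{self.name}: skipped")
            return self._value
        parent_value = self.parent.evaluate(trace) if self.parent else None
        value = self.fn(parent_value)
        trace.append(f"{self.name}: computed")
        if self.cached:
            self._value, self._materialized = value, True
        return value


source = Node("load", lambda _: [1, 2, 3, 4])
doubled = Node("double", lambda xs: [x * 2 for x in xs], source).cache()
total = Node("sum", sum, doubled)       # first "action"
biggest = Node("max", max, doubled)     # second "action"

first_trace, second_trace = [], []
total_value = total.evaluate(first_trace)
max_value = biggest.evaluate(second_trace)

print(first_trace)
print(second_trace)
```

In this sketch the first action computes everything from the root, while the second one stops at the cached node and never touches `load` again. The results themselves are the same as they would be without the cache.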
