I am trying to find the situations in which Spark skips stages when I am using RDDs. I know that it skips stages when a shuffle operation is involved, so I wrote the following code to see if that is true:
import org.apache.spark.{SparkConf, SparkContext}

object Demo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("demo")
    val sc = new SparkContext(conf)

    // Pair RDD that all three joins below reuse
    val d = sc.parallelize(0 until 1000000).map(i => (i % 100000, i))

    val c = d.rightOuterJoin(d.reduceByKey(_ + _)).collect()
    val f = d.leftOuterJoin(d.reduceByKey(_ + _)).collect()
    val g = d.join(d.reduceByKey(_ + _)).collect()
  }
}
On inspecting the Spark UI, I see the following jobs with their stages:
I was expecting stage 3 and stage 6 to be skipped, since they use the same RDD to compute the required joins (given that, in the case of a shuffle, Spark automatically caches the shuffle output). Can anyone please explain why I am not seeing any skipped stages here? How can I modify the code so that stages are skipped? And are there any other situations (apart from shuffling) in which Spark is expected to skip stages?
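To make my expectation concrete, here is a minimal sketch (the variable name reduced is my own, and it builds on the d defined above) of the kind of reuse I assumed was happening: because both actions run over the exact same shuffled RDD instance, I would expect the shuffle-map stage of the second job to show up as skipped in the UI.

// Minimal sketch of the reuse I assumed was happening (names are my own)
val reduced = d.reduceByKey(_ + _) // introduces one shuffle
reduced.count()                    // job 1: computes the shuffle-map stage
reduced.collect()                  // job 2: I expected the map stage to be skipped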