
I have a simple, maybe weird question: for the following code the DAG is executed twice, which is expected, because I'm calling an action two times:

val errorLines = sc.longAccumulator("errorLines")
val input = sc.parallelize(List(1,2,3,4))
val result = input.map(x => {
  println("!!! Input Map !!!")
  errorLines.add(1)
  (x,1)
})
//.reduceByKey(_+_)
println(result.count())
println(result.collect().mkString(","))

If I uncomment the reduceByKey line, the DAG is executed only once, even though reduceByKey is a transformation and I'm still calling two actions.

Does that mean that Spark doesn't always recompute the DAG?
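For comparison, the reuse can also be requested explicitly with cache() (a spark-shell sketch, not part of the original question; `cached` is an illustrative name):

```scala
// Sketch: explicitly cache the mapped RDD so the second action reuses
// the in-memory copy instead of recomputing the map.
val input = sc.parallelize(List(1,2,3,4))
val cached = input.map(x => (x, 1)).cache()
println(cached.count())                  // computes and caches the partitions
println(cached.collect().mkString(","))  // served from the cache
```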

ALincoln
1 Answer


Shuffle files in Spark serve as an implicit cache, so whenever your pipeline contains a shuffle stage (like a *ByKey operation) and no node failure is involved, Spark will repeat only the last stage and reuse the shuffle output for everything before it.
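A minimal spark-shell sketch that makes this visible (the `mapCalls` accumulator name is illustrative):

```scala
// Count how many times the map function actually runs.
val mapCalls = sc.longAccumulator("mapCalls")

val pairs = sc.parallelize(List(1, 2, 3, 4))
  .map { x => mapCalls.add(1); (x, 1) }  // runs in the pre-shuffle stage
  .reduceByKey(_ + _)

pairs.count()    // first action: map stage runs, shuffle files are written
pairs.collect()  // second action: the map stage shows as "skipped" in the UI

// Under normal execution the map ran once per element, not twice.
println(mapCalls.value)
```

Without the reduceByKey, each action re-runs the map, so the accumulator value would roughly double.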

That being said, neither writing to stdout nor updating accumulators (errorLines is one) inside transformations is reliable. During normal execution the former is lost (it goes to executor logs, not the driver), and the latter doesn't provide exactly-once guarantees.
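If you do need a reliable count, one option (a sketch, not from the original answer) is to update the accumulator inside an action such as foreach, where Spark applies each update exactly once:

```scala
// Accumulator updated inside an action (foreach) rather than a
// transformation; Spark guarantees exactly-once updates here.
val reliable = sc.longAccumulator("reliable")
sc.parallelize(List(1, 2, 3, 4)).foreach(_ => reliable.add(1))
println(reliable.value)
```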

Related: What does "Stage Skipped" mean in Apache Spark web UI?

zero323