
I have a simple, maybe weird question: for the following code the DAG is executed twice, which is expected, because I'm calling an action two times:

val errorLines = sc.longAccumulator("errorLines")
val input = sc.parallelize(List(1,2,3,4))
val result = input.map(x => {
  println("!!! Input Map !!!")
  errorLines.add(1)
  (x,1)
})
//.reduceByKey(_+_)
println(result.count())
println(result.collect().mkString(","))

If I uncomment the reduceByKey line, the DAG is executed only once, even though reduceByKey is a transformation and I'm still calling two actions.

Does that mean that Spark doesn't always recompute the DAG?
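For comparison, the reuse can also be requested explicitly with cache() (a spark-shell sketch, not part of the original question; `cached` is an illustrative name):

```scala
// Sketch: explicitly cache the mapped RDD so the second action reuses
// the in-memory copy instead of recomputing the map.
val input = sc.parallelize(List(1,2,3,4))
val cached = input.map(x => (x, 1)).cache()
println(cached.count())                  // computes and caches the partitions
println(cached.collect().mkString(","))  // served from the cache
```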

ALincoln
1 Answer


Shuffle files in Spark serve as an implicit cache, so whenever your pipeline contains a shuffle stage (like a *ByKey operation) and no node failure is involved, Spark will repeat only the last stage and reuse the shuffle output for everything before it.
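A minimal spark-shell sketch that makes this visible (the `mapCalls` accumulator name is illustrative):

```scala
// Count how many times the map function actually runs.
val mapCalls = sc.longAccumulator("mapCalls")

val pairs = sc.parallelize(List(1, 2, 3, 4))
  .map { x => mapCalls.add(1); (x, 1) }  // runs in the pre-shuffle stage
  .reduceByKey(_ + _)

pairs.count()    // first action: map stage runs, shuffle files are written
pairs.collect()  // second action: the map stage shows as "skipped" in the UI

// Under normal execution the map ran once per element, not twice.
println(mapCalls.value)
```

Without the reduceByKey, each action re-runs the map, so the accumulator value would roughly double.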

That being said, neither writing to stdout nor updating accumulators (errorLines is one) inside transformations is reliable. During normal execution the former is lost (it goes to executor logs, not the driver), and the latter doesn't provide exactly-once guarantees.
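If you do need a reliable count, one option (a sketch, not from the original answer) is to update the accumulator inside an action such as foreach, where Spark applies each update exactly once:

```scala
// Accumulator updated inside an action (foreach) rather than a
// transformation; Spark guarantees exactly-once updates here.
val reliable = sc.longAccumulator("reliable")
sc.parallelize(List(1, 2, 3, 4)).foreach(_ => reliable.add(1))
println(reliable.value)
```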

Related: What does "Stage Skipped" mean in Apache Spark web UI?

zero323