There are claims that a Spark RDD must be a deterministic function of its inputs, because partitions may be recomputed for fault tolerance, yet there are also sanctioned nondeterministic RDDs, for example in Spark SQL or Spark ML. Is there any formal guidance on how to use nondeterminism safely?
Consider this Spark job with a diamond-shaped DAG:
val bar = (rdd map f) zip (rdd map g)
bar.saveAsTextFile("outfile")
If rdd is nondeterministic (e.g., it produces random values or timestamps), will outfile contain consistent data? Is it possible that one component of the zip is recomputed while the other is not? Is safety guaranteed if we checkpoint or persist rdd? Would a local checkpoint suffice?
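For concreteness, here is a minimal sketch of the scenario I have in mind. The source of nondeterminism (scala.util.Random), the branch functions f and g, and the persist/checkpoint calls are my own illustration of the question, not something I know to be guaranteed safe:

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("diamond-dag").getOrCreate()
val sc = spark.sparkContext

// Nondeterministic source: each (re)computation of this RDD can yield different values.
val rdd = sc.parallelize(1 to 1000000).map(_ => scala.util.Random.nextLong())

// Hypothetical branch functions for the two sides of the diamond.
val f = (x: Long) => x % 7
val g = (x: Long) => x.toString.length

// Attempted mitigation: materialize rdd once before branching, in the hope that
// both sides of the zip observe the same values even if tasks fail and rerun.
rdd.persist(StorageLevel.MEMORY_AND_DISK)
// Alternatives under discussion: rdd.checkpoint() (after sc.setCheckpointDir(...))
// or rdd.localCheckpoint().

val bar = (rdd map f) zip (rdd map g)
bar.saveAsTextFile("outfile")

The question is whether any of persist, checkpoint, or localCheckpoint actually guarantees that both zip components are derived from a single evaluation of rdd, given that persisted partitions can still be evicted or lost and recomputed.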