An RDD has a meaningful order (as opposed to some arbitrary order imposed by the storage model) if it was processed by sortBy(), as explained in this reply.
Now, which operations preserve that order?
E.g., is it guaranteed that (after a.sortBy())
a.map(f).zip(a) ===
a.map(x => (f(x),x))
How about
a.filter(f).map(g) ===
a.map(x => (x,g(x))).filter(t => f(t._1)).map(_._2)
What about
a.filter(f).flatMap(g) ===
a.flatMap(x => g(x).map((x,_))).filter(t => f(t._1)).map(_._2)
Here "equality" (===) is understood as "functional equivalence", i.e., there is no way to distinguish the two outcomes using user-level operations (i.e., without reading logs etc.).
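To make the question concrete, here is a minimal sketch of how the three cases could be compared empirically. It assumes a local SparkContext with four threads, and the functions f, p and g are hypothetical stand-ins (the predicate is named p here to keep it apart from the mapping function). Agreement on one run would of course not prove a guarantee, which is exactly why I am asking about the contract rather than the observed behavior.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object OrderCheck {
  def main(args: Array[String]): Unit = {
    // Assumption: a local master with 4 threads, only to get several partitions.
    val conf = new SparkConf().setMaster("local[4]").setAppName("order-check")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical stand-ins for the functions in the question.
      val f: Int => Int = _ * 10               // used with map
      val p: Int => Boolean = _ % 3 != 0       // used with filter
      val g: Int => Seq[Int] = x => Seq(x, -x) // used with flatMap

      // A sorted RDD spread over 4 partitions.
      val a = sc.parallelize(Seq(5, 3, 9, 1, 7, 6, 2, 8, 4), 4).sortBy(x => x).cache()

      // Case 1: map followed by zip vs. a single map producing pairs.
      val zipLhs = a.map(f).zip(a).collect().toSeq
      val zipRhs = a.map(x => (f(x), x)).collect().toSeq

      // Case 2: filter then map vs. map, filter on the original element, project.
      val mapLhs = a.filter(p).map(f).collect().toSeq
      val mapRhs = a.map(x => (x, f(x))).filter(t => p(t._1)).map(_._2).collect().toSeq

      // Case 3: filter then flatMap vs. flatMap, filter on the original element, project.
      val flatLhs = a.filter(p).flatMap(g).collect().toSeq
      val flatRhs = a.flatMap(x => g(x).map((x, _))).filter(t => p(t._1)).map(_._2).collect().toSeq

      // collect() returns the elements partition by partition, so comparing the
      // collected sequences compares both the contents and their order.
      println(s"map/zip equal:        ${zipLhs == zipRhs}")
      println(s"filter/map equal:     ${mapLhs == mapRhs}")
      println(s"filter/flatMap equal: ${flatLhs == flatRhs}")
    } finally {
      sc.stop()
    }
  }
}
```

Comparing the results of collect() is the "user-level" notion of equality meant above: two pipelines count as equivalent if no such comparison can tell them apart.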