Does Dataframe coalesce in Spark preserve order?

Question

I use coalesce(1) to write a Dataframe to single file, like this.

df.coalesce(1).write.format("csv")
  .option("header", true).mode("overwrite").save(output_path)

A quick glance at the file shows that the order was preserved, but is it always the case? If the order is not preserved, how can I enforce it? The coalesce function of RDD has an extra parameter to disallow shuffling, but the coalesce method of Dataframe only takes 1 parameter.

I think this will need some effort to figure out, although it has been discussed again https://stackoverflow.com/questions/29284095/which-operations-preserve-rdd-order. Take a look at Avseiytsev Dmitriy answer. It seems that currently there is no guarantee for order retention of the final RDD. This is the class responsible for the coalesce algorithm https://github.com/apache/spark/blob/89c98a4c7068734e322d335cb7c9f22379ff00e8/core/src/main/scala/org/apache/spark/rdd/CoalescedRDD.scala#L156 — abiratsis, Jun 13 '20 at 20:53

score 2 · Answer 1 · answered Jun 10 '20 at 11:49

If you read a file (sc.read.text) the lines of the DataFrame/Dataset/RDD will be in the order that they were in the file.

list, map, filter,coalesce and flatMap do preserve the order. sortBy, partitionBy and join do not preserve the order.

The reason is that most DataFrame/Dataset/RDD operations work on Iterators inside the partitions. So map or filter just has no way to mess up the order.

In case of if you choose to use HashPartitioner and invoking invoke map on DataFrame/Dataset/RDD will change the key. In this case you can use partitionBy to restore the partitioning with a shuffle.

Does Dataframe coalesce in Spark preserve order?

1 Answers1