2

I use coalesce(1) to write a Dataframe to single file, like this.

df.coalesce(1).write.format("csv")
  .option("header", true).mode("overwrite").save(output_path)

A quick glance at the file shows that the order was preserved, but is it always the case? If the order is not preserved, how can I enforce it? The coalesce function of RDD has an extra parameter to disallow shuffling, but the coalesce method of Dataframe only takes 1 parameter.

MetallicPriest
  • 29,191
  • 52
  • 200
  • 356
  • 3
    Specify the `orderBy` after coalesce to enforce ordering – Som Jun 10 '20 at 11:57
  • May be time for the high priest to indicate an answer. – thebluephantom Jun 11 '20 at 07:22
  • 1
    I think this will need some effort to figure out, although it has been discussed again https://stackoverflow.com/questions/29284095/which-operations-preserve-rdd-order. Take a look at Avseiytsev Dmitriy answer. It seems that currently there is no guarantee for order retention of the final RDD. This is the class responsible for the coalesce algorithm https://github.com/apache/spark/blob/89c98a4c7068734e322d335cb7c9f22379ff00e8/core/src/main/scala/org/apache/spark/rdd/CoalescedRDD.scala#L156 – abiratsis Jun 13 '20 at 20:53

1 Answers1

2

If you read a file (sc.read.text) the lines of the DataFrame/Dataset/RDD will be in the order that they were in the file.

list, map, filter,coalesce and flatMap do preserve the order. sortBy, partitionBy and join do not preserve the order.

The reason is that most DataFrame/Dataset/RDD operations work on Iterators inside the partitions. So map or filter just has no way to mess up the order.

In case of if you choose to use HashPartitioner and invoking invoke map on DataFrame/Dataset/RDD will change the key. In this case you can use partitionBy to restore the partitioning with a shuffle.

QuickSilver
  • 3,915
  • 2
  • 13
  • 29