Difference between repartition(1) and coalesce(1)

Question

In our project, we are using repartition(1) to write data into table, I am interested to know why coalesce(1) cannot be used here because repartition is a costly operation compared to coalesce.

I know repartition distributes data evenly across partitions, but when the output file is of single part file, why can't we use coalesce(1)?

see also https://stackoverflow.com/questions/44494656/coalesce-reduces-parallelism-of-entire-stage-spark — Raphael Roth, Sep 12 '21 at 19:04
Differences between `coalesce(1)` and `repartition(1)` may be experienced as a result of older Spark version: https://stackoverflow.com/questions/71154277 — ZygD, Mar 05 '22 at 20:33

score 7 · Answer 1 · answered Feb 14 '22 at 20:00

coalesce has an issue where if you're calling it using a number smaller than your current number of executors, the number of executors used to process that step will be limited by the number you passed in to the coalesce function.

The repartition function avoids this issue by shuffling the data. In any scenario where you're reducing the data down to a single partition (or really, less than half your number of executors), you should almost always use repartition over coalesce because of this. The shuffle caused by repartition is a small price to pay compared to the single-threaded operation of a call to coalesce(1)

score 3 · Accepted Answer · answered Sep 12 '21 at 08:33

You state nothing else in terms of logic.

coalesce will use existing partitions to minimize shuffling. In case of coalsece(1) and counterpart may be not a big deal, but one can take this guiding principle that repartition creates new partitions and hence does a full shuffle. That said, coalsece can be said to minimize the amount of shuffling.
In my spare time I chanced upon this https://medium.com/airbnb-engineering/on-spark-hive-and-small-files-an-in-depth-look-at-spark-partitioning-strategies-a9a364f908 excellent article. Look for the quote: Coalesce sounds useful in some cases, but has some problems.

Difference between repartition(1) and coalesce(1)

2 Answers2

Linked