7

In our project, we are using repartition(1) to write data into table, I am interested to know why coalesce(1) cannot be used here because repartition is a costly operation compared to coalesce.

I know repartition distributes data evenly across partitions, but when the output file is of single part file, why can't we use coalesce(1)?

ZygD
  • 22,092
  • 39
  • 79
  • 102
nagraj036
  • 165
  • 1
  • 6
  • 1
    see also https://stackoverflow.com/questions/44494656/coalesce-reduces-parallelism-of-entire-stage-spark – Raphael Roth Sep 12 '21 at 19:04
  • Differences between `coalesce(1)` and `repartition(1)` may be experienced as a result of older Spark version: https://stackoverflow.com/questions/71154277 – ZygD Mar 05 '22 at 20:33

2 Answers2

7

coalesce has an issue where if you're calling it using a number smaller than your current number of executors, the number of executors used to process that step will be limited by the number you passed in to the coalesce function.

The repartition function avoids this issue by shuffling the data. In any scenario where you're reducing the data down to a single partition (or really, less than half your number of executors), you should almost always use repartition over coalesce because of this. The shuffle caused by repartition is a small price to pay compared to the single-threaded operation of a call to coalesce(1)

Nolan Barth
  • 374
  • 2
  • 8
3

You state nothing else in terms of logic.

thebluephantom
  • 16,458
  • 8
  • 40
  • 83