If I want to repartition a DataFrame, how do I decide on the number of partitions to create? And how do I decide whether to use repartition or coalesce? I understand that coalesce is used only to reduce the number of partitions, but how can we decide which one to use in which scenario?
Does this answer your question? [Spark - repartition() vs coalesce()](https://stackoverflow.com/questions/31610971/spark-repartition-vs-coalesce) – Robert Kossendey Sep 30 '21 at 13:26
1 Answer
We can't decide this based on a single parameter; multiple factors determine how many partitions to use and whether to repartition or coalesce:

- **Size of the data.** If a file is large, you can create 2 or 3 partitions per block to improve parallelism, but creating too many partitions splits the data into small files, and in big data systems small files lower performance. For example, splitting one 128 MB block into two gives 128/2 = 64 MB per partition, so one task processes 64 MB. (See the sketch after this list.)
- **Size of the cluster.** If you have more executors/cores free, you can set the partition count according to the available parallelism.
- **Shuffle behavior.** `repartition` causes a complete shuffle, while `coalesce` merges existing partitions and avoids a complete shuffle, which is why it is only suitable for reducing the number of partitions.
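As a minimal PySpark sketch of the sizing rule above (the input path, the 10 GB dataset size, and the partition counts are assumptions chosen for illustration, not values from the question):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-sizing").getOrCreate()

# Hypothetical input: assume roughly 10 GB of Parquet data.
df = spark.read.parquet("/data/events")

# Rule of thumb from the answer: target partitions of about 64 MB each.
total_size_mb = 10 * 1024          # assumed dataset size in MB
target_partition_mb = 64
num_partitions = max(1, total_size_mb // target_partition_mb)  # -> 160

# Increasing the partition count requires repartition(), which
# performs a full shuffle across the cluster.
df_more = df.repartition(num_partitions)
print(df_more.rdd.getNumPartitions())  # inspect the resulting count

# Reducing the partition count before a write is the typical
# coalesce() use case: it merges existing partitions without a
# full shuffle, avoiding many small output files.
df_fewer = df_more.coalesce(16)
df_fewer.write.mode("overwrite").parquet("/data/events_compacted")
```

If the free executor cores exceed `num_partitions`, raising the count toward the core total keeps the whole cluster busy; otherwise some cores sit idle.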

C Kondaiah