If I want to repartition a DataFrame, how do I decide on the number of partitions to create? And how do I decide whether to use repartition or coalesce? I understand that coalesce is used only to reduce the number of partitions, but how can we decide which one to use in which scenario?
Does this answer your question? [Spark - repartition() vs coalesce()](https://stackoverflow.com/questions/31610971/spark-repartition-vs-coalesce) – Robert Kossendey Sep 30 '21 at 13:26
1 Answer
We can't decide this based on a single parameter; multiple factors determine how many partitions to use and whether to repartition or coalesce:

- **Size of the data.** If a file is large, you can create 2 or 3 partitions per block to improve parallelism, but creating too many partitions splits the data into small files, and in big data systems small files lower performance. For example, splitting one 128 MB block into two gives 128/2 = 64 MB per partition, so one task processes 64 MB. (See the sketch after this list.)
- **Size of the cluster.** If you have more executors/cores free, you can set the partition count according to the available parallelism.
- **Shuffle behavior.** `repartition` causes a complete shuffle, while `coalesce` merges existing partitions and avoids a complete shuffle, which is why it is only suitable for reducing the number of partitions.
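As a minimal PySpark sketch of the sizing rule above (the input path, the 10 GB dataset size, and the partition counts are assumptions chosen for illustration, not values from the question):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-sizing").getOrCreate()

# Hypothetical input: assume roughly 10 GB of Parquet data.
df = spark.read.parquet("/data/events")

# Rule of thumb from the answer: target partitions of about 64 MB each.
total_size_mb = 10 * 1024          # assumed dataset size in MB
target_partition_mb = 64
num_partitions = max(1, total_size_mb // target_partition_mb)  # -> 160

# Increasing the partition count requires repartition(), which
# performs a full shuffle across the cluster.
df_more = df.repartition(num_partitions)
print(df_more.rdd.getNumPartitions())  # inspect the resulting count

# Reducing the partition count before a write is the typical
# coalesce() use case: it merges existing partitions without a
# full shuffle, avoiding many small output files.
df_fewer = df_more.coalesce(16)
df_fewer.write.mode("overwrite").parquet("/data/events_compacted")
```

If the free executor cores exceed `num_partitions`, raising the count toward the core total keeps the whole cluster busy; otherwise some cores sit idle.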

C Kondaiah