
On what basis is the value for repartition(value) calculated? Is it based on the data size or on something other than that?

Suppose there are 100 million records in a particular DataFrame; what value should we pass to repartition, and how do we calculate it?

  • Does this answer your question? [How to calculate the best numberOfPartitions for coalesce?](https://stackoverflow.com/questions/40865326/how-to-calculate-the-best-numberofpartitions-for-coalesce) – user10938362 Dec 05 '19 at 19:10

1 Answer


Nobody can tell you exactly how many partitions you need. The optimal number of partitions depends on the number of records, the size of each record, and what you want to do with the data.

I would suggest starting with, say, 200 (the default value of spark.sql.shuffle.partitions) and checking whether you run into memory problems; also check task duration and peak execution memory in the Spark UI. Depending on your findings, increase or decrease the number of partitions. You may also want to run some simple benchmarks. Make sure you always have (several times) more partitions than #executors * cores_per_executor.
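
As a minimal sketch of that rule of thumb (the cluster figures, input path, and DataFrame name `df` here are assumptions, not taken from the question):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("repartition-sizing")
  .getOrCreate()

// Hypothetical cluster: 10 executors with 4 cores each.
val executors = 10
val coresPerExecutor = 4

// Rule of thumb from above: several times more partitions than executors * cores.
val numPartitions = executors * coresPerExecutor * 4   // 160 here; start around 200 and tune

val df = spark.read.parquet("/path/to/input")          // placeholder input path
val repartitioned = df.repartition(numPartitions)

// The default shuffle parallelism can also be adjusted globally:
spark.conf.set("spark.sql.shuffle.partitions", numPartitions.toString)
```

Then watch the Spark UI: tasks that spill or run far longer than their peers suggest you need more partitions; many tiny, short-lived tasks suggest fewer.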

Note that if you make the value depend on the data size, you need to query your data twice. Most often this is not needed; sometimes it makes sense (if the workload differs drastically between runs). Often it can be replaced by crude approximations based on the input parameters of your job, e.g. whether it's an (initial) full load, an update job, etc.
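
Here is a sketch of both approaches, continuing with the `df` from the sketch above; the record target, partition counts, and the `loadType` parameter are hypothetical:

```scala
// 1) Data-size-based: requires an extra pass over the data, since count() is an action.
val recordsPerPartition = 500000L                       // assumed target records per partition
val dynamicPartitions = (df.count() / recordsPerPartition).toInt.max(1)
val byCount = df.repartition(dynamicPartitions)

// 2) Crude approximation from a job parameter, avoiding the extra query.
val loadType: String = "full"                           // e.g. passed in as a job argument
val approxPartitions = loadType match {
  case "full"   => 400                                  // initial full load: more data, more partitions
  case "update" => 50                                   // incremental update: far less data
  case _        => 200                                  // fall back to the shuffle default
}
val byParameter = df.repartition(approxPartitions)
```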

Raphael Roth
  • I'm sorry Raphael, I think the link provided "kinda" answers the question, although I agree with you. – eliasah Dec 06 '19 at 15:26