On what basis is the value for repartition(value) calculated? Is it based on the data size or on something other than that?
If a particular DataFrame has, say, 100 million records, what value do we have to pass to repartition? And how do we calculate that value?
Nobody can tell you exactly how many partitions you need. The optimal number of partitions depends on the number of records, the size of each record, and what you want to do with the data.
I would suggest starting with, say, 200 (the default value of spark.sql.shuffle.partitions) and checking whether you run into memory problems; also check task durations and peak execution memory in the Spark UI. Depending on your findings, increase or decrease the number of partitions. You may also want to run some simple benchmarks. Make sure you always have (several times) more partitions than #executors * cores_per_executor, as in the sketch below.
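A minimal sketch of that starting point, assuming a SparkSession named spark and a DataFrame named df (the partitionFactor of 3 is an assumption you would tune after benchmarking):

```scala
// Default shuffle partition count (200 unless overridden).
val defaultShufflePartitions = spark.conf.get("spark.sql.shuffle.partitions").toInt

// Total cores available to the application (roughly executors * cores per executor).
val totalCores = spark.sparkContext.defaultParallelism

// Keep several times more partitions than cores so tasks stay small
// and the scheduler can balance work across executors.
val partitionFactor = 3  // assumption: adjust after checking the Spark UI
val numPartitions = math.max(defaultShufflePartitions, totalCores * partitionFactor)

val repartitioned = df.repartition(numPartitions)
```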
Note that if you make the value depend on the data size, you need to query your data twice. Most often this is not needed; sometimes it makes sense (if the workload differs drastically between runs). Often it can be replaced by crude approximations based on the input parameters of your job, i.e. whether it's an (initial) full load, an update job, etc., as sketched below.
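A hedged sketch of such a crude approximation, keyed on a job-type parameter rather than on the data itself; the job types and counts are assumptions, not measured values, and should come from your own benchmarks:

```scala
// Pick the partition count from the job parameters instead of scanning the data.
def partitionsFor(jobType: String): Int = jobType match {
  case "full_load" => 800   // assumption: initial full load is large
  case "update"    => 100   // assumption: incremental updates are much smaller
  case _           => 200   // fall back to the spark.sql.shuffle.partitions default
}

val repartitionedDf = df.repartition(partitionsFor("update"))
```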