I use spark-sql to read a big table, and it generates 100,000 tasks.
I know I can set num_of_partitions, but that would affect small tables in the same way.
Is there any way to limit the size of each partition?
Currently Spark doesn't support limiting the partition size directly. If you want to reduce the number of tasks, you need to set the partition count to a smaller number.
The trick generally used is to set the number of partitions dynamically based on the data size. Generally you want each partition to be about the size of an HDFS block (128 MB). If you know the size of each row of your data, you can estimate how many rows you want to keep per partition; let's say that value is X.
Then you can set num_of_partitions to
dataframe.count / X
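Below is a minimal sketch of that approach in Scala, assuming a SparkSession named spark and a source table called my_big_table; the estimated row size (500 bytes) and the 128 MB target are illustrative placeholders you would replace with numbers measured from your own data.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("dynamic-repartition")
      .getOrCreate()

    val df = spark.sql("SELECT * FROM my_big_table")

    // Aim for roughly one HDFS block (128 MB) per partition.
    val targetPartitionBytes = 128L * 1024 * 1024
    // X in the formula above: estimated bytes per row (a guess here).
    val estimatedRowBytes = 500L
    val rowsPerPartition = targetPartitionBytes / estimatedRowBytes

    // dataframe.count / X, rounded up and kept at least 1.
    val numPartitions = math.max(1, math.ceil(df.count().toDouble / rowsPerPartition).toInt)

    val repartitioned = df.repartition(numPartitions)

Note that df.count() triggers a full pass over the data, so this estimate has a cost of its own; that is the trade-off for getting evenly sized partitions.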