
I use spark-sql to read a big table, and it generates 100,000 tasks.

I know I can set num_of_partitions, but that would do the same to small tables.

Is there any way to limit the size of each partition?
no123ff

1 Answer


Currently Spark doesn't support limiting partition size directly. You need to set the number of partitions to a smaller value if you want to reduce the number of tasks.

The trick generally used is to set the number of partitions dynamically based on the data size. Generally you want each partition to be roughly the size of an HDFS block (128 MB). If you know the size of each row of your data, you can estimate how many rows you want to keep per partition. Let's say that value is X.

Then you can set num_of_partitions to be

dataframe.count / X
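
For illustration, a minimal sketch of this approach in Scala; the table name, output path, and rows-per-partition value (X) are placeholders you would tune per table so a partition lands near the 128 MB block size:

import org.apache.spark.sql.{DataFrame, SparkSession}

object PartitionByRowCount {
  // Repartition so that each partition holds roughly `rowsPerPartition` rows
  // (X in the answer above); the count is one extra pass over the data.
  def repartitionByRows(df: DataFrame, rowsPerPartition: Long): DataFrame = {
    val numPartitions = math.max(1, (df.count() / rowsPerPartition).toInt)
    df.repartition(numPartitions)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("partition-sizing").getOrCreate()

    // "my_big_table", the output path and X = 500,000 are assumed values.
    val df = spark.table("my_big_table")
    val sized = repartitionByRows(df, rowsPerPartition = 500000L)
    sized.write.mode("overwrite").parquet("/tmp/partition_sizing_output")
  }
}

Because rowsPerPartition is a parameter, the same code scales the partition count for both big and small tables.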
Avishek Bhattacharya
  • But I want to use the same code on different tables; big tables need more partitions than small tables. – no123ff Dec 26 '17 at 10:01
  • dataframe.count / X will give a different partition number for each table. The only thing you need is the row size (X) of each table, and that can be parameterized. In that case the same code works for big and small tables. – Avishek Bhattacharya Dec 26 '17 at 10:02
  • Oh, I see what you mean... but count would be too expensive if the number of rows is very large, right? – no123ff Dec 26 '17 at 10:07
  • Count is generally not an expensive operation because it doesn't need to load all the data. If you count the number of rows after reading the data, it won't be expensive at all compared to the rest of the job. – Avishek Bhattacharya Dec 26 '17 at 11:17