Here are my Spark job's stages: [screenshot of the Spark UI stages view]

It has 260,000 tasks because the job reads more than 200,000 small HDFS files, each about
50 MB and stored in gzip format.

I tried the following settings to reduce the number of tasks, but they didn't work.

...
--conf spark.sql.mergeSmallFileSize=10485760 \
--conf spark.hadoopRDD.targetBytesInPartition=134217728 \
--conf spark.hadoopRDD.targetBytesInPartitionInMerge=134217728 \
...

Is it because the files are gzip-compressed that they cannot be merged?

What can I do to reduce the number of tasks in this job?

  • This is a classic small-files problem. Either repartition your data into fewer partitions, or use the adaptive framework (`spark.sql.adaptive.enabled true`) – palamuGuy Jul 02 '23 at 18:33
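
To make the comment's two suggestions concrete, here is a minimal sketch, assuming the input can be re-read through the DataFrame API and compacted once; the paths, the target partition count (2000), and the output format are placeholders, not values from the original job.

```scala
import org.apache.spark.sql.SparkSession

object CompactSmallFiles {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("compact-small-gzip-files")
      // Suggestion 1 from the comment: enable Adaptive Query Execution so Spark
      // can coalesce small shuffle partitions at runtime.
      .config("spark.sql.adaptive.enabled", "true")
      .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
      // For DataFrame file sources, this caps how many input bytes are packed
      // into a single read partition (128 MB here).
      .config("spark.sql.files.maxPartitionBytes", "134217728")
      .getOrCreate()

    // gzip is not splittable, so one .gz file is always read by a single task,
    // but several small files can still be packed into the same read partition.
    val df = spark.read.text("hdfs:///path/to/small-gz-files/") // placeholder path

    // Suggestion 2 from the comment: explicitly reduce the number of partitions.
    // coalesce avoids a full shuffle; repartition(n) shuffles but balances sizes.
    df.coalesce(2000)
      .write
      .parquet("hdfs:///path/to/compacted-output/") // placeholder path

    spark.stop()
  }
}
```

Writing the compacted copy once and pointing downstream jobs at it keeps the per-file task explosion out of every subsequent run.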

0 Answers