Here are my Spark job's stages: [screenshot of the Spark UI stages view]

It has 260,000 tasks because the job reads more than 200,000 small HDFS files, each about
50 MB and stored in gzip format.

I tried the following settings to reduce the number of tasks, but they didn't work.

...
--conf spark.sql.mergeSmallFileSize=10485760 \
--conf spark.hadoopRDD.targetBytesInPartition=134217728 \
--conf spark.hadoopRDD.targetBytesInPartitionInMerge=134217728 \
...

Is it because the files are gzip-compressed that they cannot be merged?

What can I do to reduce the number of tasks in this job?

  • This is a classic small-files problem. Either repartition your data into fewer partitions, or use the adaptive framework (`spark.sql.adaptive.enabled true`) – palamuGuy Jul 02 '23 at 18:33
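
To make the comment's two suggestions concrete, here is a minimal sketch, assuming the input can be re-read through the DataFrame API and compacted once; the paths, the target partition count (2000), and the output format are placeholders, not values from the original job.

```scala
import org.apache.spark.sql.SparkSession

object CompactSmallFiles {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("compact-small-gzip-files")
      // Suggestion 1 from the comment: enable Adaptive Query Execution so Spark
      // can coalesce small shuffle partitions at runtime.
      .config("spark.sql.adaptive.enabled", "true")
      .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
      // For DataFrame file sources, this caps how many input bytes are packed
      // into a single read partition (128 MB here).
      .config("spark.sql.files.maxPartitionBytes", "134217728")
      .getOrCreate()

    // gzip is not splittable, so one .gz file is always read by a single task,
    // but several small files can still be packed into the same read partition.
    val df = spark.read.text("hdfs:///path/to/small-gz-files/") // placeholder path

    // Suggestion 2 from the comment: explicitly reduce the number of partitions.
    // coalesce avoids a full shuffle; repartition(n) shuffles but balances sizes.
    df.coalesce(2000)
      .write
      .parquet("hdfs:///path/to/compacted-output/") // placeholder path

    spark.stop()
  }
}
```

Writing the compacted copy once and pointing downstream jobs at it keeps the per-file task explosion out of every subsequent run.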

0 Answers