
Can someone explain how Spark determines the number of tasks when reading data? How is it related to the number of partitions of the input file and to the number of cores?

I have a dataset (91 MB) that is divided into 14 partitions (~6.5 MB each). I did 2 tests:

  • test 1 - I loaded this dataset using 2 executors, 2 cores each
  • test 2 - I loaded this dataset using 4 executors, 2 cores each

Results:

  • test 1 - Spark created 5 tasks to read data (in each task ~18 MB was loaded)
  • test 2 - Spark created 7 tasks to read data (in each task ~13 MB was loaded)

I don't see any regularity here. I see that Spark somehow reduces the number of partitions, but by what rule? Could someone help?
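For reference, here is a minimal sketch of how one of these runs might look (the app name, dataset path, and format are illustrative placeholders, and the executor settings would normally be passed via spark-submit), together with how the resulting partition count can be checked:

```scala
import org.apache.spark.sql.SparkSession

// Sketch of "test 1": 2 executors with 2 cores each.
// These configs are usually passed on the command line; they are set here
// only to keep the example self-contained.
val spark = SparkSession.builder()
  .appName("partition-task-test")
  .config("spark.executor.instances", "2")
  .config("spark.executor.cores", "2")
  .getOrCreate()

// Hypothetical path and format for the ~91 MB, 14-partition dataset.
val df = spark.read.parquet("/data/my_91mb_dataset")

// The number of partitions of this DataFrame is the number of read tasks
// the first stage runs.
println(s"read partitions = ${df.rdd.getNumPartitions}")
```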

Pawel

1 Answer


Spark needs to create a total of 14 tasks to process a file with 14 partitions: within a stage, each task is assigned to one partition.

Now, if you provide more resources, Spark will run more of those tasks in parallel, so you will see more tasks running concurrently when processing starts. As tasks finish, a new set of tasks is started, depending on the resources you have provided. Overall, Spark will launch 14 tasks to process the file.

Spark won't reduce the number of partitions of the file unless you repartition or coalesce it.
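As a rough illustration (the path below is hypothetical, and `spark` is the SparkSession that spark-shell provides by default), the partition count of the loaded data is what drives the task count, and it only changes when you explicitly repartition or coalesce:

```scala
val df = spark.read.parquet("/data/my_91mb_dataset")
println(df.rdd.getNumPartitions)    // e.g. 14 -> 14 tasks in the read stage

// coalesce merges partitions without a shuffle
val fewer = df.coalesce(4)
println(fewer.rdd.getNumPartitions) // 4

// repartition does a full shuffle into the requested number of partitions
val more = df.repartition(28)
println(more.rdd.getNumPartitions)  // 28
```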

Avishek Bhattacharya
  • Part(1/2) Thank you for your answer. I understand that Spark will create 14 tasks and will not reduce the number of partitions unless I repartition or coalesce. But: 1) 'Overall, Spark will launch 14 tasks to process the file' - if so, why do I see only 5 tasks (test 1) in the UI? 2) 'If you provide more resources, Spark will run more of those tasks in parallel' - if I have 2 executors with 2 cores each, then by my understanding only 4 partitions can be processed in parallel (assuming a single core can only run one task at a time). How can Spark achieve higher parallelism? – Pawel Jan 22 '23 at 09:54
  • Part(2/2) I expected to see 14 tasks in the UI, but I see fewer. If, as you mentioned, Spark 'runs more of the tasks in parallel', is it possible to predict how many tasks I'll see in the UI before starting the job? If Spark is doing some extra optimization that can't be clearly explained, it's like a black box to me. I would appreciate a clearer explanation. Thank you! – Pawel Jan 22 '23 at 09:56
  • The unit of execution in Spark is a task, and each task runs on a single executor at a time. So if you have provided 2 executors with 2 cores each, at most 4 tasks will run in parallel across the two executors. When those tasks finish, the same executors take the next set of tasks and process them. Spark may not create all the tasks at the same time (it does some scheduling), but overall it will perform 14 tasks to process your file - unless your data size is small, in which case Spark could also coalesce the partitions internally based on the `defaultMinPartitions` parameter (see the sketch after these comments). – Avishek Bhattacharya Jan 22 '23 at 11:09
  • It is very hard to predict how many tasks Spark will eventually create; Spark performs many optimizations, and it is not always easy to predict. If you want to achieve higher parallelism: ensure your partitions are not very small, ensure you have enough partitions in the data, and provide more executors to Spark. This will let Spark create more parallelism. – Avishek Bhattacharya Jan 22 '23 at 11:12
  • Thank you Avishek. I generally agree with everything you wrote - the total number of tasks, the relationship between tasks and cores, and how to increase parallelism. The only thing that is unfortunately still not clear to me is why I see only 5 tasks in the UI even though Spark does 14 tasks. – Pawel Jan 22 '23 at 16:04
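To make the numbers in these comments concrete, here is a small sketch (run e.g. in spark-shell, where `spark` is predefined; the dataset path is hypothetical) that inspects the values the discussion hinges on: the default parallelism derived from the total executor cores, the `defaultMinPartitions` setting mentioned above, and the number of read partitions actually created.

```scala
val sc = spark.sparkContext

// Total cores across all executors; with 2 executors x 2 cores this is 4,
// i.e. at most 4 tasks can run at the same time.
println(s"defaultParallelism   = ${sc.defaultParallelism}")

// Default minimum number of partitions Spark uses for reads when none is
// specified (the `defaultMinPartitions` mentioned in the comment above).
println(s"defaultMinPartitions = ${sc.defaultMinPartitions}")

// Number of read tasks actually created for the dataset.
val df = spark.read.parquet("/data/my_91mb_dataset")
println(s"read partitions      = ${df.rdd.getNumPartitions}")
```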