Can someone explain how Spark determines the number of tasks when reading data? How is it related to the number of partitions of the input file and to the number of cores?
I have a dataset (91 MB) that is divided into 14 partitions (~6.5 MB each). I did 2 tests (the session setup is sketched right after this list):
- test 1 - I loaded this dataset using 2 executors, 2 cores each
- test 2 - I loaded this dataset using 4 executors, 2 cores each
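To be concrete about the setup, here is a minimal sketch of how each session is sized. The config keys below are just one way to express it (the same values could equally be passed as spark-submit flags), and the app name is a placeholder:

```scala
import org.apache.spark.sql.SparkSession

// Test 1: 2 executors with 2 cores each; test 2 uses 4 executor instances instead.
// Sketch only: the sizing could also come from spark-submit flags.
val spark = SparkSession.builder()
  .appName("partition-count-test") // placeholder name
  .config("spark.executor.instances", "2")
  .config("spark.executor.cores", "2")
  .getOrCreate()
```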
Results:
- test 1 - Spark created 5 tasks to read the data (~18 MB loaded per task)
- test 2 - Spark created 7 tasks to read the data (~13 MB loaded per task)
I don't see any regularity here. I can see that Spark somehow reduces the number of partitions on read (14 input partitions become 5 or 7 tasks), but by what rule does it do so? Could someone help?
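For reference, the read itself is just a plain load followed by a partition count check. A minimal sketch, reusing the `spark` session from above and assuming a Parquet file at a placeholder path (the actual format and path aren't essential to the question):

```scala
// Load the 91 MB dataset; format and path are placeholders for this sketch.
val df = spark.read.parquet("/path/to/dataset")

// The partition count of the resulting DataFrame should match the number of
// tasks Spark schedules for the read stage.
println(df.rdd.getNumPartitions)
```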