I am a bit confused by the number of tasks Spark creates when reading several text files.
Here is the code:
val files = List("path/to/files/a/23",
"path/to/files/b/",
"path/to/files/c/0")
val ds = spark.sqlContext.read.textFile(files: _*)
ds.count()
Each of the folders a, b and c
contains 24 files. The list above selects a single file from a (a/23), the complete folder b, and a single file from c (c/0), so in total 1 + 24 + 1 = 26 files are read. Now if I execute an action, like .count()
, the Spark UI shows me 24 tasks. However, I would have expected 26 tasks: one task per partition and one partition per file.
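To double-check, one could inspect the partition count directly, since each input partition should become one task in the count() stage (a minimal sketch, assuming the same ds as above):
// Each partition of the read Dataset corresponds to one task in the
// count() stage, so this should match the task count in the Spark UI.
println(ds.rdd.getNumPartitions)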
It would be great if someone could give me some more insights into what is actually happening.