
I am a bit confused by the number of tasks that are created by Spark when reading a number of text files.

Here is the code:

val files = List("path/to/files/a/23",
                 "path/to/files/b/",
                 "path/to/files/c/0")
val ds = spark.sqlContext.read.textFile(files: _*)
ds.count()

Each of the folders a, b, and c contains 24 files, so a total of 26 files are read (one file from a, all 24 from b, and one file from c). Now if I execute an action like .count(), the Spark UI shows me 24 tasks. However, I would have expected 26 tasks, i.e. one task per partition and one partition per file.
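For reference, one quick way to check how many input partitions Spark actually created for this read (and therefore how many tasks the read stage of count() should run) is the snippet below, assuming the same ds as above:

ds.rdd.getNumPartitions   // number of input partitions backing the Dataset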

It would be great if someone could give me some more insights into what is actually happening.

Dominik Müller
    What is the total number of cores you are using for the job? And have you set any configurations? – Simon Schiff Nov 18 '16 at 14:47
  • To make it easier to understand, can you post your code and your Spark UI details? – Thiago Baldim Nov 18 '16 at 14:48
  • @SimonSchiff I used 8 cores and I have nothing configured that I know of. However, that seems to be the correct direction. I tried executing the code on a bigger machine and it had the expected 26 tasks. – Dominik Müller Nov 18 '16 at 15:15

1 Answer


Text files are loaded using Hadoop input formats. The number of partitions depends on:

  • mapreduce.input.fileinputformat.split.minsize
  • mapreduce.input.fileinputformat.split.maxsize
  • minPartitions argument if provided
  • block size
  • compression, if present (splittable / not splittable).

You'll find example computations here: Behavior of the parameter "mapred.min.split.size" in HDFS
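As a rough sketch of how these parameters interact (assuming the Hadoop-format path via sc.textFile; the config key is the standard Hadoop FileInputFormat setting, and the 32 MB cap and minPartitions value below are only illustrative):

// Sketch: influencing the split size for Hadoop-format text input
val sc = spark.sparkContext

// Cap each split at 32 MB so larger files are broken into more partitions (illustrative value)
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.split.maxsize",
                           (32 * 1024 * 1024).toString)

// minPartitions is only a hint; the actual count also depends on file sizes and block size
val rdd = sc.textFile("path/to/files/b/", minPartitions = 26)
println(rdd.getNumPartitions)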
