
I am a bit confused by the number of tasks that are created by Spark when reading a number of text files.

Here is the code:

val files = List("path/to/files/a/23",
                 "path/to/files/b/",
                 "path/to/files/c/0")
val ds = spark.sqlContext.read.textFile(files: _*)
ds.count()

Each of the folders a, b, and c contains 24 files, so a total of 26 files are read (one file from a, all 24 from b, and one file from c). Now if I execute an action like .count(), the Spark UI shows me 24 tasks. However, I would have expected 26 tasks, i.e. one task per partition and one partition per file.
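For reference, one quick way to check how many input partitions Spark actually created for this read (and therefore how many tasks the read stage of count() should run) is the snippet below, assuming the same ds as above:

ds.rdd.getNumPartitions   // number of input partitions backing the Dataset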

It would be great if someone could give me some more insights into what is actually happening.

Dominik Müller
    What is the total number of cores you are using for the job? And have you set any configurations? – Simon Schiff Nov 18 '16 at 14:47
  • To make it easier to understand, can you post your code and your Spark UI details? – Thiago Baldim Nov 18 '16 at 14:48
  • @SimonSchiff I used 8 cores and I have nothing configured that I know of. However, that seems to be the correct direction. I tried executing the code on a bigger machine and it had the expected 26 tasks. – Dominik Müller Nov 18 '16 at 15:15

1 Answer


Text files are loaded using Hadoop input formats. The number of partitions depends on:

  • mapreduce.input.fileinputformat.split.minsize
  • mapreduce.input.fileinputformat.split.maxsize
  • minPartitions argument if provided
  • block size
  • compression, if present (splittable / not splittable).

You'll find example computations here: Behavior of the parameter "mapred.min.split.size" in HDFS
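As a rough sketch of how these parameters interact (assuming the Hadoop-format path via sc.textFile; the config key is the standard Hadoop FileInputFormat setting, and the 32 MB cap and minPartitions value below are only illustrative):

// Sketch: influencing the split size for Hadoop-format text input
val sc = spark.sparkContext

// Cap each split at 32 MB so larger files are broken into more partitions (illustrative value)
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.split.maxsize",
                           (32 * 1024 * 1024).toString)

// minPartitions is only a hint; the actual count also depends on file sizes and block size
val rdd = sc.textFile("path/to/files/b/", minPartitions = 26)
println(rdd.getNumPartitions)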
