
I have a Parquet directory containing 5 files, as shown below:

[screenshot: listing of the Parquet directory showing the 5 files]

I am using Spark 2.2 and reading this directory with the code below:

[screenshot: the Spark code used to read the directory]

I am not clear why Spark determines 7 partitions (alternateDF.rdd().getNumPartitions()) when we have 5 files (each smaller than the block size) in the Parquet directory. 5 tasks have input records, but the last 2 tasks have 0 input records and yet non-zero input data. Could you please explain the behavior of each task?

[screenshot: per-task input records and input size]

Aman

1 Answer


@Aman,

You can follow an older, related question: link

Simply put, the number of partitions depends on the following 3 parameters (from the link above):

  • spark.default.parallelism (roughly translates to #cores available for the application)
  • spark.sql.files.maxPartitionBytes (default 128MB)
  • spark.sql.files.openCostInBytes (default 4MB)
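
You can check what these resolve to in your own session, e.g. (a small sketch, assuming the usual spark SparkSession in spark-shell):

    // Inspect the three values that drive the partition count
    println(spark.sparkContext.defaultParallelism)               // spark.default.parallelism
    println(spark.conf.get("spark.sql.files.maxPartitionBytes")) // default 134217728 (128 MB)
    println(spark.conf.get("spark.sql.files.openCostInBytes"))   // default 4194304 (4 MB)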

See the Spark source code for the exact logic.
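
For illustration, here is a minimal, self-contained sketch of that logic (roughly following what FileSourceScanExec does for non-bucketed reads in Spark 2.x; the object/method names and the example file sizes are hypothetical, not Spark API):

    object PartitionCountSketch {

      def estimateNumPartitions(
          fileSizes: Seq[Long],                          // input file sizes in bytes
          defaultParallelism: Int,                       // spark.default.parallelism
          maxPartitionBytes: Long = 128L * 1024 * 1024,  // spark.sql.files.maxPartitionBytes
          openCostInBytes: Long = 4L * 1024 * 1024       // spark.sql.files.openCostInBytes
      ): Int = {
        // Each file is "charged" an extra openCostInBytes to discourage tiny partitions.
        val totalBytes    = fileSizes.map(_ + openCostInBytes).sum
        val bytesPerCore  = totalBytes / defaultParallelism
        val maxSplitBytes = math.min(maxPartitionBytes, math.max(openCostInBytes, bytesPerCore))

        // 1. Split every file into chunks of at most maxSplitBytes.
        val splits = fileSizes
          .flatMap { len =>
            (0L until len by maxSplitBytes).map(offset => math.min(maxSplitBytes, len - offset))
          }
          .sorted(Ordering[Long].reverse)   // pack the largest splits first

        // 2. Bin-pack the splits: close the current partition when the next split
        //    would push it past maxSplitBytes; every packed split also carries
        //    the openCostInBytes overhead.
        var partitions  = 0
        var currentSize = 0L
        splits.foreach { splitLen =>
          if (currentSize + splitLen > maxSplitBytes) {
            partitions += 1
            currentSize = 0L
          }
          currentSize += splitLen + openCostInBytes
        }
        if (currentSize > 0) partitions += 1
        partitions
      }

      def main(args: Array[String]): Unit = {
        val mb = 1024L * 1024
        // Hypothetical input: five ~40 MB files, 8 cores available.
        println(estimateNumPartitions(Seq.fill(5)(40 * mb), defaultParallelism = 8)) // prints 10
      }
    }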

Naga
  • Thanks for your response, but I am not able to derive this number from the above logic. To simplify the question further, I tried using CSV files as input to the spark.read function. Observations: 5 files in the CSV directory (each approx. 40 MB) gave me 10 partitions; 3 files in the CSV directory (each approx. 60 MB) gave me 9 partitions. Unable to understand this Spark behavior. – Aman Nov 11 '19 at 04:51
  • spark.default.parallelism is 8 and spark.sql.files.maxPartitionBytes is 128MB – Aman Nov 11 '19 at 06:08
  • Did you test with only one big file and see the same behavior? Or is it only happening with more than one file, each less than 128 MB? – Naga Nov 11 '19 at 16:04
  • I did read one CSV file of 233 MB; it gave me 2 partitions. But when I load small files, I get a different number of partitions, as mentioned in my last comment. My question is more about small-file partitions: how are they computed precisely? – Aman Nov 11 '19 at 18:32
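
Working the numbers from these comments through the formula in the answer (assuming spark.default.parallelism = 8 and the default 128 MB / 4 MB settings, as stated above): for 5 CSV files of ~40 MB each, totalBytes = 5 × (40 + 4) MB = 220 MB and bytesPerCore = 220 / 8 = 27.5 MB, so maxSplitBytes = min(128 MB, max(4 MB, 27.5 MB)) = 27.5 MB. Each 40 MB file is split into a 27.5 MB chunk and a 12.5 MB chunk (10 splits in total), and because every split is also charged the 4 MB open cost during bin-packing, no two splits fit into one partition, which gives the observed 10 partitions. The same arithmetic on 3 files of ~60 MB each gives maxSplitBytes = 192 / 8 = 24 MB and three splits per file, i.e. the observed 9 partitions.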