
I have a Parquet directory containing 5 files, as shown below:

[screenshot: listing of the Parquet directory showing the 5 files]

I am using Spark 2.2 and reading this directory with the code below:

[screenshot: the Spark code used to read the directory]

I am not clear why Spark determines 7 partitions (alternateDF.rdd().getNumPartitions()) when we have 5 files (each smaller than the block size) in the Parquet directory. 5 tasks have input records, but the last 2 tasks have 0 input records and yet non-zero input data. Could you please explain the behavior of each task?

[screenshot: per-task input records and input size]

Aman

1 Answer


@Aman,

You can follow an older, related question: link

Simply put, the number of partitions depends on the following 3 parameters (from the link above):

  • spark.default.parallelism (roughly translates to #cores available for the application)
  • spark.sql.files.maxPartitionBytes (default 128MB)
  • spark.sql.files.openCostInBytes (default 4MB)
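
You can check what these resolve to in your own session, e.g. (a small sketch, assuming the usual spark SparkSession in spark-shell):

    // Inspect the three values that drive the partition count
    println(spark.sparkContext.defaultParallelism)               // spark.default.parallelism
    println(spark.conf.get("spark.sql.files.maxPartitionBytes")) // default 134217728 (128 MB)
    println(spark.conf.get("spark.sql.files.openCostInBytes"))   // default 4194304 (4 MB)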

See the Spark source code for the exact logic.
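
For illustration, here is a minimal, self-contained sketch of that logic (roughly following what FileSourceScanExec does for non-bucketed reads in Spark 2.x; the object/method names and the example file sizes are hypothetical, not Spark API):

    object PartitionCountSketch {

      def estimateNumPartitions(
          fileSizes: Seq[Long],                          // input file sizes in bytes
          defaultParallelism: Int,                       // spark.default.parallelism
          maxPartitionBytes: Long = 128L * 1024 * 1024,  // spark.sql.files.maxPartitionBytes
          openCostInBytes: Long = 4L * 1024 * 1024       // spark.sql.files.openCostInBytes
      ): Int = {
        // Each file is "charged" an extra openCostInBytes to discourage tiny partitions.
        val totalBytes    = fileSizes.map(_ + openCostInBytes).sum
        val bytesPerCore  = totalBytes / defaultParallelism
        val maxSplitBytes = math.min(maxPartitionBytes, math.max(openCostInBytes, bytesPerCore))

        // 1. Split every file into chunks of at most maxSplitBytes.
        val splits = fileSizes
          .flatMap { len =>
            (0L until len by maxSplitBytes).map(offset => math.min(maxSplitBytes, len - offset))
          }
          .sorted(Ordering[Long].reverse)   // pack the largest splits first

        // 2. Bin-pack the splits: close the current partition when the next split
        //    would push it past maxSplitBytes; every packed split also carries
        //    the openCostInBytes overhead.
        var partitions  = 0
        var currentSize = 0L
        splits.foreach { splitLen =>
          if (currentSize + splitLen > maxSplitBytes) {
            partitions += 1
            currentSize = 0L
          }
          currentSize += splitLen + openCostInBytes
        }
        if (currentSize > 0) partitions += 1
        partitions
      }

      def main(args: Array[String]): Unit = {
        val mb = 1024L * 1024
        // Hypothetical input: five ~40 MB files, 8 cores available.
        println(estimateNumPartitions(Seq.fill(5)(40 * mb), defaultParallelism = 8)) // prints 10
      }
    }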

Naga
  • Thanks for your response, but I am not able to derive this number from the above logic. To simplify the question further, I tried using CSV files as input to the spark.read function. Observations: 5 files in the CSV directory (each approx. 40 MB) gave me 10 partitions; 3 files in the CSV directory (each approx. 60 MB) gave me 9 partitions. Unable to understand this Spark behavior. – Aman Nov 11 '19 at 04:51
  • spark.default.parallelism is 8 and spark.sql.files.maxPartitionBytes is 128MB – Aman Nov 11 '19 at 06:08
  • Did you test with only one big file and see the same behavior? Or is it only happening with more than one file, each less than 128 MB? – Naga Nov 11 '19 at 16:04
  • I did read one CSV file of 233 MB; it gave me 2 partitions. But when I load small files, I get a different number of partitions, as mentioned in my last comment. My question is more about small-file partitions: how are they computed precisely? – Aman Nov 11 '19 at 18:32
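
Working the numbers from these comments through the formula in the answer (assuming spark.default.parallelism = 8 and the default 128 MB / 4 MB settings, as stated above): for 5 CSV files of ~40 MB each, totalBytes = 5 × (40 + 4) MB = 220 MB and bytesPerCore = 220 / 8 = 27.5 MB, so maxSplitBytes = min(128 MB, max(4 MB, 27.5 MB)) = 27.5 MB. Each 40 MB file is split into a 27.5 MB chunk and a 12.5 MB chunk (10 splits in total), and because every split is also charged the 4 MB open cost during bin-packing, no two splits fit into one partition, which gives the observed 10 partitions. The same arithmetic on 3 files of ~60 MB each gives maxSplitBytes = 192 / 8 = 24 MB and three splits per file, i.e. the observed 9 partitions.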