How does Spark repartitioning work w.r.t to the input file partitioning?

Question

I have 2 questions:

Can we have less partitions set in a call to coalesce than the HDFS block size? e.g. Suppose I have a 1 GB file size and HDFS block size is 128MB, can I do coalesce(1)?
As we know, input files on HDFS are physically split on the basis of block size. Does Spark further split the data (physically) when we repartition, or change parallelism?

Chris · Accepted Answer · 2021-04-28T14:10:24.793

1

e.g suppose I have a 1 GB file size and hdfs block size is 128MB. can I do coalesce(1)?

Yes, you can coalesce to a single file and write that to an external file system (at least with EMRFS)

does spark further splits the data (physically) when we repartition or change parallelism ?

repartition slices the data into partitions independently of the partitioning of the original input files.

edited Apr 28 '21 at 14:10

answered Apr 27 '21 at 13:36

Chris

Chris, my first question is to set less number of partitions than the total number of blocks – Abhinav Kumar Apr 27 '21 at 13:44
Thanks, I misunderstood. I amended my answer now. – Chris Apr 27 '21 at 13:51
Thanks Chris. I have one more confusion. Though spark.sql.shuffle.partition is set to 200, I always get random number of part files in hdfs/s3 whenever I do join. It never come equal to 200. Though documentation say that suffle.partition decides number of partition after shuffle operations(like join) – Abhinav Kumar Apr 27 '21 at 13:58
1

It is not clear from docs how the dataframe written to storage by DataFrameWriter is sliced into part files by default. However, you can control it with partitionBy, coalesce and repartition functions. https://stackoverflow.com/questions/44878294/why-spark-dataframe-is-creating-wrong-number-of-partitions may help you. – Chris Apr 27 '21 at 14:18
In spark number of partitions are equal to input splits. I have a 5MB file, when I read it as RDD and default parallelism set to 2, then I see number of partitions as 2. What is the precedence here , is it InputSplit given priority over Default Parallelism ? – Abhinav Kumar Apr 27 '21 at 16:06

1 Answers1