I am trying to understand the difference between coalesce() and repartition().
If I understood this answer correctly, coalesce() can only reduce the number of partitions of a DataFrame; if we try to increase the number of partitions, the partition count remains unchanged.
But when I executed the code below, I observed two things:
- For a DataFrame, the number of partitions can be increased with coalesce.
- For an RDD, if shuffle = false, the number of partitions cannot be increased with coalesce.
Does this mean that DataFrame partitions can be increased with coalesce?
Applying coalesce to dataframe
When I execute the following code:
val h1b1Df = spark.read.csv("/FileStore/tables/h1b_data.csv")
println("Original dataframe partitions = " + h1b1Df.rdd.getNumPartitions)
val coalescedDf = h1b1Df.coalesce(2)
println("Coalesced dataframe partitions = " + coalescedDf.rdd.getNumPartitions)
val coalescedDf1 = coalescedDf.coalesce(6)
println("Coalesced dataframe with increased partitions = " + coalescedDf1.rdd.getNumPartitions)
I get the following output:
Original dataframe partitions = 8
Coalesced dataframe partitions = 2
Coalesced dataframe with increased partitions = 6
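One way to check what Spark actually executes for the two chained calls is to print the query plans. The following is a minimal sketch, not the code from the run above: the local SparkSession and the synthetic 8-partition dataset built with spark.range are assumptions standing in for the notebook session and the CSV file.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in a notebook, `spark` is already provided.
val spark = SparkSession.builder()
  .master("local[8]")
  .appName("coalesce-plan-check")
  .getOrCreate()

// Synthetic stand-in for the CSV: a Dataset created with 8 partitions.
val df = spark.range(0L, 1000L, 1L, 8)

// Same chain as in the question: reduce first, then ask for more partitions.
val twice = df.coalesce(2).coalesce(6)

// Prints the parsed, analyzed, optimized, and physical plans.
// Comparing the analyzed plan with the optimized plan shows whether
// both coalesce steps are still present after optimization.
twice.explain(true)

println("Partitions after coalesce(2).coalesce(6) = " + twice.rdd.getNumPartitions)
```

If the optimized plan contains fewer coalesce steps than the analyzed plan, the partition count reported at the end reflects the optimized plan rather than the two calls as written.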
Applying coalesce to RDD
When I execute the following code:
val inpRdd = h1b1Df.rdd
println("Original rdd partitions = " + inpRdd.getNumPartitions)
val coalescedRdd = inpRdd.coalesce(4)
println("Coalesced rdd partitions = " + coalescedRdd.getNumPartitions)
val coalescedRdd1 = coalescedRdd.coalesce(6, false)
println("Coalesced rdd with increased partitions = " + coalescedRdd1.getNumPartitions)
I get the following output:
Original rdd partitions = 8
Coalesced rdd partitions = 4
Coalesced rdd with increased partitions = 4
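For comparison, the RDD API documentation notes that coalesce with shuffle = true can go to a larger number of partitions. A minimal sketch of that variant next to the shuffle = false case, assuming a local SparkSession and a synthetic RDD created with parallelize (both are stand-ins, not the data used above):

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in a notebook, `spark` is already provided.
val spark = SparkSession.builder()
  .master("local[8]")
  .appName("rdd-coalesce-shuffle")
  .getOrCreate()

// Synthetic stand-in for h1b1Df.rdd: an RDD with 8 partitions.
val rdd = spark.sparkContext.parallelize(1 to 1000, 8)

// Reduce to 4 partitions without a shuffle.
val reduced = rdd.coalesce(4)

// Ask for 6 partitions, once without and once with a shuffle.
val noShuffle = reduced.coalesce(6, shuffle = false)
val withShuffle = reduced.coalesce(6, shuffle = true)

println("coalesce(6, shuffle = false) partitions = " + noShuffle.getNumPartitions)
println("coalesce(6, shuffle = true) partitions = " + withShuffle.getNumPartitions)
```

Printing both counts side by side makes it easy to see that the shuffle flag, not the requested number, decides whether an RDD can end up with more partitions.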