repartition(columnName)
by default creates 200 partitions (more specifically, spark.sql.shuffle.partitions
partitions), no matter how many unique values of col1
there are. If there is only 1 unique value of col1
, then 199 of the partitions are empty. On the other hand, if you have more than 200 unique values of col1
, you will have multiple values of col1
per partition.
If you only want 1 partition, then you can do repartition(1,col("col1"))
or just coalesce(1)
. But note that coalesce
does not behave the same, in the sense that coalesce
may be moved further up in your code, such that you may lose parallelism (see How to prevent Spark optimization).
If you want to check the content of your partitions, I've made 2 methods for this:
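As a sketch of that difference (assuming an existing dataframe df; the variable names are illustrative only):

```scala
// coalesce(1) only merges existing partitions (narrow dependency, no shuffle)
// and may be pushed up the plan, so upstream work can end up on a single task.
// repartition(1) forces a shuffle, so the upstream computation stays parallel.
val merged     = df.coalesce(1)      // no shuffle
val reshuffled = df.repartition(1)   // full shuffle into one partition

println(merged.rdd.getNumPartitions)     // 1
println(reshuffled.rdd.getNumPartitions) // 1
```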
// calculates record count per partition
def inspectPartitions(df: DataFrame) = {
  import df.sqlContext.implicits._
  df.rdd.mapPartitions(partIt =>
    Iterator(partIt.size)
  ).toDF("record_count")
}
// inspects how a given key is distributed across the partitions of a dataframe
def inspectPartitions(df: DataFrame, key: String) = {
  import df.sqlContext.implicits._
  df.rdd.mapPartitions(partIt => {
    // use toSeq, not toSet: toSet would deduplicate rows and undercount
    val part = partIt.toSeq
    val partSize = part.size
    val partKeys = part.map(r => r.getAs[Any](key).toString.trim).distinct
    Iterator((partKeys.toArray, partSize))
  }).toDF("partitions", "record_count")
}
Now you can e.g. check your dataframe like this:
inspectPartitions(df.repartition(col("col1")), "col1")
  .where($"record_count" > 0)
  .show