I ran various tests to look at this more empirically, and also looked at range partitioning for sorting, which is the crux of the matter here. See How does range partitioner work in Spark?.
I experimented with both a single distinct value for "n", as in the example in the question, and more than one distinct value for "n", using various DataFrame sizes with df.orderBy($"n"):
- it is clear that the calculation that determines the number of partitions holding the ranges of data for the subsequent sort via mapPartitions,
- which is based on sampling from the existing partitions before computing some heuristically optimal number of partitions for these ranges,
- will in most cases generate N+1 partitions, where partition N+1 is empty.
The fact that the extra partition is nearly always empty makes me think there is a calculation error somewhere in the code, in other words a small bug, imho.
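To make the N+1 effect concrete, here is a minimal sketch in plain Scala of how a range partitioner could end up with a trailing empty partition. It is not Spark's code: `RangeBoundsSketch`, `partitionFor` and `partitionSizes` are hypothetical names, the bounds are picked by hand rather than sampled, and the rule "k upper bounds give k + 1 partitions, and a key goes to the first partition whose bound is >= the key" is my assumption about how the ranges are applied.

```scala
// Illustrative only: a simplified stand-in for range partitioning, not Spark's implementation.
object RangeBoundsSketch {

  // Assumption: with k upper bounds there are k + 1 partitions. A key goes to the
  // first partition whose upper bound is >= the key, or to the last partition
  // if it is greater than every bound.
  def partitionFor(key: String, bounds: Seq[String]): Int = {
    val idx = bounds.indexWhere(b => key <= b)
    if (idx >= 0) idx else bounds.length
  }

  // Count how many keys land in each of the bounds.length + 1 partitions.
  def partitionSizes(keys: Seq[String], bounds: Seq[String]): List[Int] = {
    val counts = Array.fill(bounds.length + 1)(0)
    keys.foreach(k => counts(partitionFor(k, bounds)) += 1)
    counts.toList
  }

  def main(args: Array[String]): Unit = {
    val keys = Seq("a", "a", "b", "c")

    // Hand-picked bounds whose last value equals the maximum key:
    // no key falls past it, so the final partition stays empty.
    println(partitionSizes(keys, Seq("a", "b", "c"))) // List(2, 1, 1, 0)

    // Hand-picked bounds that stop before the maximum key: no empty partition.
    println(partitionSizes(keys, Seq("a", "b")))      // List(2, 1, 1)
  }
}
```

If the computed bounds happen to include the largest key, one more partition than needed is allocated and it receives nothing. Whether that is what actually happens inside Spark here is something I have not verified.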
I base this on the following simple test, which does return what I suspect RR would consider to be the proper number of partitions:
val df_a1 = (1 to 1).map(i => ("a",i)).toDF("n","i").cache
val df_a2 = (1 to 1).map(i => ("b",i)).toDF("n","i").cache
val df_a3 = (1 to 1).map(i => ("c",i)).toDF("n","i").cache
val df_b = df_a1.union(df_a2)
val df_c = df_b.union(df_a3)
df_c.orderBy($"n")
.rdd
.mapPartitionsWithIndex{case (i,rows) => Iterator((i,rows.size))}
.toDF("partition_number","number_of_records")
.show(100,false)
returns:
+----------------+-----------------+
|partition_number|number_of_records|
+----------------+-----------------+
|0               |1                |
|1               |1                |
|2               |1                |
+----------------+-----------------+
This boundary example calculation is rather simple. As soon as I use 1 to 2 or 1 to N for any of the "n" values, an extra empty partition results:
+----------------+-----------------+
|partition_number|number_of_records|
+----------------+-----------------+
|0               |2                |
|1               |1                |
|2               |1                |
|3               |0                |
+----------------+-----------------+
The sorting requires all data for a given "n" or set of "n" to be in the same partition.