I'm new to Koalas, and I was surprised that when I use the methods sort_index() and sort_values(), the number of Spark partitions increases automatically.
Example:
import databricks.koalas as ks
import numpy as np

df = ks.DataFrame({'B': ['B2', 'B3', 'B6', 'B7'],
                   'D': ['D2', np.nan, 'D6', 'D7'],
                   'F': ['F2', 'F3', 'F6', 'F7']},
                  index=[0, 3, 6, 7])
print(df.spark.repartition(2).to_spark().rdd.getNumPartitions())
Output:
2
If I sort by any column (or by the index), like
print(df.spark.repartition(2).sort_values(by='B').to_spark().rdd.getNumPartitions())
Output:
4
Why does this happen?
I also tried with a bigger dataset, and the number of partitions increased even more (from 12 to 200).
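I'm guessing it might be related to Spark's shuffle-partition setting, since its default value of 200 matches what I saw with the bigger dataset. Here is a minimal sketch (assuming a default Spark session) of how I checked that setting:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Shuffle-partition setting used by Spark for shuffles (sorts, joins, aggregations);
# prints '200' unless it has been overridden in the session config
print(spark.conf.get("spark.sql.shuffle.partitions"))

Is this the setting that controls what I'm seeing, and if so, is that the expected behaviour for sort_index()/sort_values() in Koalas?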