Koalas sort_index increases Spark partitions

Asked Jan 02 '21 at 22:31

Active Jan 02 '21 at 22:31

Viewed 116 times

I'm new to Koalas, and I was surprised that when I use the methods sort_index() and sort_values(), the number of Spark partitions increases automatically.

Example:

import numpy as np  # needed for np.nan below
import databricks.koalas as ks

df = ks.DataFrame({'B': ['B2', 'B3', 'B6', 'B7'],
                   'D': ['D2', np.nan, 'D6', 'D7'],
                   'F': ['F2', 'F3', 'F6', 'F7']},
                  index=[0, 3, 6, 7])

print(df.spark.repartition(2).to_spark().rdd.getNumPartitions())

Output:

If I sort by an arbitrary column (or by the index), like this:

print(df.spark.repartition(2).sort_values(by='B').to_spark().rdd.getNumPartitions())

Output:

Why does this happen?

I also tried with a bigger dataset, and the number of partitions increased even more (from 12 to 200).

Devilfire

`sort` does a shuffle; the upper bound can be set via the `spark.sql.shuffle.partitions` option. See also https://stackoverflow.com/questions/53786188/number-of-dataframe-partitions-after-sorting – UninformedUser Jan 03 '21 at 07:55
Thank you! I didn't know about that – Devilfire Jan 03 '21 at 17:58

0 Answers