
I'm new to Koalas, and I was surprised that calling sort_index() or sort_values() automatically increases the number of Spark partitions.

Example:

import numpy as np
import databricks.koalas as ks

df = ks.DataFrame({'B': ['B2', 'B3', 'B6', 'B7'],
                   'D': ['D2', np.nan, 'D6', 'D7'],
                   'F': ['F2', 'F3', 'F6', 'F7']},
                  index=[0, 3, 6, 7])

print(df.spark.repartition(2).to_spark().rdd.getNumPartitions())

Output:

2

If I sort by an arbitrary column (or by the index), for example:

print(df.spark.repartition(2).sort_values(by='B').to_spark().rdd.getNumPartitions())

Output:

4

Why does this happen?

I also tried this with a bigger dataset, and the number of partitions increased even more (from 12 to 200).

Devilfire
