I am processing data partitioned by column "A" with PySpark.
Now I need to apply a window function over another column "B" to get the maximum value within each frame and increment it for new entries.
As it says here, "Also, the user might want to make sure all rows having the same value for the category column are collected to the same machine before ordering and calculating the frame."
Do I need to manually repartition the data by column "B" before applying the window, or does Spark do this automatically?
That is, would I have to do:
data = data.repartition("B")
before:
w = Window.partitionBy("B").orderBy(col("id").desc())
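For context, here is a minimal, self-contained sketch of what I am doing; the sample rows and the max_id column name are made up for illustration:

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, max as spark_max

spark = SparkSession.builder.getOrCreate()

# Toy data standing in for my real DataFrame, which is partitioned by "A"
data = spark.createDataFrame(
    [("a1", "b1", 1), ("a1", "b1", 2), ("a2", "b2", 3)],
    ["A", "B", "id"],
)

# Optional repartition -- this is the step I am asking about
data = data.repartition("B")

# All rows sharing the same "B", newest id first
w = Window.partitionBy("B").orderBy(col("id").desc())

# Max id per "B" frame, from which I would derive the next id
data = data.withColumn("max_id", spark_max("id").over(w))
data.show()

This runs either way; my question is whether the explicit repartition("B") is redundant given the partitionBy("B") on the window.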
Thanks a lot!