I am processing data partitioned by column "A" with PySpark.
Now I need to apply a window function over another column "B" to get the maximum value within each frame and increment it for new entries.
As it says here, "Also, the user might want to make sure all rows having the same value for the category column are collected to the same machine before ordering and calculating the frame."
Do I need to manually repartition the data by column "B" before applying the window, or does Spark do this automatically?
That is, would I have to do:
data = data.repartition("B")
before:
w = Window.partitionBy("B").orderBy(col("id").desc())
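For context, here is a minimal, self-contained sketch of what I am doing; the sample rows and the max_id column name are made up for illustration:

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, max as spark_max

spark = SparkSession.builder.getOrCreate()

# Toy data standing in for my real DataFrame, which is partitioned by "A"
data = spark.createDataFrame(
    [("a1", "b1", 1), ("a1", "b1", 2), ("a2", "b2", 3)],
    ["A", "B", "id"],
)

# Optional repartition -- this is the step I am asking about
data = data.repartition("B")

# All rows sharing the same "B", newest id first
w = Window.partitionBy("B").orderBy(col("id").desc())

# Max id per "B" frame, from which I would derive the next id
data = data.withColumn("max_id", spark_max("id").over(w))
data.show()

This runs either way; my question is whether the explicit repartition("B") is redundant given the partitionBy("B") on the window.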
Thanks a lot!