
I have a PySpark dataframe with 1.6 million records. I sorted it and then did a groupBy, hoping the sort order would be preserved so that I could select the last value of the sorted column within each group. However, it seems the sort order is not necessarily preserved by the groupBy. Should I use a PySpark Window instead of a sort and groupBy?

output_data = input_data.sort(F.col("id"))\
                .sort(F.col("date").asc())\
                .groupBy("id").agg(F.last("date").alias("date"))
SUNIL DHAPPADHULE
sammanic
  • @sammanic - welcome to Stack Overflow. Yes, you need to use a Window function, with partitionBy on "id" and orderBy on "date". – Shantanu Sharma May 16 '19 at 11:49
  • @Shan - Thank you for confirming that I need a Window function. Is it a known issue in PySpark that the sort order is not preserved by a subsequent groupBy? – sammanic May 16 '19 at 13:23
  • This is not an issue. Your requirement is to create partitions and sort the data within those partitions; Window functions are designed for exactly that purpose. – Shantanu Sharma May 16 '19 at 13:46
  • Possible duplicate of [Spark DataFrame: does groupBy after orderBy maintain that order?](https://stackoverflow.com/questions/39505599/spark-dataframe-does-groupby-after-orderby-maintain-that-order) – pault May 16 '19 at 19:05

0 Answers