
I have a PySpark dataframe with 1.6 million records. I sorted it and then did a groupBy, hoping the sort order would be preserved so that I could select the last value of the sorted column within each group. However, it seems the sort order is not necessarily preserved by the groupBy. Should I use a PySpark Window instead of a sort and groupBy?

output_data = input_data.sort(F.col("id"))\
                .sort(F.col("date").asc())\
                .groupBy("id").agg(F.last("date").alias("date"))
SUNIL DHAPPADHULE
sammanic
  • @sammanic - welcome to Stack Overflow. Yes, you need to use a Window function, with partitionBy on "id" and orderBy on "date". – Shantanu Sharma May 16 '19 at 11:49
  • @Shan - Thank you for confirming that I need a Window function. Is it a known issue in PySpark that the sort order is not preserved by a subsequent groupBy? – sammanic May 16 '19 at 13:23
  • This is not an issue. Your requirement is to create partitions and sort the data within those partitions; Window functions are designed for exactly that purpose. – Shantanu Sharma May 16 '19 at 13:46
  • Possible duplicate of [Spark DataFrame: does groupBy after orderBy maintain that order?](https://stackoverflow.com/questions/39505599/spark-dataframe-does-groupby-after-orderby-maintain-that-order) – pault May 16 '19 at 19:05

0 Answers