I would like to filter my dataframe to keep only the rows with the maximum value in the column some_date.

df.filter(F.col('some_date') == F.max('some_date')) fails, because max is an aggregate function and cannot be used in a filter without an aggregation.

I also tried to get the max_date value first so I could use it in a filter: max_date = df.groupBy().max('some_date'), but that failed with: "some_date" is not a numeric column. Aggregation function can only be applied on a numeric column.

In SQL, I would achieve this with a subquery (to the effect of where some_date = (select max(some_date) from ...)), but I thought there would be a better way to structure it in PySpark.
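
For reference, the closest I can get to the SQL subquery is to compute the max first and then filter with it (a sketch, assuming a DataFrame df with a some_date column; note that F.max, unlike GroupedData.max, also accepts date columns):

```python
from pyspark.sql import functions as F

# Compute the max of some_date as a one-row DataFrame, then pull the
# scalar value to the driver. F.max works on date/timestamp columns,
# which is where GroupedData.max('some_date') failed above.
max_date = df.agg(F.max("some_date")).collect()[0][0]

# Filter against the collected scalar.
result = df.filter(F.col("some_date") == max_date)
```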

  • df.groupby().agg({"some_date":"max"}).show() ?? – frank Nov 30 '18 at 10:48
  • I think you want an aggregate window over dates https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-functions-windows.html – OneCricketeer Nov 30 '18 at 10:50
  • 1
    Possible duplicate of [how to get max(date) from given set of data grouped by some fields using pyspark?](https://stackoverflow.com/q/38377894/10465355) – 10465355 Nov 30 '18 at 10:56
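
For completeness, a sketch of the window approach suggested in the comments above (assuming no grouping columns; an empty partitionBy() defines one global window, and Spark will warn that all rows are moved to a single partition):

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# One global window over the whole DataFrame; add columns to
# partitionBy(...) if you want the max per group instead.
w = Window.partitionBy()

result = (df
          .withColumn("max_date", F.max("some_date").over(w))
          .filter(F.col("some_date") == F.col("max_date"))
          .drop("max_date"))
```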

0 Answers