I would like to filter my dataframe to keep only the rows with the maximum value in the column some_date.

df.filter(F.col('some_date') == F.max('some_date')) fails, because max is an aggregate function and cannot be used in a filter without an aggregation.

I also tried to get the max_date value first so I could use it in a filter: max_date = df.groupBy().max('some_date'), but that failed with: "some_date" is not a numeric column. Aggregation function can only be applied on a numeric column.

In SQL, I would achieve this with a subquery (to the effect of where some_date = (select max(some_date) from ...)), but I thought there would be a better way to structure it in PySpark.
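
For reference, the closest I can get to the SQL subquery is to compute the max first and then filter with it (a sketch, assuming a DataFrame df with a some_date column; note that F.max, unlike GroupedData.max, also accepts date columns):

```python
from pyspark.sql import functions as F

# Compute the max of some_date as a one-row DataFrame, then pull the
# scalar value to the driver. F.max works on date/timestamp columns,
# which is where GroupedData.max('some_date') failed above.
max_date = df.agg(F.max("some_date")).collect()[0][0]

# Filter against the collected scalar.
result = df.filter(F.col("some_date") == max_date)
```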

  • df.groupby().agg({"some_date":"max"}).show() ?? – frank Nov 30 '18 at 10:48
  • I think you want an aggregate window over dates https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-functions-windows.html – OneCricketeer Nov 30 '18 at 10:50
  • 1
    Possible duplicate of [how to get max(date) from given set of data grouped by some fields using pyspark?](https://stackoverflow.com/q/38377894/10465355) – 10465355 Nov 30 '18 at 10:56
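
For completeness, a sketch of the window approach suggested in the comments above (assuming no grouping columns; an empty partitionBy() defines one global window, and Spark will warn that all rows are moved to a single partition):

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# One global window over the whole DataFrame; add columns to
# partitionBy(...) if you want the max per group instead.
w = Window.partitionBy()

result = (df
          .withColumn("max_date", F.max("some_date").over(w))
          .filter(F.col("some_date") == F.col("max_date"))
          .drop("max_date"))
```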

0 Answers