The data is simply a collection of ids and their login dates, like this:
import pandas as pd

data = pd.DataFrame({'id': ['a', 'b', 'c', 'b', 'c'],
                     'date': ['2017/12/10', '2017/12/10', '2017/12/11', '2017/12/12', '2017/12/12']})
id | date
---------------
a | 2017/12/10
b | 2017/12/10
c | 2017/12/11
b | 2017/12/12
c | 2017/12/12
Each id may have multiple records. With pandas, if I want to single out only the most recent record for each id, I would do this:
most_recent = data.sort_values('date', ascending=False).groupby('id').head(1)
How do I achieve the same thing with a PySpark DataFrame?
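For reference, this is roughly how I build the equivalent Spark DataFrame (a minimal sketch; the dates are kept as plain strings here, which may not be ideal):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Same records as the pandas example above
data = spark.createDataFrame(
    [('a', '2017/12/10'), ('b', '2017/12/10'), ('c', '2017/12/11'),
     ('b', '2017/12/12'), ('c', '2017/12/12')],
    ['id', 'date'])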
I've tried something like this:
data.orderBy(data.date, ascending=False).groupBy('id')
But because groupBy expects an aggregation function to follow, and I don't want to aggregate anything, I am stuck.
I realize I could convert the PySpark DataFrame to a pandas DataFrame, but I would like to know how to do this in PySpark.
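In case it helps frame an answer, the kind of approach I have been looking at is a window function: rank each id's records by date (newest first) and keep only the first row per id. A rough sketch of what I mean, though I am not sure this is the idiomatic way:

from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

# Rank each id's records by date, newest first
w = Window.partitionBy('id').orderBy(col('date').desc())

most_recent = (data
               .withColumn('rn', row_number().over(w))
               .filter(col('rn') == 1)  # keep only the newest record per id
               .drop('rn'))

Is something along these lines the right way to do it, or is there a simpler equivalent of the pandas sort/groupby/head pattern?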