
The data is simply a collection of ids and their login dates, like this:

import pandas as pd

data = pd.DataFrame({'id': ['a', 'b', 'c', 'b', 'c'],
                     'date': ['2017/12/10', '2017/12/10', '2017/12/11', '2017/12/12', '2017/12/12']})

id | date
---------------
 a | 2017/12/10
 b | 2017/12/10
 c | 2017/12/11
 b | 2017/12/12
 c | 2017/12/12

Each id may have multiple records. With pandas, if I want to single out only the most recent record for each id, I would do this:

most_recent = data.sort_values('date', ascending=False).groupby('id').head(1)
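
With the sample data above, that leaves one row per id (the order of the 2017/12/12 ties may vary):

id | date
---------------
 b | 2017/12/12
 c | 2017/12/12
 a | 2017/12/10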

How do I achieve the same thing with a PySpark DataFrame?

I've tried something like this:

data.orderBy(data.date, ascending=False).groupBy('id')

But groupBy expects an aggregation function to follow, and since I don't want to aggregate anything, I am stuck.

I realize I could convert the PySpark DataFrame to a pandas DataFrame and do it there, but I would like to know how to do it in PySpark.
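
From reading the docs, I suspect a window function might be the way to go. Here is a rough sketch of what I have in mind (assuming data is already a Spark DataFrame with id and date columns, and relying on the fact that yyyy/MM/dd strings sort chronologically):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# rank each id's records by date, newest first
w = Window.partitionBy('id').orderBy(F.col('date').desc())

# keep only the top-ranked (most recent) row per id
most_recent = (data
               .withColumn('rn', F.row_number().over(w))
               .filter(F.col('rn') == 1)
               .drop('rn'))

Is this the idiomatic way to do it, or is there something simpler?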

