The data is simply a collection of ids and their login dates, like this:
import pandas as pd

data = pd.DataFrame({'id': ['a', 'b', 'c', 'b', 'c'],
                     'date': ['2017/12/10', '2017/12/10', '2017/12/11', '2017/12/12', '2017/12/12']})
id | date
---------------
a | 2017/12/10
b | 2017/12/10
c | 2017/12/11
b | 2017/12/12
c | 2017/12/12
Each id may have multiple records. With pandas, if I want to single out only the most recent record for each id, I would do this:
most_recent = data.sort_values('date', ascending=False).groupby('id').head(1)
How do I achieve the same thing with a PySpark DataFrame?
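For reference, this is roughly how I build the equivalent Spark DataFrame (a minimal sketch; the dates are kept as plain strings here, which may not be ideal):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Same records as the pandas example above
data = spark.createDataFrame(
    [('a', '2017/12/10'), ('b', '2017/12/10'), ('c', '2017/12/11'),
     ('b', '2017/12/12'), ('c', '2017/12/12')],
    ['id', 'date'])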
I've tried something like this:
data.orderBy(data.date, ascending=False).groupBy('id')
But because groupBy expects an aggregation function to follow, and I don't want to aggregate anything, I am stuck.
I realize I could convert the PySpark DataFrame to a pandas DataFrame, but I would like to know how to do this in PySpark.
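In case it helps frame an answer, the kind of approach I have been looking at is a window function: rank each id's records by date (newest first) and keep only the first row per id. A rough sketch of what I mean, though I am not sure this is the idiomatic way:

from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

# Rank each id's records by date, newest first
w = Window.partitionBy('id').orderBy(col('date').desc())

most_recent = (data
               .withColumn('rn', row_number().over(w))
               .filter(col('rn') == 1)  # keep only the newest record per id
               .drop('rn'))

Is something along these lines the right way to do it, or is there a simpler equivalent of the pandas sort/groupby/head pattern?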