I have a Dask DataFrame in the following format:
import pandas as pd
import numpy as np
import dask.dataframe as dd
df = pd.DataFrame({'ID': [1, 1, 2, 3], 'Value': ['ABC', 'ABD', 'CDE', 'DEF'], 'Date': ['2020-10-10', '2019-10-12', '2019-01-08', np.nan]})
ddf = dd.from_pandas(df, npartitions=2)
ddf['Date'] = dd.to_datetime(ddf['Date'], dayfirst=True) # Convert to proper dtype
ddf.head()
Output:
   | ID | Value | Date
---------------------------
 0 | 1  | ABC   | 2020-10-10
 1 | 1  | ABD   | 2019-10-12
 2 | 2  | CDE   | 2019-01-08
 3 | 3  | DEF   | NaT
I need to pick the first record in each group, where groups are formed by ID and each group is sorted ascending by Date. How can I achieve this in Dask, and also in pandas if possible?
Expected output:

ID | Value | Date
---------------------
 1 | ABD   | 2019-10-12
 2 | CDE   | 2019-01-08
 3 | DEF   | NaT
What I tried (following "Get the Row(s) which have the max count in groups using groupby"):
ddf.set_index('Date').drop_duplicates('ID').head()
# Error: TypeError: '<' not supported between instances of 'NoneType' and 'int'
ddf.loc[ddf.groupby('ID')['Date'].idxmax()].head()
# Error: ValueError: Not all divisions are known, can't align partitions. Please use `set_index` to set the index.
Please test your answer before posting, since several suggested approaches do not work as expected.