I have a Dask DataFrame in the following format:
import pandas as pd
import numpy as np
import dask.dataframe as dd
df = pd.DataFrame({'ID': [1, 1, 2, 3], 'Value': ['ABC', 'ABD', 'CDE', 'DEF'], 'Date': ['2020-10-10', '2019-10-12', '2019-01-08', np.nan]})
ddf = dd.from_pandas(df, npartitions=2)
ddf['Date'] = dd.to_datetime(ddf['Date'], dayfirst=True) # Convert to proper dtype
ddf.head()
Output:
   | ID | Value | Date
---------------------------
 0 | 1  | ABC   | 2020-10-10
 1 | 1  | ABD   | 2019-10-12
 2 | 2  | CDE   | 2019-01-08
 3 | 3  | DEF   | NaT
I need to pick the first record in each group, where groups are formed by ID and each group is sorted ascending by Date. How can I achieve this in Dask, and also in pandas if possible?
Expected output:

ID | Value | Date
---------------------
 1 | ABD   | 2019-10-12
 2 | CDE   | 2019-01-08
 3 | DEF   | NaT
What I tried (following "Get the Row(s) which have the max count in groups using groupby"):
ddf.set_index('Date').drop_duplicates('ID').head()
# Error: TypeError: '<' not supported between instances of 'NoneType' and 'int'
ddf.loc[ddf.groupby('ID')['Date'].idxmax()].head()
# Error: ValueError: Not all divisions are known, can't align partitions. Please use `set_index` to set the index.
Please test your answer before posting, since several suggested approaches do not work as expected.