Extract first record of each group dataframe pandas

Question

I have an Excel file with more than 200,000 rows, and I would like to extract only the first row from each group (the groups are defined by the third column). I have read the dataset and sorted the values by two columns. Now I just need to create a new dataframe with the first row of each group.

This is a sample of the Excel file, and here's my attempt so far:

```python
import pandas as pd

df = pd.read_excel('Example.xlsx', sheet_name='Sheet1')
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(['F. No.', 'Date'], ascending=[True, False])
print(df.head())
```

So I need to extract the four columns from F. No. to Emp., keeping only the most recent record for each group.

How about [Pandas dataframe get first row of each group - Stack Overflow](https://stackoverflow.com/questions/20067636/pandas-dataframe-get-first-row-of-each-group)? — Ynjxsjmh, Apr 17 '22 at 13:52
Thanks a lot. I used this line and it worked well `new_df = df.groupby('F. No.').first()` but the column index changed!! — YasserKhalil, Apr 17 '22 at 13:56
You might also want to consider just adding `.drop_duplicates('F. No.')` after your sort — Jon Clements, Apr 17 '22 at 13:59
I tried `new_df = df.groupby('F. No.', as_index=False).first()` but the same problem as for the column index point. — YasserKhalil, Apr 17 '22 at 14:01
This line `.drop_duplicates('F. No.')` is amazing. Thanks a lot. — YasserKhalil, Apr 17 '22 at 14:02

SultanOrazbayev · Accepted Answer · 2022-04-17T15:20:07.663

This might help:

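```python
import pandas as pd

df = pd.read_excel('Example.xlsx', sheet_name='Sheet1')
df['Date'] = pd.to_datetime(df['Date'])
# Sort so that the most recent date comes first within each 'F. No.' group
df = df.sort_values(['F. No.', 'Date'], ascending=[True, False])
# Keep the first (most recent) row of each group
df_first = df.groupby(['F. No.'], as_index=False).head(1)
```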

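Passing `as_index=False` keeps the groupby column from becoming the index when you aggregate (for example with `.first()`). `.head(1)` is a filter rather than an aggregation, so it returns the original rows with their original index and column order either way. Note that `.head(1)` returns the most recent record only because the data was sorted by `Date` in descending order on the previous line.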
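For comparison, the three approaches mentioned in this thread can be tried on a small frame. This is only a minimal sketch: the values and the `Dept` column are invented for illustration, and only `F. No.`, `Date`, and `Emp.` are names taken from the question.

```python
import pandas as pd

# Hypothetical sample data; the real sheet layout is not shown in the question.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2022-01-05", "2022-03-01", "2022-02-10", "2022-02-20"]),
    "Dept": ["A", None, "B", "C"],   # invented column, with one missing value
    "F. No.": [101, 101, 102, 102],
    "Emp.": ["x", "y", "z", "w"],
})

# Most recent date first within each group, as in the question.
df = df.sort_values(["F. No.", "Date"], ascending=[True, False])

# 1) Filter: first row of each group, original index and column order kept.
by_head = df.groupby("F. No.").head(1)

# 2) Same rows without groupby: keep the first occurrence of each key.
by_dedup = df.drop_duplicates("F. No.")

# 3) Aggregation: first non-null value per column, so values from
#    different rows can be combined; the key becomes the first column.
by_first = df.groupby("F. No.", as_index=False).first()

print(by_head)
print(by_dedup)
print(by_first)
```

On this data, `.head(1)` and `.drop_duplicates('F. No.')` return the same whole rows, including the missing `Dept` value for group 101. `.first()` fills that cell with the first non-null `Dept` from an older row and moves `F. No.` to the front, which may be what the "column index changed" comment refers to.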
