Pandas Group by multiple columns and select non null last values from non-grouped cols

Question

Hi I have an excel sheet with fields like this

FName, LastName, DOB, BirthPlace, Address, Design

John,  Cash,          21-09-1986, Darwin, , , 
John,  Cash,             , ,   , 22 Howard Springs Darwin, 
John,  Cash,        ,          , 20 Howard Springs Darwin     , Supervisor

So I want to groupby Fname, LastName, DOB, Birth Place but then I want pick only "last" non null fields off of other columns like Address and Desig. Also please note that I won't get all the col values in every row though there can be many updates for the same person. Like in 2nd row above I don't have DOB, BirthPlace, Desig. I can have many such rows with only one col updated each item but names and dob fields present most of the time.

I tried to do it this way

 df2 = input_df.groupby(['Fname','LastName','BirthPlace','DOB']).agg({ 'Address':'last',
                             'Desig':'last' 
                              }).reset_index()

But I get none or nulls in the output.

Expected output for me should be : (address updated and designation set as well to last non null value in those columns)

John,  Cash,21-09-1986, Darwin, 20 Howard Springs Darwin  , Supervisor

jezrael · Answer 1 · 2021-10-14T04:51:54.483

0

If there is possible defined groups by only Fname,LastName solution should be simplify by GroupBy.last for last non NaNs values:

#if necessary replace empty spaces to NaNs
#input_df = input_df.replace('', np.nan)

#if there are spaces or empty strings only
#https://stackoverflow.com/a/47810911/2901002
input_df = input_df.replace(r'^\s+$', np.nan, regex=True)     
df2 = input_df.groupby(['FName','LastName'], as_index=False).last()
print(df2)
  FName LastName         DOB BirthPlace                   Address       Desig
0  John     Cash  21-09-1986     Darwin  20 Howard Springs Darwin  Supervisor

edited Oct 14 '21 at 04:51

answered Oct 13 '21 at 06:09

jezrael

822,522
95
1,334
1,252

Doesn't replace.(' ', np.nan) remove rows fully, I mean I will lose other columns that have data in them ? – Tabber Oct 13 '21 at 21:26
@Tabber - No, it not remove rows, only replace if only spaces to `NaN` values, so `last` get last non `NaN`s. – jezrael Oct 14 '21 at 04:52

EBDS · Answer 2 · 2021-10-13T07:25:50.443

This worked for me.

df.replace('',np.nan).groupby(['FN','LN'], as_index=False).agg(
    {'DOB':lambda x: x.dropna().tail(1),
        'BirthPlace': lambda x: x.dropna().tail(1), 
        'Address': lambda x: x.dropna().tail(1),
        'Desig': lambda x: x.dropna().tail(1) }
)

Pardon the use of tail(1) instead of last(). Don't know why it didn't work. But I think you get the picture. Extend your agg with a lambda function to get rid of the blanks first before doing last.

Pandas Group by multiple columns and select non null last values from non-grouped cols

2 Answers2