how to save categorical columns when doing groupby.median()?

Question

I have credit loan data, but the original df has many loan ids that can be under one customer. thus I need to group by client id in order to build client profile.

the original df:

contract_id',  'product_id','client_id','bal','age', 'gender',  'pledge_amount', 'branche_region

RZ13/25        000345         98023432  2300   32      M            4500           'west'


clients = df.groupby(by=['client_id']).median().reset_index()

This line completely removes important categories like gender, branch region! It groups by client_id and calculates median for NUMERIC columns. all other categorical columns are gone.

I wonder how to group by unique customers but also keep the categoricals..

In general you could include the rest of the columns in groupby ... `df.groupby(by=['client_id', 'gender'], as_index = False).median()` ...... . Assuming this would not alter the groups. You can also consult `DataFrame.agg` to create other functions for the category columns. https://stackoverflow.com/questions/12589481/multiple-aggregations-of-the-same-column-using-pandas-groupby-agg — ansev, May 18 '20 at 10:38

jezrael · Answer 1 · 2020-05-18T10:42:04.657

1

It is removed, because pandas remove nuisance columns.

For avoid it is necessary aggregate each column, here for numeric are aggregated means and for non numeric is returned first value:

f = lambda x: x.median() if np.issubdtype(x.dtype, np.number) else x.iat[0]
#another idea with join non numeric values
#f = lambda x: x.median() if np.issubdtype(x.dtype, np.number) else ','.join(x)
clients = df.groupby(by=['client_id']).agg(f)

If values of another non numeri columns are same per groups is possible add them to list for by parameter:

clients = df.groupby(by=['client_id', 'gender', 'branche_region']).median()

edited May 18 '20 at 10:42

answered May 18 '20 at 10:38

jezrael

822,522
95
1,334
1,252

oh now i see why they include other cols in "by=" – ERJAN May 18 '20 at 10:40
@ERJAN - yop, it depends of data, sometimes not possible, sometimes yes. – jezrael May 18 '20 at 10:41

how to save categorical columns when doing groupby.median()?

1 Answers1