0

I have credit loan data, but the original df has many loan ids that can be under one customer. thus I need to group by client id in order to build client profile.

the original df:

contract_id',  'product_id','client_id','bal','age', 'gender',  'pledge_amount', 'branche_region

RZ13/25        000345         98023432  2300   32      M            4500           'west'


clients = df.groupby(by=['client_id']).median().reset_index()

This line completely removes important categories like gender, branch region! It groups by client_id and calculates median for NUMERIC columns. all other categorical columns are gone.

I wonder how to group by unique customers but also keep the categoricals..

ERJAN
  • 23,696
  • 23
  • 72
  • 146
  • 1
    In general you could include the rest of the columns in groupby ... `df.groupby(by=['client_id', 'gender'], as_index = False).median()` ...... . Assuming this would not alter the groups. You can also consult `DataFrame.agg` to create other functions for the category columns. https://stackoverflow.com/questions/12589481/multiple-aggregations-of-the-same-column-using-pandas-groupby-agg – ansev May 18 '20 at 10:38

1 Answers1

1

It is removed, because pandas remove nuisance columns.

For avoid it is necessary aggregate each column, here for numeric are aggregated means and for non numeric is returned first value:

f = lambda x: x.median() if np.issubdtype(x.dtype, np.number) else x.iat[0]
#another idea with join non numeric values
#f = lambda x: x.median() if np.issubdtype(x.dtype, np.number) else ','.join(x)
clients = df.groupby(by=['client_id']).agg(f)

If values of another non numeri columns are same per groups is possible add them to list for by parameter:

clients = df.groupby(by=['client_id', 'gender', 'branche_region']).median()
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252