Let's say I have a table with 3 fields: client, city, sales, with sales being a float.
+--------+--------+-------+
| client | city | sales |
+--------+--------+-------+
| a | NY | 0 |
| a | LA | 1 |
| a | London | 2 |
| b | NY | 3 |
| b | LA | 4 |
| b | London | 5 |
+--------+--------+-------+
For each client, I would like to show what is the city with the greatest sales, and what those sales are, ie I want this output:
+--------+--------+-------+
| client | city | sales |
+--------+--------+-------+
| a | London | 2 |
| b | London | 5 |
+--------+--------+-------+
Any suggestions?
This table can be generated with:
df=pd.DataFrame()
df['client']= np.repeat( ['a','b'],3 )
df['city'] = np.tile( ['NY','LA','London'],2)
df['sales']= np.arange(0,6)
This is wrong because it calculates the 'maximum' of the city, and shows NY because it considers that N > L
max_by_id = df.groupby('client').max()
I can first create a dataframe with the highest sales, and then merge it with the initial dataframe to retrieve the city; it works, but I was wondering if there is a faster / more elegant way?
out = pd.merge( df, max_by_id, how='inner' ,on=['client','sales'] )
I remember doing something similar with cross apply statements in SQL but wouldn't know how to run a Pandas equivalent.