6

I have dataframe that I am trying to group by which looks like this

Cust_ID Store_ID month lst_buy_dt1  purchase_amt    
 1       20       10     2015-10-07  100
 1       20       10     2015-10-09  200
 1       20       10     2015-10-20  100

I need the maximum of ls_buy_dt and maximum or purchase amount for each cust_ID, Store_ID combination for each month in a different dataframe. Sample ouput:

Cust_ID Stored_ID month max_lst_buy_dt tot_purchase_amt
 1       20        10      2015-10-20     400

My code is below .

aggregations = {
    'lst_buy_dt1': { # Get the max purchase date across all purchases in a month
    'max_lst_buy_dt': 'max',       
    },
    'purchase_amt': {     # Sum the purchases 
    'tot_purchase': 'sum',   # Find the max, call the result "max_date"
    }
}

grouped_at_Cust=metro_sales.groupby(['cust_id','store_id','month']).agg(aggregations).reset_index()

I am able to get the right aggregations . However the data frame contains an additional index in columns which I am not able to get rid of. Unable to show it, but here is the result from

list(grouped_at_Cust.columns.values)

[('cust_id', ''),
('store_id', ''),
('month', ''),
('lst_buy_dt1', 'max_lst_buy_dt'),
('purchase_amt', 'tot_purchase')]

Notice the hierarchy in the last 2 columns. How to get rid of it? I just need the columns max_lst_buy_dt and tot_purchase.

IanS
  • 15,771
  • 9
  • 60
  • 84
sourav
  • 179
  • 1
  • 1
  • 14

1 Answers1

8

Edit: based on your comment, you can simply drop the first level of the columns index. For instance with a more complicated aggregation:

aggregations = {
    'lst_buy_dt1': {
        'max_lst_buy_dt': 'max',       
        'min_lst_buy_dt': 'min',       
    },
    'purchase_amt': {
        'tot_purchase': 'sum',
    }
}
grouped_at_Cust = metro_sales.groupby(['cust_id', 'store_id', 'month']).agg(aggregations).reset_index()
grouped_at_Cust.columns = grouped_at_Cust.columns.droplevel(0)

Output:

             tot_purchase min_lst_buy_dt max_lst_buy_dt
0   cust_id           100     2015-10-07     2015-10-07
1     month           100     2015-10-20     2015-10-20
2  store_id           200     2015-10-09     2015-10-09

Original answer

I think your aggregations dictionary is too complicated. If you follow the documentation:

agg = {
    'lst_buy_dt1': 'max',       
    'purchase_amt': 'sum',
}
metro_sales.groupby(['cust_id','store_id','month']).agg(agg).reset_index()
Out[19]: 
      index  purchase_amt lst_buy_dt1
0   cust_id           100  2015-10-07
1     month           100  2015-10-20
2  store_id           200  2015-10-09

All you need now is to rename the columns of the result:

grouped_at_Cust.rename(columns={
    'lst_buy_dt1': 'max_lst_buy_dt', 
    'purchase_amt': 'tot_purchase'
})
IanS
  • 15,771
  • 9
  • 60
  • 84
  • Thanks. I wrote the dictionary that way because I may need multiple aggregations. for e.g. - I will need both max and min of lst_buy_dt1. What would be the best way to handle that ? – sourav Sep 19 '16 at 10:42
  • Thanks Ian. But my question is what to do if I need 2 aggregates (min and max ) of the same column ? Say in this case I need both max and min for lst_buy_dt1 .In that case i guess, i'll have to revert to the dictionary I created. Let me know if there are better ways to do it – sourav Sep 19 '16 at 11:20
  • @sourav, that was my point, my edit works with your dictionary. I have amended my question to make it clearer. – IanS Sep 19 '16 at 11:24
  • 1
    As a word of caution, `columns.droplevel(level=0)` will remove other column names at level 0, so if you are only performing aggregation on some columns but have other columns you will include (such as if you are using a groupby and want to reference each index level as it's own column, say for plotting later), using this method will require extra steps to add those names back in. In that case, using a method detailed here may be more appropriate: https://stackoverflow.com/questions/14507794/python-pandas-how-to-flatten-a-hierarchical-index-in-columns – Brendan Oct 30 '18 at 18:57