For this exercise I have a dataframe containing orders numbers, item ids, company ids, countries and count.
In practice I have many more columns. Now I want to automate some calculations for analysis purposes, by generating the mean, sum, etc. of the order count for every column I select. E.g. if I select the item_id
and company_id
I want the mean of those as separate columns. Preferably with generated column headers, e.g. item_id_mean
and company_id_mean
.
I know how to group by one or multiple columns. This works manually perfect for every selected column. But now want to run this for a larger set of columns based on a column selection (through a separate file)
Question: How to group by every single column from a list of columns headers individually and calculate the mean (or sum)? Also, how to generate the respective headers for these new columns?
My testcode:
df_columnselect = pd.DataFrame({'Column':['ordernumber','item_id', 'company_id','country', 'count'],
'Selected':['no','yes','yes','no','no']},
columns=['Column','Selected’])
print(df_columnselect)
Column Selected
0 ordernumber no
1 item_id yes
2 company_id yes
3 country no
4 count no
Selected_Cols=df_columnselect.loc[df_columnselect.Selected=='yes', 'Column']
print(Selected_Cols)
1 item_id
2 company_id
df_orders = pd.DataFrame({'ordernumber':[1,2,3,4,5,6,7,8,9,10],
'item_id': [1, 1, 2, 2, 3, 4, 4, 5, 7, 8],
'company_id':['a','b','c','a','b','c','a','b','c','a'],
'country':[0, 1, 0, 1, 1, 0, 1, 1, 1, 0],
'count':[27, 49, 3, 1, 6, 8, 14, 1, 1, 6] },
columns=['ordernumber','item_id', 'company_id','country', 'count'])
I've tried te following, based on your input, but the first only generates 1 column and calculates the wrong mean. Whereas the second is the correct mean, but does not automate the generation of new columns based on the selection (Selected_Cols)
df_orders['new_col'] = df_orders.groupby(Selected_Cols)['count'].transform('mean')
df_orders['new_col2'] = df_orders.groupby('item_id')['count'].transform('mean')
print(df_orders)
ordernumber item_id company_id country count new_col new_col2
0 1 1 a 0 27 NaN 38
1 2 1 b 1 49 49.0 38
2 3 2 c 0 3 3.0 2
3 4 2 a 1 1 NaN 2
4 5 3 b 1 6 NaN 6
5 6 4 c 0 8 NaN 11
6 7 4 a 1 14 NaN 11
7 8 5 b 1 1 NaN 1
8 9 7 c 1 1 NaN 1
9 10 8 a 0 6 NaN 6
This should actually be, based on the selection in Selected_Cols, two new columns (or more if more selected). I've excluded the country column here, but that is not the issue.
ordernumber item_id company_id count item_id_mean company_id_mean
0 1 1 a 27 38 12.00
1 2 1 b 49 38 18.67
2 3 2 c 3 2 4.00
3 4 2 a 1 2 12.00
4 5 3 b 6 6 18.67
5 6 4 c 8 11 4.00
6 7 4 a 14 11 12.00
7 8 5 b 1 1 18.67
8 9 7 c 1 1 4.00
9 10 8 a 6 8 12.00
I've looked around in the forums and tested some things, but could not find any similar discussions that tackle the generation of new columns (mean, sum, etc.) based on a selection of columns.
So basically I have two input files (csv or whatever) 1. The original order data 2. The selection of columns (based on column headers)
And I want to export one file (again cvs or whatever) 1. The original order data enriched with the mean of the count for every selected column as individual group (i.e. new column per grouping based on the selection)
Solution
def FUNCTION(x):
return df_orders.groupby(x)['count'].transform('mean')
df_orders = df_orders.join(df_orders[(Selected_Cols)].apply(FUNCTION).add_suffix('_mean'))
Derived from: Creating new column names from a list of strings in a loop