1

For this exercise I have a dataframe containing orders numbers, item ids, company ids, countries and count.

In practice I have many more columns. Now I want to automate some calculations for analysis purposes, by generating the mean, sum, etc. of the order count for every column I select. E.g. if I select the item_id and company_id I want the mean of those as separate columns. Preferably with generated column headers, e.g. item_id_mean and company_id_mean.

I know how to group by one or multiple columns. This works manually perfect for every selected column. But now want to run this for a larger set of columns based on a column selection (through a separate file)

Question: How to group by every single column from a list of columns headers individually and calculate the mean (or sum)? Also, how to generate the respective headers for these new columns?

My testcode:

df_columnselect = pd.DataFrame({'Column':['ordernumber','item_id', 'company_id','country', 'count'],
                            'Selected':['no','yes','yes','no','no']},
                            columns=['Column','Selected’])

print(df_columnselect)

    Column Selected
0  ordernumber       no
1      item_id      yes
2   company_id      yes
3      country       no
4        count       no

Selected_Cols=df_columnselect.loc[df_columnselect.Selected=='yes', 'Column']

print(Selected_Cols)

1       item_id
2    company_id

df_orders = pd.DataFrame({'ordernumber':[1,2,3,4,5,6,7,8,9,10],
                    'item_id': [1, 1, 2, 2, 3, 4, 4, 5, 7, 8],
                      'company_id':['a','b','c','a','b','c','a','b','c','a'],
                    'country':[0, 1, 0, 1, 1, 0, 1, 1, 1, 0],
                    'count':[27, 49, 3, 1, 6, 8, 14, 1, 1, 6] }, 
                     columns=['ordernumber','item_id', 'company_id','country', 'count'])

I've tried te following, based on your input, but the first only generates 1 column and calculates the wrong mean. Whereas the second is the correct mean, but does not automate the generation of new columns based on the selection (Selected_Cols)

df_orders['new_col'] = df_orders.groupby(Selected_Cols)['count'].transform('mean')
df_orders['new_col2'] = df_orders.groupby('item_id')['count'].transform('mean')

print(df_orders)

    ordernumber  item_id company_id  country  count  new_col  new_col2
0            1        1          a        0     27      NaN        38
1            2        1          b        1     49     49.0        38
2            3        2          c        0      3      3.0         2
3            4        2          a        1      1      NaN         2
4            5        3          b        1      6      NaN         6
5            6        4          c        0      8      NaN        11
6            7        4          a        1     14      NaN        11
7            8        5          b        1      1      NaN         1
8            9        7          c        1      1      NaN         1
9           10        8          a        0      6      NaN         6

This should actually be, based on the selection in Selected_Cols, two new columns (or more if more selected). I've excluded the country column here, but that is not the issue.

   ordernumber  item_id company_id  count  item_id_mean  company_id_mean
0            1        1          a     27            38            12.00
1            2        1          b     49            38            18.67
2            3        2          c      3             2             4.00
3            4        2          a      1             2            12.00
4            5        3          b      6             6            18.67
5            6        4          c      8            11             4.00
6            7        4          a     14            11            12.00
7            8        5          b      1             1            18.67
8            9        7          c      1             1             4.00
9           10        8          a      6             8            12.00

I've looked around in the forums and tested some things, but could not find any similar discussions that tackle the generation of new columns (mean, sum, etc.) based on a selection of columns.

So basically I have two input files (csv or whatever) 1. The original order data 2. The selection of columns (based on column headers)

And I want to export one file (again cvs or whatever) 1. The original order data enriched with the mean of the count for every selected column as individual group (i.e. new column per grouping based on the selection)

Solution

def FUNCTION(x):
return df_orders.groupby(x)['count'].transform('mean')

df_orders = df_orders.join(df_orders[(Selected_Cols)].apply(FUNCTION).add_suffix('_mean'))

Derived from: Creating new column names from a list of strings in a loop

johntkrpt
  • 41
  • 1
  • 3
  • Try to look at `df.groupby('item_id').count.transform('mean')` – BENY Mar 13 '18 at 12:58
  • So need `df_orders['item_id_mean'] = df_orders.groupby('item_id')['count'].transform('mean')` and `df_orders['company_id_mean'] = df_orders.groupby('company_id')['count'].transform('mean')` – jezrael Mar 13 '18 at 13:53
  • Hi Jezrael, thanks and that works indeed if I wanted it hardcoded in Python. But I'm looking for a solution that can iterate over the columns I've selected (can be easily 30-100 columns and changing) This way I can write a simple program that picks up the Selected_Cols DataFrame and than generates a mean of the count for every column that is selected, not just the two I hardcoded in Python. Makes it easier to run over multiple files and selections for non Python-users. see also last edited paragraph. – johntkrpt Mar 13 '18 at 14:03
  • Added solution I needed, thanks for the input. Works great! – johntkrpt Mar 13 '18 at 15:24

0 Answers0