3

I have a dataframe in Pandas like below -

import pandas as pd
data = {'Category': ['cat2','cat1','cat2','cat1','cat2','cat1','cat2','cat1','cat1','cat1','cat2'],
        'values': [1,2,3,1,2,3,1,2,3,5,1]}
my_data = pd.DataFrame(data)

I get the minimum, maximum and average values per category values and I also rename the column names as shown below -

# Get the minimum value for each column
min_agg = my_data.groupby('Category').min().reset_index()
min_agg.columns = [str(col) + '_min' for col in min_agg.columns]

   category_min  values_min
0   cat1            1
1   cat2            1

# Get the minimum value for each column
max_agg = my_data.groupby('Category').max().reset_index()
max_agg.columns = [str(col) + '_max' for col in max_agg.columns]


   category_max  values_max
0   cat1            5
1   cat2            3

# Get average value for each column
avg_agg = my_data.groupby('Category').mean().reset_index()
avg_agg.columns = [str(col) + '_avg' for col in avg_agg.columns]


   category_avg  values_avg
0   cat1           2.666667
1   cat2           1.600000


# count number of zeros
zeros_agg = my_data.groupby('Category')['values'].apply(lambda column: (column == 0).sum()).reset_index()
zeros_agg.columns = [str(col) + '_zeros' for col in zeros_agg.columns]



   Category_zeros  values_zeros
0   cat1                0
1   cat2                0

Now, I want to vertically stack these 4 dataframes to get a final dataframe that has 8 records, 2 each per dataframe which looks like this -

category            values
cat1_min            1
cat2_min            1
cat1_max            5
cat2_max            3
cat1_avg           2.666667
cat2_avg           1.600000
cat1_zeros          0
cat2_zeros          0

The first column in the output shows what aggregation is applied on which category and second column shows the corresponding value.

How do I do this using pandas ?

I tried

vertical_stack = pd.concat([min_agg, max_agg, avg_agg,zeros_agg], axis=0 , keys=['Category_min','Category_max','Category_avg','Category_zeros'])

but it did not give me expected output.

The aggregation shown here is only on 1 column but I have a bigger dataset and I am computing this aggregations on many columns.

Regressor
  • 1,843
  • 4
  • 27
  • 67

2 Answers2

2

You can pivot on category, with your desired aggregation functions, then stack the result:

r = df.pivot_table(
    columns='Category', values='values', aggfunc=['min', 'max', 'mean']) 
r.columns = r.columns.map('_'.join)  

r.T                                                                                                                                                  

             values
min_cat1   1.000000
min_cat2   1.000000
max_cat1   5.000000
max_cat2   3.000000
mean_cat1  2.666667
mean_cat2  1.600000

This is similar to

r = df.groupby('Category')['values'].agg(['min', 'max', 'mean']).stack() 
r.index = r.index.map('_'.join)

r
cat1_min     1.000000
cat1_max     5.000000
cat1_mean    2.666667
cat2_min     1.000000
cat2_max     3.000000
cat2_mean    1.600000
dtype: float64
cs95
  • 379,657
  • 97
  • 704
  • 746
  • Great answer. My aggregated dataframes also include counting zeroes and non-zeroes for multiple columns and hence i cannot put everything in `aggfunc` ? Is there any other way to tackle this ? – Regressor Nov 11 '19 at 22:34
  • @Regressor You can pass lambda function to the aggfunc, if you need to. – cs95 Nov 11 '19 at 22:34
  • @Regressor Do you have pandas 0.25? Try `r = df.groupby('Category')['values'].agg(min='min', max='max', mean='mean', zeros=lambda x: (~x.astype(bool)).sum()).stack(); r.index = r.index.map('_'.join); print(r)` – cs95 Nov 11 '19 at 22:44
  • these are excellent answers. do we have an option to perform same thing using `pd.concat` or `pd.merge` ? – Regressor Nov 11 '19 at 22:47
  • @Regressor pd.concat and pd.merge are both options for when you need to join or coalesce multiple data from two or more dataframes. Your question is one of aggregation, those are not appropriate functions for this :) You can look at my [merging post](https://stackoverflow.com/questions/53645882/pandas-merging-101/53645883#53645883) if you have some time to kill and would like to better understand how and where these functions apply for you. – cs95 Nov 11 '19 at 22:48
1

I will take advantage of describe

s=df.groupby('Category')['values'].describe()[['mean','max','min']].stack()
s.index=s.index.map('_'.join)
s
cat1_mean    2.666667
cat1_max     5.000000
cat1_min     1.000000
cat2_mean    1.600000
cat2_max     3.000000
cat2_min     1.000000
dtype: float64
BENY
  • 317,841
  • 20
  • 164
  • 234