0

I generate summary statistics for each group in the data while using multiple grouping criteria. The data:

import pandas as pd
d = {'Fruit': ['Apple', 'Apple','Apple','Orange','Orange'], 
     'City': ['Rome', 'Rome', 'London', 'London','London' ],
     'Value': [1, 2, 3, 4, 5 ]}
df = pd.DataFrame(data=d)
print(df)

    Fruit    City  Value 
0  Apple   Rome    1    
1  Apple   Rome    2    
2  Apple   London  3    
3  Orange  London  4    
4  Orange  London  5    

The "groupby" command returns the sums only for the highest level.

keys=['Fruit','City']

df.groupby(keys).agg(Total_Value=('Value', 'sum'))  

               Total_Value
Fruit  City               
Apple  London  3          
       Rome    3          
Orange London  9          

I would like to apply the sum function to the lower levels as well. Preferably the results for all levels should be displayed on the same table:

               Total_Value
Fruit  City
               15
Apple          6
Orange         9
       London  12
       Rome    3
Apple  London  3          
Apple  Rome    3          
Orange London  9 

Is there an easy way of generating such a table? It might be possible to use the Multiindex function for this purpose, yet I am not sure how it should be applied.

Thanks a lot

PS: In the example I have used the "sum" function. It is relatively easy to calculate the sums for lower levels. I am looking for a generic answer which is applicable to all functions, not only to sum. (e.g. mean function, or a lambda function etc.)

PS2: Something like the following would also work:

               Total_Value Level
Fruit  City               
                        15 0
       Rome              3 1
       London           12 1
Apple                    6 1
Orange                   9 1
Apple  Rome              3 2
       London            3 2
Orange London            9 2

In this case the "Total_Value" of "London" is immediately clear. It is not necessary to check the order of the rows.

  • Thanks a lot for your quick answer. I think in this case I would need an additional indicator which shows the level of each row. Because there are two entries for "London": one as a "parent" category, the other one to describe "Apple & London". What would be the best solution in your opinion? – user2201133 May 20 '20 at 12:42
  • Can you edit question? Why is necesary this indicator? – jezrael May 20 '20 at 12:49
  • There are two values for London: "12" and "3". You can infer which one is which by looking at the order of the rows in the table, but it is not immediately clear. I will try to create an additional column to be able to identify the level. "12" will be level 1, "3" will be level 2. Your answer was already quite helpful. – user2201133 May 20 '20 at 12:56
  • So need new new column for identify it? with values like `All, Fruit, Fruit, City, City, Fruit & City,Fruit & City,Fruit & City` ? – jezrael May 20 '20 at 13:04
  • Yes, that would be perfect. I have edited the post. – user2201133 May 20 '20 at 13:18
  • edited answer with this values. – jezrael May 20 '20 at 13:34

2 Answers2

1

You can create all possible combinations of keys, then convert to dicts for replace to same value, here empty string and pass to DataFrame.assign, then concat together and aggregate sum or any another function like need:

keys = ['Fruit','City']

#https://stackoverflow.com/a/5898031
from itertools import chain, combinations
def all_subsets(ss):
    return chain(*map(lambda x: combinations(ss, x), range(0, len(ss)+1)))

dfs = [df.assign(**dict.fromkeys(x, '')) for x in all_subsets(keys)]

#swapped order of list of DataFrames
df1 = (pd.concat(dfs[::-1], sort=False)
         .groupby(keys, sort=False)
         .agg(Total_Value=('Value', 'sum')))
print (df1)
               Total_Value
Fruit  City               
                        15
Apple                    6
Orange                   9
       Rome              3
       London           12
Apple  Rome              3
       London            3
Orange London            9

Alternative:

keys = ['Fruit','City']

#swapped order in range
from itertools import chain, combinations
def all_subsets(ss):
    return chain(*map(lambda x: combinations(ss, x), range(len(ss)+1, -1, -1)))

dfs = [df.assign(**dict.fromkeys(x, '')) for x in all_subsets(keys)]

df = pd.concat(dfs, sort=False).groupby(keys, sort=False).agg(Total_Value=('Value', 'sum'))  
print (df)
               Total_Value
Fruit  City               
                        15
       Rome              3
       London           12
Apple                    6
Orange                   9
Apple  Rome              3
       London            3
Orange London            9

EDIT:

d = {'Fruit': ['Apple', 'Apple','Apple','Orange','Orange'], 
     'City': ['Rome', 'Rome', 'London', 'London','London' ],
     'Value': [1, 2, 3, 4, 5 ]}
df = pd.DataFrame(data=d)
print(df)

keys = ['Fruit','City']

#https://stackoverflow.com/a/5898031
from itertools import chain, combinations
def all_subsets(ss):
    return chain(*map(lambda x: combinations(ss, x), range(0, len(ss)+1)))

dfs = [df.assign(**dict.fromkeys(x, '')) for x in all_subsets(keys)]

#swapped order of list of DataFrames
df1 = (pd.concat(dfs[::-1], sort=False)
         .groupby(keys, sort=False)
         .agg(Total_Value=('Value', 'sum')))

Last you can convert MultiIndex to DataFrame by MultiIndex.to_frame, comapre for not equal empty string and using DataFrame.dot trick create Series by values of columns if not '':

df2 = df1.index.to_frame()
s = df2.ne('').dot(df2.columns + ' & ').str.strip('& ').replace('', 'All')
print (s)
Fruit   City  
                           All
Apple                    Fruit
Orange                   Fruit
        Rome              City
        London            City
Apple   Rome      Fruit & City
        London    Fruit & City
Orange  London    Fruit & City
dtype: object

df1['level'] = s
print (df1)
               Total_Value         level
Fruit  City                             
                        15           All
Apple                    6         Fruit
Orange                   9         Fruit
       Rome              3          City
       London           12          City
Apple  Rome              3  Fruit & City
       London            3  Fruit & City
Orange London            9  Fruit & City
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
0

I would do it this way:

import pandas as pd
d = {'Fruit': ['Apple', 'Apple','Apple','Orange','Orange'], 
     'City': ['Rome', 'Rome', 'London', 'London','London' ],
     'Value': [1, 2, 3, 4, 5 ]}
df = pd.DataFrame(data=d)

keys=['Fruit','City']

df_sum = df.groupby(keys).agg(Total_Value=('Value', 'sum')) 
df_sub = df_sum.sum(level=0).assign(City='Fruit SubTotal').set_index('City', append=True)
df_grand = df_sum.sum().to_frame().T.rename_axis('Fruit').rename(index={0:'Grand'}).assign(City='Total').set_index('City', append=True)

df_out = pd.concat([df_sum, df_sub, df_grand])
print(df_out)

Output:

                       Total_Value
Fruit  City                       
Apple  London                    3
       Rome                      3
Orange London                    9
Apple  Fruit SubTotal            6
Orange Fruit SubTotal            9
Grand  Total                    15
Scott Boston
  • 147,308
  • 15
  • 139
  • 187