Python/Pandas - Sum by row within hierarchical dataframe based on 3rd level row

Question

I'm new to python and currently undertaking a workplace project where we're moving excel based analysis to python for future proofing and i have hit a bit of a hurdle i'm struggling to resolve.

i have data with 3 hierarchical levels, in the below format, where the 1st level is a key internal metric when analysis business performance, 2nd level is the member/client's gender and 3rd is the year.

Suggested code to generate input dataframe:

col = pd.MultiIndex.from_product([['A', 'B'], ['F', 'M', 'DNC'],[2016,2017]])
df = pd.DataFrame([[1,2,6,16,20,19,17,9,4,2,8,19]], columns=col)

table markdown formatting doesnt appear to have pulled through correctly

|                       A                 |                 B     
 ----------------------------------------------------------------------------     
|     |     F     |     M     |     DNC   |     F     |     M     |    DNC  
 ----------------------------------------------------------------------------
|     |2016 |2017 |2016 |2017 |2016 |2017 |2016 |2017 |2016 |2017 |2016 |2017 
 ----------------------------------------------------------------------------  
|Div  |                                                               
|  0  |  1  |   2 |   6 | 16  |  20 | 19  |  17 |   9 |  4  | 2   | 8   |  10

i currently divisions (Div.) set as my index and is 59 entries in length. There are 6 different metrics and years span from 2010 to present. the data frame will be outputted into a xlsx file with each metric on its own tab and in turn this is to be used to generate line plots via seaborn.

i have the code every for the importing of the data once formatted and to generate the graphs.

i'm trying to add in an additional column into the 2nd level hierarchy to provide a total per year. columns i am struggling to add into the dataframe anotated by "*"

Again table markdown formatting challenge

| Metric |                         A                         |  
 ------------------------------------------------------------
| Gender |      F     |      M     |     DNC    | *Total*      |
 ------------------------------------------------------------
| Year   | 2016 | 2017| 2016 | 2017| 2016 | 2017| *2016* | *2017*|
 ------------------------------------------------------------
| Div.   |                                                   |  
|    1   |   1  | 2   | 6    |  16 | 20   | 19  | *27*   | *37*  |

i have tried using get_level_values, .groupby(level =[0,2] and a few other avenues where i am able to display the values however i havent been able to append them into the original dataframe.

df.groupby(level=[0,2], axis = 1).sum():

3rd table mark down

| Metric |      A     | 
 --------------------- 
| Year   | 2016 | 2017|
 ---------------------
| Div.   |            |  
|    1   | 27   | 37  |

I've also tried flattening the columns into the format Metric.Gender.Year e.g. (A.F.2016). However that then causes an error with the output when writing to multi-tabs based on the Metric. I'm not to familiar with how this code works as to how it can be amended to split the data based on the substring. I am thinking of of running a list of the divisions through a for loop utilising str.startswith:

for x in div_list:
    for group, data in df.groupby(df.columns.str.startswith(x), axis=1):
         data.to_excel (pd.ExcelWriter(filepath), group, merge_cells = False)
writer.close()

however have yet to test the above (am suspecting there to be some error in the code above given my knowledge level of python)

Once i have the above added, i will also be normalising the results by client count, i have this information by division and year. and assigning these results to a different dataframe to then be used to generate the same graphs as the original quey.

Div.	Count.F.2016	Count.M.2016	Count.Total.2016	..
1	100	90	190	....

i've only added the further two points (writing to excel and normalisation) for greater context of the query. however any advice you may have will be greatly appreciated.

I also fully suspect some of my workings/endeavours to not be the most pythonic and suspect my existing code has a lot of inefficiencies, however i'm keen to get the core work progressed so that the output can be used by key stakeholders while I then try to make the code more efficient, as this is a piece of work that is run semi-regularly (i have set up the script to import the underlying data and key parameters to allow for easy modification, (as few key parameters being hardcoded as possible)

Please read [how to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) — mozway, May 09 '23 at 12:23
Your question here is sort of confusing to me. What exactly are you trying to achieve? — Scott Boston, May 09 '23 at 13:21
sorry, about thatm. i'm trying to append the sum of the distinct years under each of the Metrics for all genders, so that i can then visual the data for the 3 categories and also total. — Fresh, May 09 '23 at 13:24

score 0 · Answer 1 · answered May 09 '23 at 13:27

It may be easier to work with the transposed version:

>>> df = df.transpose()
>>> df
             0
A F   2016   1
      2017   2
  M   2016   6
      2017  16
  DNC 2016  20
      2017  19
B F   2016  17
      2017   9
  M   2016   4
      2017   2
  DNC 2016   8
      2017  19

Then get the values as you said:

>>> sums = df.groupby(level=[0,2]).sum()
>>> sums
         0
A 2016  27
  2017  37
B 2016  29
  2017  30

And finally join the two dataframes:

>>> merge = pd.merge(left=df, right=sums, left_on=[df.index.get_level_values(0), df.index.get_level_values(2)], right_index=True)
>>> merge
           key_0  key_1  0_x  0_y
A F   2016     A   2016    1   27
  M   2016     A   2016    6   27
  DNC 2016     A   2016   20   27
  F   2017     A   2017    2   37
  M   2017     A   2017   16   37
  DNC 2017     A   2017   19   37
B F   2016     B   2016   17   29
  M   2016     B   2016    4   29
  DNC 2016     B   2016    8   29
  F   2017     B   2017    9   30
  M   2017     B   2017    2   30
  DNC 2017     B   2017   19   30

This, created some undesired duplicated columns, so just drop them and rename the other columns:

>> merge = merge.drop(columns=['key_0', 'key_1']).rename(columns={'0_x': 'div', '0_y': 'sum'})
>>> merge
            div  sum
A F   2016    1   27
  M   2016    6   27
  DNC 2016   20   27
  F   2017    2   37
  M   2017   16   37
  DNC 2017   19   37
B F   2016   17   29
  M   2016    4   29
  DNC 2016    8   29
  F   2017    9   30
  M   2017    2   30
  DNC 2017   19   30

If you really need to have the dataframe transposed as before, just do it:

>>> merge.transpose()
       A                             B                         
       F    M  DNC    F    M  DNC    F    M  DNC    F    M  DNC
    2016 2016 2016 2017 2017 2017 2016 2016 2016 2017 2017 2017
div    1    6   20    2   16   19   17    4    8    9    2   19
sum   27   27   27   37   37   37   29   29   29   30   30   30

Thanks DecowVR, looks good, would this work if i had div 1 ->59, as i'm looking to as a 'Total' column under near A, B.... my Division data goes from 1 to 59 and the metrics are A ->G. i just cut it back for ease of use — Fresh, May 09 '23 at 13:31

jezrael · Accepted Answer · 2023-05-09T13:39:04.007

0

Use concat with DataFrame.reorder_levels for DataFrame with same levels like original DataFrame, join to original and change order of columns by DataFrame.reindex:

mux = pd.MultiIndex.from_product([df.columns.levels[0], 
                                  ['F', 'M', 'DNC','Total'], 
                                  df.columns.levels[2]])

df1 = (pd.concat({'Total': df.groupby(level=[0,2], axis = 1).sum()}, axis=1)
         .reorder_levels([1,0,2], axis=1))

df = pd.concat([df, df1], axis=1).reindex(mux, axis=1)
print (df)
     A                                        B                           \
     F         M       DNC      Total         F         M       DNC        
  2016 2017 2016 2017 2016 2017  2016 2017 2016 2017 2016 2017 2016 2017   
0    1    2    6   16   20   19    27   37   17    9    4    2    8   19   

              
  Total       
   2016 2017  
0    29   30

edited May 09 '23 at 13:39

answered May 09 '23 at 13:33

jezrael

822,522
95
1,334
1,252

1

Thanks for the above, much appreciated worked a treat. – Fresh May 11 '23 at 11:04
@Fresh - some problem with my solution? – jezrael Jun 19 '23 at 11:06
@Fresh - Sorry, another solution create `Total` row, not new `Total` column. So reaacepted by accident? – jezrael Jun 19 '23 at 11:09
sorry originally solution worked perfectly, i had a few items marked unread in my inbox, so though would do the admin work and mark it as the answer as it was showing unanswred – Fresh Jun 19 '23 at 14:23
@Fresh - I see, dont worry, you can reaccapt back my answer if working like need. – jezrael Jun 20 '23 at 05:14

Python/Pandas - Sum by row within hierarchical dataframe based on 3rd level row

2 Answers2