7

Suppose you have a simple pandas dataframe with a MultiIndex:

df = pd.DataFrame(1, index=pd.MultiIndex.from_tuples([('one', 'elem1'), ('one', 'elem2'), ('two', 'elem1'), ('two', 'elem2')]),
                  columns=['col1', 'col2'])

Printed as a table:

           col1  col2
one elem1     1     1
    elem2     1     1
two elem1     1     1
    elem2     1     1

Question: How do you add a "Total" row to that Dataframe?

Expected output:

             col1  col2
one   elem1   1.0   1.0
      elem2   1.0   1.0
two   elem1   1.0   1.0
      elem2   1.0   1.0
Total         4.0   4.0

First attempt: Naive implementation

If I am just ignoring the MultiIndex and follow the standard way

df.loc['Total'] = df.sum()

Output:

              col1  col2
(one, elem1)     1     1
(one, elem2)     1     1
(two, elem1)     1     1
(two, elem2)     1     1
Total            4     4

It seems to be correct, but the MultiIndex is transformed to Index([('one', 'elem1'), ('one', 'elem2'), ('two', 'elem1'), ('two', 'elem2'), 'Total'], dtype='object')


Second attempt: Be explicit

df.loc['Total', :] = df.sum()

or (being frustrated and changing the axis just out of spite)

df.loc['Total', :] = df.sum(axis=1)

Output (the same for both calls):

             col1  col2
one   elem1   1.0   1.0
      elem2   1.0   1.0
two   elem1   1.0   1.0
      elem2   1.0   1.0
Total         NaN   NaN

The MultiIndex is not transformed, but the Total is wrong (NaN != 4).

above_c_level
  • 3,579
  • 3
  • 22
  • 37

2 Answers2

9

The solution

You have to remove the index of df.sum() and just use the values:

df.loc['Total', :] = df.sum().values

Output:

             col1  col2
one   elem1   1.0   1.0
      elem2   1.0   1.0
two   elem1   1.0   1.0
      elem2   1.0   1.0
Total         4.0   4.0

Why was the second attempt wrong?

The second attempt was almost correct. But df.sum() has the Index(['col1', 'col2'], dtype='object'). Consequently, pandas isn't able to match the index. The new index ('Total', '') is appended but without values.

But why did df.loc['Total', :] = df.sum(axis=1) also fail? It has the correct Multiindex. Pandas does exactly what you told it, i.e. sum the columns. So, df.sum(axis=1) gives you the following dataframe:

one  elem1    2
     elem2    2
two  elem1    2
     elem2    2

This dataframe can't be matched with the original df in any meaningful sense.

above_c_level
  • 3,579
  • 3
  • 22
  • 37
  • This solution has the same visual result for the values, but it bumps all the values in the dataframe to float (to start with they were int64). The `df.loc['Total'] = df.sum().values` does not do this (it has the tupleized index problem instead..) – creanion Dec 14 '22 at 14:08
1

Building on @above_c_level's accepted answer, here is a function:

  • Handles multi-index
  • Also preserves dtypes
def with_totals(df: pd.DataFrame) -> pd.DataFrame:
     '''Return new df with row & col totals named ∑. 
        * Preserves dtypes. 
        * Handles multi-index.
    '''
     df['∑'] = df.sum(axis=1)  # Row totals in new column
     D = df.dtypes
     df.loc['∑', :] = df.sum().values 
     return df.astype(D)

Usage:

with_totals(df)

Output:

            col1 col2   ∑
one elem1      1    1   2
    elem2      1    1   2
two elem1      1    1   2
    elem2      1    1   2
∑              4    4   8

Discussion

The key move remains the twin use of .loc and .values:

df.loc['Total', :] = df.sum().values

But dtypes can get lost. In my case my ints were converted to objects, which rendered as floats, making it hard to read. The best answer I found was to remember and re-apply the dtypes.


ctwardy
  • 172
  • 7