Calculated Columns in Multiindex

Question

I want to insert 2 columns into a dataframe.

Original Dataframe

card    auth       month   order_number
Amex     A        2017-11       1234
Visa     A        2017-12       2345
Amex     D        2017-12       3456

I want to break down auth_status by month. I used the following code:

bin_month_df = monthly_df.pivot_table(index='card', columns=['month', 'auth'],values='order_number', aggfunc='count')

Dataframe by Month

   month         2017-11      2017-12
    auth         A    D       A    D
    card
    mastercard  10    11     11    10
    amex        19    20     10    11
    visa        50    30     50    1

Target Outcome

I want to add columns for subtotal and auth_rate

       month                   2017-11                       2017-12
        auth         A    D   total    pct           A    D    total  pct
        card
        mastercard  10    11     21    .47           11    10   21    .52
        amex        19    20     39    .49           10    11   21    .47
        visa        50    30     80    .63           50    1    51    .98

I'm having trouble creating these columns.This link shows subtotals by rows, but it's not translating for me into columns or calculated columns.

Any help is appreciated!

score 1 · Answer 1 · answered Jun 10 '18 at 06:29

Use:

#create sum by first level of MultiIndex
df1 = df.sum(axis=1, level=0)
df1.columns = [df1.columns, ['total'] * len(df1.columns)]
print (df1)
month      2017-11 2017-12
             total   total
card                      
mastercard      21      21
amex            39      21
visa            80      51

#select by second level and divide
df2 = df.xs('A', axis=1, level=1).div(df1.xs('total', axis=1, level=1)).round(2)
df2.columns = [df2.columns, ['pct'] * len(df2.columns)]
print (df2)
month      2017-11 2017-12
               pct     pct
card                      
mastercard    0.48    0.52
amex          0.49    0.48
visa          0.62    0.98

#join all together, sort MultiIndex
df3 = pd.concat([df, df1, df2], axis=1).sort_index(axis=1)
print (df3)
month      2017-11                 2017-12                
auth             A   D   pct total       A   D   pct total
card                                                      
mastercard      10  11  0.48    21      11  10  0.52    21
amex            19  20  0.49    39      10  11  0.48    21
visa            50  30  0.62    80      50   1  0.98    51

#for custom order reindex by custom MultiIndex
c = df.columns.levels[1].tolist() + ['total', 'pct']
mux = pd.MultiIndex.from_product([df.columns.levels[0], c], names=df.columns.names)
df4 = df3.reindex(columns=mux)
print(df4)
month      2017-11                 2017-12                
auth             A   D total   pct       A   D total   pct
card                                                      
mastercard      10  11    21  0.48      11  10    21  0.52
amex            19  20    39  0.49      10  11    21  0.48
visa            50  30    80  0.62      50   1    51  0.98

jxc · Accepted Answer · 2018-06-10T03:13:10.030

Just tested on Pandas 0.17.0 and Python 2.7.5 and I can now understand why you asked me issues with reindex(axis=1) and the '*' before df1.columns.levels[1]. It was indeed version problems from both Pandas and Python. I modified the code to run through the older versions mentioned above and also fixed one potential bug in case multiple common descriptive stats need to be post-calculated in the same Pivot table. Going forward, it would be easier to mention the versions of software (if they are older versions) in your future posts, so there will be less misunderstanding:

import pandas as pd

str = """card    auth   month   order_number
Amex     A        2017-11       1234
Visa     A        2017-12       2345
Amex     D        2017-12       3416
MC       A        2017-12       3426
Visa     A        2017-11       3436
Amex     D        2017-12       3446
Visa     A        2017-11       3466
Amex     D        2017-12       3476
Visa     D        2017-11       3486
"""

# create dataframe from the above sample data
df = pd.read_table(pd.io.common.StringIO(str), sep='\s+')

# create the pivot_table using the method OP supplied
df1 = df.pivot_table(index='card', columns=['month', 'auth'], values='order_number', aggfunc='count')
print(df1)
# month 2017-11      2017-12     
# auth        A    D       A    D
# card                           
# Amex      1.0  NaN     NaN  3.0
# MC        NaN  NaN     1.0  NaN
# Visa      2.0  1.0     1.0  NaN

# create an empty dataframe with the same index/column layout as df1
# except the level-1 in columns
idx = pd.MultiIndex.from_product([df1.columns.levels[0], ['total', 'avg', 'std', 'pct']], names=df1.columns.names)
df2 = pd.DataFrame(columns=idx, index=df1.index).sort_index(axis=1)

print(df2)
# month 2017-11                 2017-12                
# auth      avg  pct  std total     avg  pct  std total
# card                                                 
# Amex      NaN  NaN  NaN   NaN     NaN  NaN  NaN   NaN
# MC        NaN  NaN  NaN   NaN     NaN  NaN  NaN   NaN
# Visa      NaN  NaN  NaN   NaN     NaN  NaN  NaN   NaN

# Calculate the common stats:
df2.loc[:,(slice(None),'total')] = df1.groupby(level=0, axis=1).sum().values
df2.loc[:,(slice(None),'avg')]   = df1.groupby(level=0, axis=1).mean().values
df2.loc[:,(slice(None),'std')]   = df1.groupby(level=0, axis=1).std().values

# join df2 with df1 and assign the result to df3 (can also overwrite df1): 
df3 = df1.join(df2).sort_index(axis=1)

# calculate `pct` which needs both a calculated field and an original field
# auth-rate = A / total
df3.loc[:,(slice(None),'pct')] = df3.groupby(level=0, axis=1)\
                                    .apply(lambda x: x.loc[:,(slice(None),'A')].values/x.loc[:,(slice(None),'total')].values) \
                                    .values

print(df3)
# month 2017-11                                    2017-12                      
# auth        A   D  avg       pct       std total       A   D avg pct std total
# card                                                                          
# Amex        1 NaN  1.0  1.000000       NaN     1     NaN   3   3 NaN NaN     3
# MC        NaN NaN  NaN       NaN       NaN   NaN       1 NaN   1   1 NaN     1
# Visa        2   1  1.5  0.666667  0.707107     3       1 NaN   1   1 NaN     1

# rounding if needed:
df3.loc[:,(slice(None),'pct')] = df3.loc[:,(slice(None),'pct')].round(decimals=2)

If you want to sort the level-1 columns in a specific order, you can do the reindex().

# create a ordered list of level-1 on columns 
column_level_1 = list(df1.columns.levels[1]) + ['total', 'avg', 'std', 'pct']
# create MultiIndex for columns and reindex_axis accordingly
midx = pd.MultiIndex.from_product([df1.columns.levels[0], column_level_1], names=df1.columns.names)
df3 = df3.reindex_axis(midx, axis=1)
print(df3)

# month 2017-11                                    2017-12                      
# auth        A   D total  avg       std       pct       A   D total avg std pct
# card                                                                          
# Amex        1 NaN     1  1.0       NaN  1.000000     NaN   3     3   3 NaN NaN
# MC        NaN NaN   NaN  NaN       NaN       NaN       1 NaN     1   1 NaN   1
# Visa        2   1     3  1.5  0.707107  0.666667       1 NaN     1   1 NaN   1

Thanks for the response. I'm getting a "reindex() got an unexpected keyword argument "axis" error using the code above. A second question is around the asterisk [*df1.columns...]. Can you explain? — jonjon, Jun 08 '18 at 18:04
Not sure if it's from version conflict(I am using the 0.23.0 now). Can you try `df1.reindex_axis(midx, axis=1)`? — jxc, Jun 08 '18 at 18:11
the issue from asterisk '*' (an Python3 feature) was indeed from Python version conflict. I modified some code logic and fixed a potential bug if you need more common stats like `mean`, `std` in the same Pivot table. it should be working now on Python 2.7.5 and Pandas 0.17.0. — jxc, Jun 10 '18 at 03:24

Calculated Columns in Multiindex

2 Answers2