1

I have a DataFrame overview (shortened for this question).

>>> import pandas as pd
>>> import numpy as np
>>>
>>> index = pd.Index(['Abbott PLC', 'Abbott, Rogahn and Bednar'], dtype='object', name='Account Name')
>>> columns = pd.MultiIndex(levels=[['total', 'quantity'], ['count', 'sum']], labels=[[0, 0, 1], [1, 0, 1]])
>>> values = np.array([[755.44,   1.  ,  19.  ], [615.6 ,   1.  ,  18.  ]])
>>> 
>>> overview = pd.DataFrame(values, columns=columns, index=index)
>>> overview
                            total       quantity
                              sum count      sum
Account Name                                    
Abbott PLC                 755.44   1.0     19.0
Abbott, Rogahn and Bednar  615.60   1.0     18.0

The column names are weird because in my actual code, this result is derived from a DataFrame df with the following grouping operation.

aggregators = {'total': ['sum', 'count'], 'quantity': 'sum'}
overview = df.groupby('Account Name')['total', 'quantity'].agg(aggregators)

I want to rename the columns of overview, the desired result looks like this:

                            gross  checkouts  items
Account Name                                       
Abbott PLC                 755.44        1.0   19.0
Abbott, Rogahn and Bednar  615.60        1.0   18.0

I cannot simply use overview.columns = ['gross', 'checkouts', 'items'] or the answers from this similar question because after using agg the columns are in arbitrary order. (Applying rename also seems tricky because of the duplicate name 'sum'.)

Currently, I am working around the problem by using an OrderedDict for aggregators, thus overview having a deterministic column oder. But assuming the creation of overview cannot be fixed upstream, how would I arrive at my desired result elegantly?

jpp
  • 159,742
  • 34
  • 281
  • 339
timgeb
  • 76,762
  • 20
  • 123
  • 145
  • 1
    If you flatten and join your columns , you'll be left with something like `Index(['total_sum', 'total_count', 'quantity_sum'], dtype='object')`, which will be unique, then you can rename off of that if I'm understanding you correctly. – user3483203 Aug 13 '18 at 06:46
  • @user3483203 that sounds promising, feel free to write an answer teaching me the flatten and join process. – timgeb Aug 13 '18 at 06:47
  • 1
    Try using `overview.columns = ['_'.join(col).strip() for col in overview.columns.values]` and see if that's the route you want to take, if it is, I'll write it up as an answer. – user3483203 Aug 13 '18 at 06:50
  • @user3483203 that seems to work fine. I'm wondering if there's a more "pandas native" solution but for now it does the trick! – timgeb Aug 13 '18 at 07:02
  • 1
    I don't know of a better way builtin to pandas, but glad I could help! – user3483203 Aug 13 '18 at 07:08
  • how about `overview.set_axis(1,overview.columns.ravel()).rename(columns={('total','sum'):'gross',('total','count'):'checkouts'})` – Ian Aug 13 '18 at 08:18
  • @Ian why not post this as an answer? – timgeb Aug 13 '18 at 11:48

1 Answers1

0

Your dataframe has a MultiIndex as columns. There are few ways you can flatten into a regular index:

pd.Index.map

overview.columns = overview.columns.map('_'.join)

list comprehension + f-strings

With Python 3.6+ you can use formatted string literals (PEP 498):

overview.columns = [f'{i}_{j}' for i, j in overview.columns]

list comprehension + str.format / str.join

For versions <3.6, you can use str.format or str.join:

overview.columns = ['{i}_{j}'.format(i, j) for i, j in overview.columns]

overview.columns = list(map('_'.join, overview.columns))

To only rename, you can use a dictionary mapping directly:

d = {('total', 'sum'): 'gross', ('total', 'count'): 'checkouts',
     ('quantity', 'sum'): 'items'}

overview.columns = np.vectorize(d.get)(overview.columns)
jpp
  • 159,742
  • 34
  • 281
  • 339
  • Thanks. I know how to proceed from the temporary column names. For completeness you might consider including the final renaming in the answer. – timgeb Aug 13 '18 at 11:43
  • @timgeb, Sure, I've also added a more direct method, which doesn't require an intermediary renaming. – jpp Aug 13 '18 at 11:59
  • Now I understand. The container `overview.columns` works very differently than what its printout would suggest. In fact, I can treat it just like a list of tuples `[('total', 'sum'), ('total', 'count'), ('quantity', 'sum')]` in this context. – timgeb Aug 15 '18 at 07:13
  • @timgeb, Exactly so. There's even a method to construct an `Index` from a list of tuples: [`pd.MultiIndex.from_tuples`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.MultiIndex.from_tuples.html). – jpp Aug 15 '18 at 08:05