541

I have a data frame with a hierarchical index in axis 1 (columns) (from a groupby.agg operation):

     USAF   WBAN  year  month  day  s_PC  s_CL  s_CD  s_CNT  tempf       
                                     sum   sum   sum    sum   amax   amin
0  702730  26451  1993      1    1     1     0    12     13  30.92  24.98
1  702730  26451  1993      1    2     0     0    13     13  32.00  24.98
2  702730  26451  1993      1    3     1    10     2     13  23.00   6.98
3  702730  26451  1993      1    4     1     0    12     13  10.04   3.92
4  702730  26451  1993      1    5     3     0    10     13  19.94  10.94

I want to flatten it, so that it looks like this (names aren't critical - I could rename):

     USAF   WBAN  year  month  day  s_PC  s_CL  s_CD  s_CNT  tempf_amax  tmpf_amin   
0  702730  26451  1993      1    1     1     0    12     13  30.92          24.98
1  702730  26451  1993      1    2     0     0    13     13  32.00          24.98
2  702730  26451  1993      1    3     1    10     2     13  23.00          6.98
3  702730  26451  1993      1    4     1     0    12     13  10.04          3.92
4  702730  26451  1993      1    5     3     0    10     13  19.94          10.94

How do I do this? (I've tried a lot, to no avail.)

Per a suggestion, here is the head in dict form

{('USAF', ''): {0: '702730',
  1: '702730',
  2: '702730',
  3: '702730',
  4: '702730'},
 ('WBAN', ''): {0: '26451', 1: '26451', 2: '26451', 3: '26451', 4: '26451'},
 ('day', ''): {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
 ('month', ''): {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
 ('s_CD', 'sum'): {0: 12.0, 1: 13.0, 2: 2.0, 3: 12.0, 4: 10.0},
 ('s_CL', 'sum'): {0: 0.0, 1: 0.0, 2: 10.0, 3: 0.0, 4: 0.0},
 ('s_CNT', 'sum'): {0: 13.0, 1: 13.0, 2: 13.0, 3: 13.0, 4: 13.0},
 ('s_PC', 'sum'): {0: 1.0, 1: 0.0, 2: 1.0, 3: 1.0, 4: 3.0},
 ('tempf', 'amax'): {0: 30.920000000000002,
  1: 32.0,
  2: 23.0,
  3: 10.039999999999999,
  4: 19.939999999999998},
 ('tempf', 'amin'): {0: 24.98,
  1: 24.98,
  2: 6.9799999999999969,
  3: 3.9199999999999982,
  4: 10.940000000000001},
 ('year', ''): {0: 1993, 1: 1993, 2: 1993, 3: 1993, 4: 1993}}
Trenton McKinney
  • 56,955
  • 33
  • 144
  • 158
Ross R
  • 8,853
  • 7
  • 28
  • 27
  • There is a [suggestion on the `pandas` issue tracker](https://github.com/pandas-dev/pandas/issues/19950) to implement a dedicated method for this. – joelostblom Jul 10 '18 at 01:29
  • 9
    @joelostblom and it has in fact been implemented (pandas 0.24.0 and above). I posted [an answer](https://stackoverflow.com/a/55757002/7515530) but essentially now you can just do `dat.columns = dat.columns.to_flat_index()`. Built-in pandas function. – onlyphantom Apr 19 '19 at 05:52
  • The solution that worked for me is `df.reset_index(drop=True, inplace=True) ` The drop=True was the critical part. – Shane S Dec 08 '21 at 19:23

19 Answers19

727

I think the easiest way to do this would be to set the columns to the top level:

df.columns = df.columns.get_level_values(0)

Note: if the to level has a name you can also access it by this, rather than 0.

.

If you want to combine/join your MultiIndex into one Index (assuming you have just string entries in your columns) you could:

df.columns = [' '.join(col).strip() for col in df.columns.values]

Note: we must strip the whitespace for when there is no second index.

In [11]: [' '.join(col).strip() for col in df.columns.values]
Out[11]: 
['USAF',
 'WBAN',
 'day',
 'month',
 's_CD sum',
 's_CL sum',
 's_CNT sum',
 's_PC sum',
 'tempf amax',
 'tempf amin',
 'year']
Andy Hayden
  • 359,921
  • 101
  • 625
  • 535
  • 35
    *df.reset_index(inplace=True)* could be an alternative solution. – Tobias Jun 07 '17 at 06:15
  • 11
    one minor comment... if you want to use _ for the combine column multilevels.. you could use this... df.columns = ['_'.join(col).strip() for col in df.columns.values] – ihightower Jul 03 '17 at 16:05
  • 46
    minor modification to maintain underscore for joined cols only: `['_'.join(col).rstrip('_') for col in df.columns.values]` – Seiji Armstrong Feb 13 '18 at 19:02
  • 1
    This worked great, if you want just the second column use: df.columns = [col[1] for col in df.columns.values] – user3078500 Jun 25 '19 at 15:51
  • 2
    If you want to use `sum s_CD` instead of `s_CD sum`, one can do `df.columns = ['_'.join(col).rstrip('_') for col in [c[::-1] for c in df.columns.values]]`. – irene Jul 17 '19 at 08:32
  • for me: df.columns = ["_".join(str(c) for c in col) for col in df.columns.values] – Chau Pham Aug 21 '19 at 03:56
  • `df.columns = ['_'.join(col).lstrip('_') for col in df.columns.swaplevel().values]` is a swapped-levels version of what @SeijiArmstrong's suggested. – David J. Oct 02 '19 at 10:43
  • in my case i had to reverse it by df.columns = df.columns.get_level_values(1) – user190245 Jun 03 '20 at 20:07
  • If you have mixed int/str cols `columns = ['_'.join(tuple(map(lambda x: str(x),c))) for c in df.columns]` – Rich Andrews Mar 13 '21 at 01:08
  • Is there anyway to write a lambda to put in the rename method to chain it? – steven May 01 '21 at 21:21
206

All of the current answers on this thread must have been a bit dated. As of pandas version 0.24.0, the .to_flat_index() does what you need.

From panda's own documentation:

MultiIndex.to_flat_index()

Convert a MultiIndex to an Index of Tuples containing the level values.

A simple example from its documentation:

import pandas as pd
print(pd.__version__) # '0.23.4'
index = pd.MultiIndex.from_product(
        [['foo', 'bar'], ['baz', 'qux']],
        names=['a', 'b'])

print(index)
# MultiIndex(levels=[['bar', 'foo'], ['baz', 'qux']],
#           codes=[[1, 1, 0, 0], [0, 1, 0, 1]],
#           names=['a', 'b'])

Applying to_flat_index():

index.to_flat_index()
# Index([('foo', 'baz'), ('foo', 'qux'), ('bar', 'baz'), ('bar', 'qux')], dtype='object')

Using it to replace existing pandas column

An example of how you'd use it on dat, which is a DataFrame with a MultiIndex column:

dat = df.loc[:,['name','workshop_period','class_size']].groupby(['name','workshop_period']).describe()
print(dat.columns)
# MultiIndex(levels=[['class_size'], ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']],
#            codes=[[0, 0, 0, 0, 0, 0, 0, 0], [0, 1, 2, 3, 4, 5, 6, 7]])

dat.columns = dat.columns.to_flat_index()
print(dat.columns)
# Index([('class_size', 'count'),  ('class_size', 'mean'),
#     ('class_size', 'std'),   ('class_size', 'min'),
#     ('class_size', '25%'),   ('class_size', '50%'),
#     ('class_size', '75%'),   ('class_size', 'max')],
#  dtype='object')

Flattening and Renaming in-place

May be worth noting how you can combine that with a simple list comprehension (thanks @Skippy and @mmann1123) to join the elements so your resulting column names are simple strings separated by, for example, underscores:

dat.columns = ["_".join(a) for a in dat.columns.to_flat_index()]
onlyphantom
  • 8,606
  • 4
  • 44
  • 58
108
pd.DataFrame(df.to_records()) # multiindex become columns and new index is integers only
tradinggy
  • 1,211
  • 1
  • 8
  • 6
  • 5
    This works, but leaves behind column names which are difficult to access programmatically and are not queriable – dmeu Jun 29 '17 at 08:53
  • 1
    This will not work with the latest version of pandas. It works with 0.18 but not with 0.20 (latest as of now) – TH22 Oct 10 '17 at 20:59
  • 2
    @dmeu **to preserve column names** `pd.DataFrame(df.to_records(), columns=df.index.names + list(df.columns))` – Teoretic Aug 03 '18 at 10:13
  • 1
    It is preserving column names as tuples for me, and to keep the index i use: `pd.DataFrame(df_volume.to_records(), index=df_volume.index).drop('index', axis=1)` – Jayen Oct 09 '19 at 02:35
50

Andy Hayden's answer is certainly the easiest way -- if you want to avoid duplicate column labels you need to tweak a bit

In [34]: df
Out[34]: 
     USAF   WBAN  day  month  s_CD  s_CL  s_CNT  s_PC  tempf         year
                               sum   sum    sum   sum   amax   amin      
0  702730  26451    1      1    12     0     13     1  30.92  24.98  1993
1  702730  26451    2      1    13     0     13     0  32.00  24.98  1993
2  702730  26451    3      1     2    10     13     1  23.00   6.98  1993
3  702730  26451    4      1    12     0     13     1  10.04   3.92  1993
4  702730  26451    5      1    10     0     13     3  19.94  10.94  1993


In [35]: mi = df.columns

In [36]: mi
Out[36]: 
MultiIndex
[(USAF, ), (WBAN, ), (day, ), (month, ), (s_CD, sum), (s_CL, sum), (s_CNT, sum), (s_PC, sum), (tempf, amax), (tempf, amin), (year, )]


In [37]: mi.tolist()
Out[37]: 
[('USAF', ''),
 ('WBAN', ''),
 ('day', ''),
 ('month', ''),
 ('s_CD', 'sum'),
 ('s_CL', 'sum'),
 ('s_CNT', 'sum'),
 ('s_PC', 'sum'),
 ('tempf', 'amax'),
 ('tempf', 'amin'),
 ('year', '')]

In [38]: ind = pd.Index([e[0] + e[1] for e in mi.tolist()])

In [39]: ind
Out[39]: Index([USAF, WBAN, day, month, s_CDsum, s_CLsum, s_CNTsum, s_PCsum, tempfamax, tempfamin, year], dtype=object)

In [40]: df.columns = ind




In [46]: df
Out[46]: 
     USAF   WBAN  day  month  s_CDsum  s_CLsum  s_CNTsum  s_PCsum  tempfamax  tempfamin  \
0  702730  26451    1      1       12        0        13        1      30.92      24.98   
1  702730  26451    2      1       13        0        13        0      32.00      24.98   
2  702730  26451    3      1        2       10        13        1      23.00       6.98   
3  702730  26451    4      1       12        0        13        1      10.04       3.92   
4  702730  26451    5      1       10        0        13        3      19.94      10.94   




   year  
0  1993  
1  1993  
2  1993  
3  1993  
4  1993
tzelleke
  • 15,023
  • 5
  • 33
  • 49
35
df.columns = ['_'.join(tup).rstrip('_') for tup in df.columns.values]
tvt173
  • 1,746
  • 19
  • 17
19

The easiest and most intuitive solution for me was to combine the column names using get_level_values. This prevents duplicate column names when you do more than one aggregation on the same column:

level_one = df.columns.get_level_values(0).astype(str)
level_two = df.columns.get_level_values(1).astype(str)
df.columns = level_one + level_two

If you want a separator between columns, you can do this. This will return the same thing as Seiji Armstrong's comment on the accepted answer that only includes underscores for columns with values in both index levels:

level_one = df.columns.get_level_values(0).astype(str)
level_two = df.columns.get_level_values(1).astype(str)
column_separator = ['_' if x != '' else '' for x in level_two]
df.columns = level_one + column_separator + level_two

I know this does the same thing as Andy Hayden's great answer above, but I think it is a bit more intuitive this way and is easier to remember (so I don't have to keep referring to this thread), especially for novice pandas users.

This method is also more extensible in the case where you may have 3 column levels.

level_one = df.columns.get_level_values(0).astype(str)
level_two = df.columns.get_level_values(1).astype(str)
level_three = df.columns.get_level_values(2).astype(str)
df.columns = level_one + level_two + level_three
bodily11
  • 584
  • 6
  • 10
19

Yet another short one, using only pandas methods:

df.columns = df.columns.to_flat_index().str.join('_')

Yields as output:

    USAF_  WBAN_  day_  month_  ...  s_PC_sum  tempf_amax  tempf_amin  year_
0  702730  26451     1       1  ...       1.0       30.92       24.98   1993
1  702730  26451     2       1  ...       0.0       32.00       24.98   1993
2  702730  26451     3       1  ...       1.0       23.00        6.98   1993
3  702730  26451     4       1  ...       1.0       10.04        3.92   1993
4  702730  26451     5       1  ...       3.0       19.94       10.94   1993

You'll notice the trailing underscore for columns that were not part of a MultiIndex. You mentioned that you do not care about the name, so that might work for you. In my own similar use case all the columns had two levels, so this simple command created nice names.

odedbd
  • 2,285
  • 3
  • 25
  • 33
18

And if you want to retain any of the aggregation info from the second level of the multiindex you can try this:

In [1]: new_cols = [''.join(t) for t in df.columns]
Out[1]:
['USAF',
 'WBAN',
 'day',
 'month',
 's_CDsum',
 's_CLsum',
 's_CNTsum',
 's_PCsum',
 'tempfamax',
 'tempfamin',
 'year']

In [2]: df.columns = new_cols
Mathew Savage
  • 168
  • 11
Zelazny7
  • 39,946
  • 18
  • 70
  • 84
18

The most pythonic way to do this to use map function.

df.columns = df.columns.map(' '.join).str.strip()

Output print(df.columns):

Index(['USAF', 'WBAN', 'day', 'month', 's_CD sum', 's_CL sum', 's_CNT sum',
       's_PC sum', 'tempf amax', 'tempf amin', 'year'],
      dtype='object')

Update using Python 3.6+ with f string:

df.columns = [f'{f} {s}' if s != '' else f'{f}' 
              for f, s in df.columns]

print(df.columns)

Output:

Index(['USAF', 'WBAN', 'day', 'month', 's_CD sum', 's_CL sum', 's_CNT sum',
       's_PC sum', 'tempf amax', 'tempf amin', 'year'],
      dtype='object')
Scott Boston
  • 147,308
  • 15
  • 139
  • 187
11

After reading through all the answers, I came up with this:

def __my_flatten_cols(self, how="_".join, reset_index=True):
    how = (lambda iter: list(iter)[-1]) if how == "last" else how
    self.columns = [how(filter(None, map(str, levels))) for levels in self.columns.values] \
                    if isinstance(self.columns, pd.MultiIndex) else self.columns
    return self.reset_index() if reset_index else self
pd.DataFrame.my_flatten_cols = __my_flatten_cols

Usage:

Given a data frame:

df = pd.DataFrame({"grouper": ["x","x","y","y"], "val1": [0,2,4,6], 2: [1,3,5,7]}, columns=["grouper", "val1", 2])

  grouper  val1  2
0       x     0  1
1       x     2  3
2       y     4  5
3       y     6  7
  • Single aggregation method: resulting variables named the same as source:

    df.groupby(by="grouper").agg("min").my_flatten_cols()
    
    • Same as df.groupby(by="grouper", as_index=False) or .agg(...).reset_index()
    • ----- before -----
                 val1  2
        grouper         
      
      ------ after -----
        grouper  val1  2
      0       x     0  1
      1       y     4  5
      
  • Single source variable, multiple aggregations: resulting variables named after statistics:

    df.groupby(by="grouper").agg({"val1": [min,max]}).my_flatten_cols("last")
    
    • Same as a = df.groupby(..).agg(..); a.columns = a.columns.droplevel(0); a.reset_index().
    • ----- before -----
                  val1    
                 min max
        grouper         
      
      ------ after -----
        grouper  min  max
      0       x    0    2
      1       y    4    6
      
  • Multiple variables, multiple aggregations: resulting variables named (varname)_(statname):

    df.groupby(by="grouper").agg({"val1": min, 2:[sum, "size"]}).my_flatten_cols()
    # you can combine the names in other ways too, e.g. use a different delimiter:
    #df.groupby(by="grouper").agg({"val1": min, 2:[sum, "size"]}).my_flatten_cols(" ".join)
    
    • Runs a.columns = ["_".join(filter(None, map(str, levels))) for levels in a.columns.values] under the hood (since this form of agg() results in MultiIndex on columns).
    • If you don't have the my_flatten_cols helper, it might be easier to type in the solution suggested by @Seigi: a.columns = ["_".join(t).rstrip("_") for t in a.columns.values], which works similarly in this case (but fails if you have numeric labels on columns)
    • To handle the numeric labels on columns, you could use the solution suggested by @jxstanford and @Nolan Conaway (a.columns = ["_".join(tuple(map(str, t))).rstrip("_") for t in a.columns.values]), but I don't understand why the tuple() call is needed, and I believe rstrip() is only required if some columns have a descriptor like ("colname", "") (which can happen if you reset_index() before trying to fix up .columns)
    • ----- before -----
                 val1           2     
                 min       sum    size
        grouper              
      
      ------ after -----
        grouper  val1_min  2_sum  2_size
      0       x         0      4       2
      1       y         4     12       2
      
  • You want to name the resulting variables manually: (this is deprecated since pandas 0.20.0 with no adequate alternative as of 0.23)

    df.groupby(by="grouper").agg({"val1": {"sum_of_val1": "sum", "count_of_val1": "count"},
                                       2: {"sum_of_2":    "sum", "count_of_2":    "count"}}).my_flatten_cols("last")
    
    • Other suggestions include: setting the columns manually: res.columns = ['A_sum', 'B_sum', 'count'] or .join()ing multiple groupby statements.
    • ----- before -----
                         val1                      2         
                count_of_val1 sum_of_val1 count_of_2 sum_of_2
        grouper                                              
      
      ------ after -----
        grouper  count_of_val1  sum_of_val1  count_of_2  sum_of_2
      0       x              2            2           2         4
      1       y              2           10           2        12
      

Cases handled by the helper function

  • level names can be non-string, e.g. Index pandas DataFrame by column numbers, when column names are integers, so we have to convert with map(str, ..)
  • they can also be empty, so we have to filter(None, ..)
  • for single-level columns (i.e. anything except MultiIndex), columns.values returns the names (str, not tuples)
  • depending on how you used .agg() you may need to keep the bottom-most label for a column or concatenate multiple labels
  • (since I'm new to pandas?) more often than not, I want reset_index() to be able to work with the group-by columns in the regular way, so it does that by default
Nickolay
  • 31,095
  • 13
  • 107
  • 185
  • really great answer, can you please explain the working on of '["_".join(tuple(map(str, t))).rstrip("_") for t in a.columns.values]', thanks in advance – Vineet Jul 26 '18 at 13:05
  • @Vineet I updated my post to indicate that I mentioned that snippet to suggest it has a similar effect to my solution. If you want details on why `tuple()` is needed, you might want to comment on jxstanford's post. Otherwise, it might be helpful to inspect the `.columns.values` in the provided example: `[('val1', 'min'), (2, 'sum'), (2, 'size')]`. 1) `for t in a.columns.values` loops over the columns, for the second column `t == (2, 'sum')`; 2) `map(str, t)` applies `str()` to each "level", resulting in `('2', 'sum')`; 3) `"_".join(('2','sum'))` results in "2_sum", – Nickolay Jul 26 '18 at 17:07
8

A general solution that handles multiple levels and mixed types:

df.columns = ['_'.join(tuple(map(str, t))) for t in df.columns.values]
jxstanford
  • 3,339
  • 3
  • 27
  • 39
  • 1
    In case there are non-hierarchical columns as well: `df.columns = ['_'.join(tuple(map(str, t))).rstrip('_') for t in df.columns.values]` – Nolan Conaway Aug 10 '17 at 20:30
  • Thanks. Was searching for a long. Since my Multilevel index contained integer values. It resolved my problem :) – AnksG Dec 06 '18 at 12:17
5

A bit late maybe, but if you are not worried about duplicate column names:

df.columns = df.columns.tolist()
Niels
  • 182
  • 2
  • 9
3

In case you want to have a separator in the name between levels, this function works well.

def flattenHierarchicalCol(col,sep = '_'):
    if not type(col) is tuple:
        return col
    else:
        new_col = ''
        for leveli,level in enumerate(col):
            if not level == '':
                if not leveli == 0:
                    new_col += sep
                new_col += level
        return new_col

df.columns = df.columns.map(flattenHierarchicalCol)
agartland
  • 1,654
  • 2
  • 15
  • 19
  • 1
    I like it. Leaving out the case where the columns are not hierarchical this can be simplified a lot: `df.columns = ["_".join(filter(None, c)) for c in df.columns]` – Gigo Oct 19 '16 at 22:41
3

Following @jxstanford and @tvt173, I wrote a quick function which should do the trick, regardless of string/int column names:

def flatten_cols(df):
    df.columns = [
        '_'.join(tuple(map(str, t))).rstrip('_') 
        for t in df.columns.values
        ]
    return df
Nolan Conaway
  • 2,639
  • 1
  • 26
  • 42
3

I'll share a straight-forward way that worked for me.

[" ".join([str(elem) for elem in tup]) for tup in df.columns.tolist()]
#df = df.reset_index() if needed
Lean Bravo
  • 361
  • 3
  • 5
2

To flatten a MultiIndex inside a chain of other DataFrame methods, define a function like this:

def flatten_index(df):
  df_copy = df.copy()
  df_copy.columns = ['_'.join(col).rstrip('_') for col in df_copy.columns.values]
  return df_copy.reset_index()

Then use the pipe method to apply this function in the chain of DataFrame methods, after groupby and agg but before any other methods in the chain:

my_df \
  .groupby('group') \
  .agg({'value': ['count']}) \
  .pipe(flatten_index) \
  .sort_values('value_count')
ianmcook
  • 537
  • 4
  • 10
1

You could also do as below. Consider df to be your dataframe and assume a two level index (as is the case in your example)

df.columns = [(df.columns[i][0])+'_'+(datadf_pos4.columns[i][1]) for i in range(len(df.columns))]
Holy cow
  • 21
  • 1
1

Another simple routine.

def flatten_columns(df, sep='.'):
    def _remove_empty(column_name):
        return tuple(element for element in column_name if element)
    def _join(column_name):
        return sep.join(column_name)

    new_columns = [_join(_remove_empty(column)) for column in df.columns.values]
    df.columns = new_columns
Ufos
  • 3,083
  • 2
  • 32
  • 36
  • good alternative, works with `return df` at the end of the code. – Marukox Nov 06 '20 at 18:39
  • @Marukox, note that pandas dataframes in python are mutable. You can operate on them without copying/returning. Which is also why this function has no return statement. – Ufos Nov 14 '20 at 14:26
-1

I found myself with the same question of how to flatten the column names after agg, but I also wanted to preserve the aggregation type as a row name.

To do that, you can use stack(). The result is column names that are flat but also the preservation of the aggregation type. Now you can safely export to a csv, for example.

enter image description here

Bob Baxley
  • 3,551
  • 1
  • 22
  • 28