1

I observed some strange behaviour of the column names when using to_flat_index() function.

Starting with a MultiIndex dataframe

a=[0,.25, .5, .75]
b=[1, 2, 3, 4]
c=[5, 6, 7, 8]
d=[1, 2, 3, 5]
df=pd.DataFrame(data={('a','a'):a, ('b', 'b'):b, ('c', 'c'):c, ('d', 'd'):d})

Produces this dataframe

      a  b  c  d
      a  b  c  d
0  0.00  1  5  1
1  0.25  2  6  2
2  0.50  3  7  3
3  0.75  4  8  5

Use the .to_flat_index to flatten the index

df.columns = df.columns.to_flat_index()

Produces the following dataframe

    (a, a)  (b, b)  (c, c)  (d, d)
0   0.00    1   5   1
1   0.25    2   6   2
2   0.50    3   7   3
3   0.75    4   8   5

If I try to select a column using df['(a, a)'] method I get a KeyError message. If I try to clean up the column name using df.columns = df.columns.str.lower().str.rstrip() (or any other .str method) I get nan instead of column names

NaN NaN NaN NaN
0   0.00    1   5   1
1   0.25    2   6   2
2   0.50    3   7   3
3   0.75    4   8   5

What am I doing wrong. How can I select the column after using to_flat_index()?

Eldar Vagapov
  • 63
  • 1
  • 7
  • https://stackoverflow.com/questions/14507794/pandas-how-to-flatten-a-hierarchical-index-in-columns This thread might be helpful – Varsha Mar 03 '20 at 22:20

2 Answers2

3

If you want to select one column you have to do it with a tuple:

df[('a', 'a')]

If you only want one level column with the first element of the tuple as the name this should be enough:

df.columns = df.columns.to_flat_index().map(lambda x: x[0])
jjsantoso
  • 1,586
  • 1
  • 12
  • 17
0

With to_flat_index() you are converting the columns into a list of tuple, where elements of tuples are the level values of multinindex.

For the given multindex after flattening, the new columns are [('a', 'a'), ('b', 'b'), ('c', 'c'), ('d', 'd')]

So when you are applying string method's on the values of columns e.g. ('a', 'a') will return None which becomes NaN.

If you're trying to extract one specific level of values you can do so by,

df.columns = df.columns.get_level_values(0) # level index

Assuming a complicated operation on the multiple levels of data frame

join_str_with_char = lambda mi_cols, jstr: jstr.join([_.strip() for _ in mi_cols])

df.columns = df.columns.map(lambda x: join_str_with_char(x,'_'))
DOOM
  • 1,170
  • 6
  • 20