9

Let's say we have a DataFrame with multiple levels of column headers.

level_0         A                   B                   C          
level_1         P                   P                   P          
level_2         x         y         x         y         x         y
0       -1.027155  0.667489  0.314387 -0.428607  1.277167 -1.328771
1        0.223407 -1.713410  0.480903 -3.517518 -1.412756  0.718804

I want to select a list of columns from a named level.

required_columns = ['A', 'B']
required_level = 'level_0'

Method 1: (deprecated in favor of df.loc)

print df.select(lambda x: x[0] in required_columns, axis=1)

The problem with this is that I have to specify the level with 0. It fails if I use the name of the level.

Method 2:

print df.xs('A', level=required_level, axis=1)

The problem with this is that I can only specify a single value. It fails if I use ['A', 'B'].

Method 3:

print df.ix[:, df.columns.get_level_values(required_level).isin(required_columns)]

This works, but isn't as concise as the previous two methods! :)

Question:

How can I get method 1 or 2 to work? Or, is there a more pythonic way?

The MWE:

import pandas as pd
import numpy as np

header = pd.MultiIndex.from_product([['A', 'B', 'C'],
                                     ['P'],
                                     ['x', 'y']],
                                    names=['level_0',
                                           'level_1',
                                           'level_2'])
df = pd.DataFrame(
    np.random.randn(2, 6),
    columns=header
)

required_columns = ['A', 'B']
required_level = 'level_0'

print df
print df.select(lambda x: x[0] in required_columns, axis=1)
print df.xs('A', level=required_level, axis=1)
print df.ix[:, df.columns.get_level_values(required_level).isin(required_columns)]

Related questions:

  1. pandas dataframe select columns in multiindex
  2. Giving a column multiple indexes/headers
smci
  • 32,567
  • 20
  • 113
  • 146
bluprince13
  • 4,607
  • 12
  • 44
  • 91
  • While it's surprising there is not a better way, I think your method 3 is the best available at them moment. It's very readable. – Pablojim Aug 14 '17 at 08:27
  • 2
    fyi, `ix` was deprecated in pandas 0.20.0 https://pandas.pydata.org/pandas-docs/stable/whatsnew.html#deprecate-ix – Alexander Aug 16 '17 at 23:19
  • 1
    `df.select()` is deprecated in favor of `df.loc()` since pandas 0.21 – smci Dec 18 '17 at 19:22

2 Answers2

7

You can use reindex:

df.reindex(columns=required_columns, level=required_level)

The resulting output:

level_0         A                   B          
level_1         P                   P          
level_2         x         y         x         y
0       -1.265558  0.681565 -0.553084 -1.340652
1        1.705043 -0.512333 -0.785326  0.968391 
root
  • 32,715
  • 6
  • 74
  • 87
  • 1
    `reindex` is nice, but of course it would only return a view of the dataframe. Any subsequent modifications wouldn't be reflected in the original. This might be fine, but really depends on the use case and what the OP is actually trying to achieve. – Alexander Aug 17 '17 at 01:43
3

Have you considered using IndexSlice? It generally requires the columns to first be sorted (in the original dataframe, they were already sorted).

df.sort_index(axis=1, inplace=True)
>>> df.loc[:, pd.IndexSlice[required_columns, :, :]]
# Output:
# level_0         A                   B          
# level_1         P                   P          
# level_2         x         y         x         y
# 0        0.079368 -1.083421  0.129979 -0.558004
# 1       -0.157843 -1.176632 -0.219833  0.935364

Update

The method you choose really depends why you are selecting your data in the first place and whether or not you need to modify your original data via your selection.

First, to make the example a little more challenging, let's use a MultiIndex dataframe that has the same values across different levels and that is unsorted.

required_columns = ['A', 'B']  # Per original question.
required_level = 'level_0'  # Per original question.

np.random.seed(0)
idx = pd.MultiIndex.from_product([list('BAC'), list('AB')], names=['level_0', 'level_1'])
df = pd.DataFrame(np.random.randn(2, len(idx)), columns=idx)
>>> df
# Output:
# level_0         B                   A                   C          
# level_1         A         B         A         B         A         B
# 0        1.764052  0.400157  0.978738  2.240893  1.867558 -0.977278
# 1        0.950088 -0.151357 -0.103219  0.410599  0.144044  1.454274

Return a copy of the data

If you only need to view the data, either directly or for subsequent calculations in a pipeline, then the reindex method mentioned by @root and discussed here in the documentation is a good option.

df2 = df.reindex(columns=required_columns, level=required_level)
>>> df2
# Output:
# level_0         A                   B          
# level_1         A         B         A         B
# 0        0.978738  2.240893  1.764052  0.400157
# 1       -0.103219  0.410599  0.950088 -0.151357

However, if you try to modify this dataframe, the changes won't be reflected in your original.

df2.iloc[0, 0] = np.nan
>>> df  # Check values in original dataframe.  None are `NaN`.
# Output:
# level_0         B                   A                   C          
# level_1         A         B         A         B         A         B
# 0        1.764052  0.400157  0.978738  2.240893  1.867558 -0.977278
# 1        0.950088 -0.151357 -0.103219  0.410599  0.144044  1.454274

Modify the data

An alternative method is to use boolean indexing with loc. You can use a conditional list comprehension to select the desired columns together with get_level_values:

cols = [col in required_columns for col in df.columns.get_level_values(required_level)]
>>> df.loc[:, cols]
# Output:
# level_0         B                   A          
# level_1         A         B         A         B
# 0        1.764052  0.400157  0.978738  2.240893
# 1        0.950088 -0.151357 -0.103219  0.410599

If you are slicing the index instead of the columns, then one would obviously need to change df.columns.get_level_values to df.index.get_level_values in the code snippet above.

You can also modify the original data using loc:

df2 = df.copy()
df2.loc[:, cols] = 1
>>> df2
# Output:
# level_0  B     A            C          
# level_1  A  B  A  B         A         B
# 0        1  1  1  1  1.867558 -0.977278
# 1        1  1  1  1  0.144044  1.454274

Conclusion

Although select is a good option for returning a view of your multi-indexed data, boolean indexing using loc allows you to view or modify your data.

Instead of Method 1 or Method 2, I would use the loc approach described above.

As of pandas 0.20.0, the ix method has been deprecated. I would not recommend Method 3.

Alexander
  • 105,104
  • 32
  • 201
  • 196
  • Since it is the first one, I think `df.loc[:, pd.IndexSlice[required_columns]]` would also work – ayhan Aug 06 '17 at 21:08
  • 2
    Yes, but using `:, :` reads more clearly that there are two additional levels that are not being sliced (in my opinion). – Alexander Aug 06 '17 at 21:14
  • It would do the job, but I'd like to be able to identify the level by name as I've said in the question. The reason is that I may not always know how many levels there are, or which one I'm after. All I know is the name of the level. – bluprince13 Aug 06 '17 at 21:55
  • If you know the `required_columns`, then why wouldn't @ayhan's comment work? In fact, even `df.loc[:, slice('A','B')]` is viable (no `IndexSlice` needed). – andrew_reece Aug 11 '17 at 18:28
  • @andrew_reece Because you can have the same `required_columns` across different index levels (e.g. column `A` could be in both `level_1' and 'level_2'). – Alexander Aug 16 '17 at 23:21