Selecting columns from pandas MultiIndex

Question

I have DataFrame with MultiIndex columns that looks like this:

# sample data
col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'],
                                ['a', 'b', 'c', 'a', 'b', 'c']])
data = pd.DataFrame(np.random.randn(4, 6), columns=col)
data

sample data

What is the proper, simple way of selecting only specific columns (e.g. ['a', 'c'], not a range) from the second level?

Currently I am doing it like this:

import itertools
tuples = [i for i in itertools.product(['one', 'two'], ['a', 'c'])]
new_index = pd.MultiIndex.from_tuples(tuples)
print(new_index)
data.reindex_axis(new_index, axis=1)

expected result

It doesn't feel like a good solution, however, because I have to bust out itertools, build another MultiIndex by hand and then reindex (and my actual code is even messier, since the column lists aren't so simple to fetch). I am pretty sure there has to be some ix or xs way of doing this, but everything I tried resulted in errors.

Not, I haven't. You mean to more quickly construct the MultiIndex? If so, that's not the point - I would like to avoid it and index directly with something like `data.xs(['a', 'c'], axis=1, level=1)` — metakermit, Aug 27 '13 at 16:04
Is there a reason you have that level as the second and not the first level? — BrenBarn, Aug 27 '13 at 16:06
It's more intuitive to me visually for the kind of data I have. Also, I wanted to learn how to do it generically - for an arbitrary level. — metakermit, Aug 28 '13 at 09:39
In later versions of pandas, you can use `loc` along with the `pd.IndexSlice` API which is now the preferred way of slicing MultIndexs. See [this answer](https://stackoverflow.com/a/54337009/4909087), and [this post.](https://stackoverflow.com/questions/53927460/how-do-i-slice-or-filter-multiindex-dataframe-levels/53927461#53927461) — cs95, Jan 23 '19 at 23:03

Guilherme Salomé · Answer 1 · 2020-01-14T21:45:29.673

The most straightforward way is with .loc:

>>> data.loc[:, (['one', 'two'], ['a', 'b'])]


   one       two     
     a    b    a    b
0  0.4 -0.6 -0.7  0.9
1  0.1  0.4  0.5 -0.3
2  0.7 -1.6  0.7 -0.8
3 -0.9  2.6  1.9  0.6

Remember that [] and () have special meaning when dealing with a MultiIndex object:

(...) a tuple is interpreted as one multi-level key

(...) a list is used to specify several keys [on the same level]

(...) a tuple of lists refer to several values within a level

When we write (['one', 'two'], ['a', 'b']), the first list inside the tuple specifies all the values we want from the 1st level of the MultiIndex. The second list inside the tuple specifies all the values we want from the 2nd level of the MultiIndex.

Edit 1: Another possibility is to use slice(None) to specify that we want anything from the first level (works similarly to slicing with : in lists). And then specify which columns from the second level we want.

>>> data.loc[:, (slice(None), ["a", "b"])]

   one       two     
     a    b    a    b
0  0.4 -0.6 -0.7  0.9
1  0.1  0.4  0.5 -0.3
2  0.7 -1.6  0.7 -0.8
3 -0.9  2.6  1.9  0.6

If the syntax slice(None) does appeal to you, then another possibility is to use pd.IndexSlice, which helps slicing frames with more elaborate indices.

>>> data.loc[:, pd.IndexSlice[:, ["a", "b"]]]

   one       two     
     a    b    a    b
0  0.4 -0.6 -0.7  0.9
1  0.1  0.4  0.5 -0.3
2  0.7 -1.6  0.7 -0.8
3 -0.9  2.6  1.9  0.6

When using pd.IndexSlice, we can use : as usual to slice the frame.

Source: MultiIndex / Advanced Indexing, How to use slice(None)

note that the column name of the resulting DataFrame is `a b a b` and not `a c a c`. — Silvan Mühlemann, Jan 14 '20 at 19:09
@SilvanMühlemann I fixed, please take a look and let me know if there are other issues! Thanks for helping. — Guilherme Salomé, Jan 14 '20 at 21:45

score 31 · Accepted Answer · edited Jan 27 '19 at 12:58

31

It's not great, but maybe:

>>> data
        one                           two                    
          a         b         c         a         b         c
0 -0.927134 -1.204302  0.711426  0.854065 -0.608661  1.140052
1 -0.690745  0.517359 -0.631856  0.178464 -0.312543 -0.418541
2  1.086432  0.194193  0.808235 -0.418109  1.055057  1.886883
3 -0.373822 -0.012812  1.329105  1.774723 -2.229428 -0.617690
>>> data.loc[:,data.columns.get_level_values(1).isin({"a", "c"})]
        one                 two          
          a         c         a         c
0 -0.927134  0.711426  0.854065  1.140052
1 -0.690745 -0.631856  0.178464 -0.418541
2  1.086432  0.808235 -0.418109  1.886883
3 -0.373822  1.329105  1.774723 -0.617690

would work?

edited Jan 27 '19 at 12:58

metakermit

21,267
15
86
95

answered Aug 27 '13 at 16:22

DSM

342,061
65
592
494

Actually I think this is the optimal way of filtering out a list of labels in an arbitrary level of MultiIndex without creating all the tuples. I would just use `loc` for clarity. – Viktor Kerkez Aug 27 '13 at 16:55
To preserve the order of columns, it is better to use `isin(["a", "b"])`. – Peaceful Apr 14 '17 at 14:39
@Peaceful: what? That doesn't change anything. The result of the isin call is a bool Series, and its order is determined by the order of the original Series, not the argument to isin. – DSM Apr 14 '17 at 14:46
I tried it. And because `{"a", "b"}` is dictionary, it gave me columns ordered as `{"b", "a"}`. Of course I had different column names. What is going on? – Peaceful Apr 14 '17 at 14:52
`{"a", "b"}` is a set, not a dictionary, and that has nothing to do with how isin works. If you have a question about how pandas is behaving, please open a new question instead of commenting on a four-year-old answer. – DSM Apr 14 '17 at 14:54
I meant set only. Sorry for that. But I do get that behavior. And using list instead of set solves the issue. I only casually asked what was going on. It is allowed to comment I guess. – Peaceful Apr 14 '17 at 14:57
@Peaceful: no, you don't. Not with `isin` -- that's not how it works. `x.isin(["a","b"])` is the same as `x.isin(["b","a","a"])` is the same as `x.isin({"a", "b"})`. Again, please open a new question if you have a question about how pandas works, so that the answer can be useful to other people. – DSM Apr 14 '17 at 15:03
Note: `DeprecationWarning: .ix is deprecated. Please use .loc for label based indexing or .iloc for positional indexing` The search continues... – Evan Oct 12 '18 at 20:30
@Evan I've updated the answer to use .loc to account for the deprecation. I think the solution based on `pd.IndexSlice` suggested by Nick P is more in pandas spirit, though. – metakermit Jan 27 '19 at 13:06
This is a saviour – godimedia Nov 23 '20 at 22:10

score 19 · Answer 3 · answered Aug 27 '13 at 16:16

19

You can use either, loc or ix I'll show an example with loc:

data.loc[:, [('one', 'a'), ('one', 'c'), ('two', 'a'), ('two', 'c')]]

When you have a MultiIndexed DataFrame, and you want to filter out only some of the columns, you have to pass a list of tuples that match those columns. So the itertools approach was pretty much OK, but you don't have to create a new MultiIndex:

data.loc[:, list(itertools.product(['one', 'two'], ['a', 'c']))]

answered Aug 27 '13 at 16:16

Viktor Kerkez

45,070
12
104
85

And even `.loc` and similar are not necessary. `data[[('one', 'a'), ('one', 'c'), ('two', 'a'), ('two', 'c')]]` works as well. Tested on 0.23.4. – Anton Tarasenko Mar 21 '19 at 17:09

cs95 · Answer 4 · 2019-07-19T01:06:34.693

`ix` and `select` are deprecated!

The use of pd.IndexSlice makes loc a more preferable option to ix and select.

`DataFrame.loc` with `pd.IndexSlice`

# Setup
col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'],
                                ['a', 'b', 'c', 'a', 'b', 'c']])
data = pd.DataFrame('x', index=range(4), columns=col)
data

  one       two      
    a  b  c   a  b  c
0   x  x  x   x  x  x
1   x  x  x   x  x  x
2   x  x  x   x  x  x
3   x  x  x   x  x  x

data.loc[:, pd.IndexSlice[:, ['a', 'c']]]

  one    two   
    a  c   a  c
0   x  x   x  x
1   x  x   x  x
2   x  x   x  x
3   x  x   x  x

You can alternatively an axis parameter to loc to make it explicit which axis you're indexing from:

data.loc(axis=1)[pd.IndexSlice[:, ['a', 'c']]]

  one    two   
    a  c   a  c
0   x  x   x  x
1   x  x   x  x
2   x  x   x  x
3   x  x   x  x

`MultiIndex.get_level_values`

Calling data.columns.get_level_values to filter with loc is another option:

data.loc[:, data.columns.get_level_values(1).isin(['a', 'c'])]

  one    two   
    a  c   a  c
0   x  x   x  x
1   x  x   x  x
2   x  x   x  x
3   x  x   x  x

This can naturally allow for filtering on any conditional expression on a single level. Here's a random example with lexicographical filtering:

data.loc[:, data.columns.get_level_values(1) > 'b']

  one two
    c   c
0   x   x
1   x   x
2   x   x
3   x   x

More information on slicing and filtering MultiIndexes can be found at Select rows in pandas MultiIndex DataFrame.

Both approaches work for me, but the latter seems to be faster. I observed `pd.IndexSlice` three times as long (at least with my dataset, which has a two level column multiindex and a shape of `(3610, 30)`). --> `pd.IndexSlice` with `670 µs ± 4.49 µs per loop` and `data.loc[:, data.columns.get_level_values(1).isin(['a', 'b', 'c'])]` with `215 µs ± 3.05 µs per loop` — Pascal, Nov 15 '21 at 08:20
also: `pd.IndexSlice` does not preserve the order of the columns in my case (`pandas==1.2.4`), the second does. — Pascal, Nov 15 '21 at 08:28

score 17 · Answer 5 · answered Oct 11 '15 at 18:19

I think there is a much better way (now), which is why I bother pulling this question (which was the top google result) out of the shadows:

data.select(lambda x: x[1] in ['a', 'b'], axis=1)

gives your expected output in a quick and clean one-liner:

        one                 two          
          a         b         a         b
0 -0.341326  0.374504  0.534559  0.429019
1  0.272518  0.116542 -0.085850 -0.330562
2  1.982431 -0.420668 -0.444052  1.049747
3  0.162984 -0.898307  1.762208 -0.101360

It is mostly self-explaining, the [1] refers to the level.

Note: `FutureWarning: 'select' is deprecated and will be removed in a future release. You can use .loc[labels.map(crit)] as a replacement.` — Evan, Oct 12 '18 at 20:25

score 11 · Answer 6 · answered Jun 17 '16 at 03:43

To select all columns named 'a' and 'c' at the second level of your column indexer, you can use slicers:

>>> data.loc[:, (slice(None), ('a', 'c'))]

        one                 two          
          a         c         a         c
0 -0.983172 -2.495022 -0.967064  0.124740
1  0.282661 -0.729463 -0.864767  1.716009
2  0.942445  1.276769 -0.595756 -0.973924
3  2.182908 -0.267660  0.281916 -0.587835

Here you can read more about slicers.

score 3 · Answer 7 · answered Aug 22 '18 at 12:51

A slightly easier, to my mind, riff on Marc P.'s answer using slice:

import pandas as pd
col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'], ['a', 'b', 'c', 'a', 'b', 'c']])
data = pd.DataFrame(np.random.randn(4, 6), columns=col)

data.loc[:, pd.IndexSlice[:, ['a', 'c']]]

        one                 two          
          a         c         a         c
0 -1.731008  0.718260 -1.088025 -1.489936
1 -0.681189  1.055909  1.825839  0.149438
2 -1.674623  0.769062  1.857317  0.756074
3  0.408313  1.291998  0.833145 -0.471879

As of pandas 0.21 or so, .select is deprecated in favour of .loc.

score 1 · Answer 8 · answered Mar 29 '22 at 23:24

1

Use df.loc(axis="columns") (or df.loc(axis=1) to access just the columns and slice away:

df.loc(axis="columns")[:, ["a", "c"]]

answered Mar 29 '22 at 23:24

william_grisaitis

5,170
3
33
40

Or equivalently if you're going for brevity, `axis=1`. This is by far the most visually appealing (and probably most concise) solution imho. – fantabolous Apr 28 '23 at 06:23
for sure :) my brain is slow and forgetful sometimes, so i tend to do the human readable thing lol – william_grisaitis May 02 '23 at 14:47

Nevertree · Answer 9 · 2022-11-14T11:23:41.597

For arbitrary level of the column value

If the level of the column index shall be arbitrary, this might help you a bit:

class DataFrameMultiColumn(pd.DataFrame) :
    def loc_multicolumn(self, keys):
        depth = lambda L: isinstance(L, list) and max(map(depth, L))+1
        
        result = []
        col = self.columns
        
        # if depth of keys is 1, all keys need to be true
        if depth(keys) == 1:
            for c in col:
                # select all columns which contain all keys
                if set(keys).issubset(set(c)) : 
                    result.append(c)
        # depth of 2 indicates, 
        # the product of all sublists will be formed
        elif depth(keys) == 2 :
            keys = list(itertools.product(*keys)) 
            for c in col:
                for k in keys :
                    # select all columns which contain all keys
                    if set(k).issubset(set(c)) : 
                        result.append(c)
                        
        else :
            raise ValueError("Depth of the keys list exceeds 2")

        # return with .loc command
        return self.loc[:,result]

.loc_multicolumn will return the same as calling .loc but without specifing the level for each key. Please note that this might be a problem is values are the same in multiple column levels!

Example :

Sample data:

np.random.seed(1)
    col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'],
                                    ['a', 'b', 'c', 'a', 'b', 'c']])
    data = pd.DataFrame(np.random.randint(0, 10, (4,6)), columns=col)
    data_mc = DataFrameMultiColumn(data)

>>> data_mc
      one       two      
        a  b  c   a  b  c
    0   5  8  9   5  0  0
    1   1  7  6   9  2  4
    2   5  2  4   2  4  7
    3   7  9  1   7  0  6

Cases:

List depth 1 requires all elements in the list be fit.

>>> data_mc.loc_multicolumn(['a', 'one'])
  one
    a
0   5
1   1
2   5
3   7
>>> data_mc.loc_multicolumn(['a', 'b'])

Empty DataFrame
Columns: []

Index: [0, 1, 2, 3]

>>> data_mc.loc_multicolumn(['one','a', 'b'])
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3]

List depth 2 allows all elements of the Cartesian product of keys list.

>>> data_mc.loc_multicolumn([['a', 'b']])
  one    two   
    a  b   a  b
0   5  8   5  0
1   1  7   9  2
2   5  2   2  4
3   7  9   7  0
    
>>> data_mc.loc_multicolumn([['one'],['a', 'b']])
  one   
    a  b
0   5  8
1   1  7
2   5  2
3   7  9

For the last: All combination from list(itertools.product(["one"], ['a', 'b'])) are given if all elements in the combination fits.

sammywemmy · Answer 10 · 2023-01-29T02:03:47.613

One option is with select_columns from pyjanitor, where you can use a dictionary to select - the key of the dictionary is the level (either a number or label), and the value is the label(s) to be selected:

# pip install pyjanitor
import pandas as pd
import janitor
data.select_columns({1:['a','c']})

        one                 two          
          a         c         a         c
0 -0.089182 -0.523464 -0.494476  0.281698
1  0.968430 -1.900191 -0.207842 -0.623020
2  0.087030 -0.093328 -0.861414 -0.021726
3 -0.952484 -1.149399  0.035582  0.922857

Uwe Mayer · Answer 11 · 2022-06-02T23:30:32.780

The .loc[:, list of column tuples] approach given in one of the earlier answers fails in case the multi-index has boolean values, as in the example below:

col = pd.MultiIndex.from_arrays([[False, False, True,  True],
                                 [False, True,  False, True]])
data = pd.DataFrame(np.random.randn(4, 4), columns=col)
data.loc[:,[(False, True),(True, False)]]

This fails with a ValueError: PandasArray must be 1-dimensional.

Compare this to the following example, where the index values are strings and not boolean:

col = pd.MultiIndex.from_arrays([["False", "False", "True",  "True"],
                                 ["False", "True",  "False", "True"]])
data = pd.DataFrame(np.random.randn(4, 4), columns=col)
data.loc[:,[("False", "True"),("True", "False")]]

This works fine.

You can transform the first (boolean) scenario to the second (string) scenario with

data.columns = pd.MultiIndex.from_tuples([(str(i),str(j)) for i,j in data.columns],
    names=data.columns.names)

and then access with string instead of boolean column index values (the names=data.columns.names parameter is optional and not relevant to this example). This example has a two-level column index, if you have more levels adjust this code correspondingly.

Getting a boolean multi-level column index arises, for example, if one does a crosstab where the columns result from two or more comparisons.

Ces · Answer 12 · 2022-06-28T08:29:02.327

0

Two answers are here depending on what is the exact output that you need.

If you want to get a one leveled dataframe from your selection (which can be sometimes really useful) simply use :

df.xs('theColumnYouNeed', level=1, axis=1)

If you want to keep the multiindex form (similar to metakermit's answer) :

data.loc[:, data.columns.get_level_values(1) == "columnName"]

Hope this will help someone

edited Jun 28 '22 at 08:29

answered Jun 28 '22 at 08:26

Ces

76
1
7

score 0 · Answer 13 · answered Aug 22 '22 at 07:19

Rename columns before selecting

Sample dataframe

import pandas as pd
import numpy as np
col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'],
                                ['a', 'b', 'c', 'a', 'b', 'c']])
data = pd.DataFrame(np.random.randn(4, 6), columns=col)
data

rename columns

data.columns = ['_'.join(x) for x in data.columns]
data

Subset column

data['one_a']

Selecting columns from pandas MultiIndex

13 Answers13

`ix` and `select` are deprecated!

`DataFrame.loc` with `pd.IndexSlice`

`MultiIndex.get_level_values`

For arbitrary level of the column value

Example :

Sample data:

Cases:

Rename columns before selecting

Linked

Related

Selecting columns from pandas MultiIndex

13 Answers13

ix and select are deprecated!

DataFrame.loc with pd.IndexSlice

MultiIndex.get_level_values

For arbitrary level of the column value

Example :

Sample data:

Cases:

Rename columns before selecting

Linked

Related

`ix` and `select` are deprecated!

`DataFrame.loc` with `pd.IndexSlice`

`MultiIndex.get_level_values`