142

I have a multi-index data frame with columns 'A' and 'B'.

Is there is a way to select rows by filtering on one column of the multi-index without resetting the index to a single column index?

For Example.

# has multi-index (A,B)
df
#can I do this? I know this doesn't work because the index is multi-index so I need to     specify a tuple

df.ix[df.A ==1]
Ratan Uday Kumar
  • 5,738
  • 6
  • 35
  • 54
silencer
  • 2,285
  • 5
  • 18
  • 18
  • possible duplicate of [How to update a subset of a MultiIndexed pandas DataFrame](http://stackoverflow.com/questions/17552997/how-to-update-a-subset-of-a-multiindexed-pandas-dataframe) – Andy Hayden Sep 16 '13 at 18:54
  • Related: [Select rows in pandas MultiIndex DataFrame](https://stackoverflow.com/q/53927460/4909087) (a broad discussion on the same topic by me). – cs95 May 15 '20 at 09:00

7 Answers7

191

One way is to use the get_level_values Index method:

In [11]: df
Out[11]:
     0
A B
1 4  1
2 5  2
3 6  3

In [12]: df.iloc[df.index.get_level_values('A') == 1]
Out[12]:
     0
A B
1 4  1

In 0.13 you'll be able to use xs with drop_level argument:

df.xs(1, level='A', drop_level=False) # axis=1 if columns

Note: if this were column MultiIndex rather than index, you could use the same technique:

In [21]: df1 = df.T

In [22]: df1.iloc[:, df1.columns.get_level_values('A') == 1]
Out[22]:
A  1
B  4
0  1
Andy Hayden
  • 359,921
  • 101
  • 625
  • 535
89

You can also use query which is very readable in my opinion and straightforward to use:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [10, 20, 50, 80], 'C': [6, 7, 8, 9]})
df = df.set_index(['A', 'B'])

      C
A B    
1 10  6
2 20  7
3 50  8
4 80  9

For what you had in mind you can now simply do:

df.query('A == 1')

      C
A B    
1 10  6

You can also have more complex queries using and

df.query('A >= 1 and B >= 50')

      C
A B    
3 50  8
4 80  9

and or

df.query('A == 1 or B >= 50')

      C
A B    
1 10  6
3 50  8
4 80  9

You can also query on different index levels, e.g.

df.query('A == 1 or C >= 8')

will return

      C
A B    
1 10  6
3 50  8
4 80  9

If you want to use variables inside your query, you can use @:

b_threshold = 20
c_threshold = 8

df.query('B >= @b_threshold and C <= @c_threshold')

      C
A B    
2 20  7
3 50  8
Cleb
  • 25,102
  • 20
  • 116
  • 151
  • 1
    Great answer, way more readable indeed. Do you know if it is possible to query two field on different index level like: `df.query('A == 1 or C >= 8')` – obchardon May 09 '18 at 15:01
  • @obchardon: That seems to work fine; I edited my answer using your example. – Cleb May 09 '18 at 15:05
  • 1
    I have times and strings as multiindex which makes problems in the string expression. However, `df.query()` works just fine with variables if they are refered to with an '@' inside the expression in query, e.g. `df.query('A == @var`) for a variable `var`in the environment. – Solly Oct 26 '18 at 15:41
  • @Solly: Thanks, I added this to the answer. – Cleb Oct 26 '18 at 18:19
  • Where is the multi indexing here though? – Lamma Apr 03 '20 at 12:12
  • @Lamma: `A` and `B` are the multi-index; `C` is just a "normal" column. – Cleb Apr 03 '20 at 12:17
  • @Cleb But you have not set it as a multi index? Or am in wrng assume you need `pd.MultiIndex.from_xxx` for it to be counded as multi index? – Lamma Apr 03 '20 at 12:28
  • @Lamma: `df.set_index(['A', 'B'])` takes care of the multi-index :) – Cleb Apr 03 '20 at 12:29
  • @Cleb How does that make them a sub column of C? – Lamma Apr 03 '20 at 12:30
  • @Cleb Secondly do oyu know how to make a column that is multi index with no lower level index to the the row index so can can call `df['apples']` and it iwll show the row with apples in the indexed column? – Lamma Apr 03 '20 at 12:33
  • @Lamma: I don't know what you mean by "sub column". I think it is best if you opened a new question describing the problem in detail using an example. The comment section is not the best place for that. – Cleb Apr 04 '20 at 07:36
  • It is really readable and straightforward, but I timed it and found it's considerably slower than the other alternatives. – ASB Mar 26 '21 at 23:17
  • @ASB: The efficiency depends on the size of your dataset; see detailed timings [in this answer](https://stackoverflow.com/a/46165056/1534017). – Cleb Mar 27 '21 at 07:42
  • Oh wow `query` is so cool! I tried passing an `@col_name` and it didn't work, however you can do: `values = [1, 2] col_name = 'var_1' df.query(f'{col_name} == @values[0] or {col_name} == @values[1]')` and it works like a charm! – Alexis Cllmb May 21 '22 at 23:16
39

You can use DataFrame.xs():

In [36]: df = DataFrame(np.random.randn(10, 4))

In [37]: df.columns = [np.random.choice(['a', 'b'], size=4).tolist(), np.random.choice(['c', 'd'], size=4)]

In [38]: df.columns.names = ['A', 'B']

In [39]: df
Out[39]:
A      b             a
B      d      d      d      d
0 -1.406  0.548 -0.635  0.576
1 -0.212 -0.583  1.012 -1.377
2  0.951 -0.349 -0.477 -1.230
3  0.451 -0.168  0.949  0.545
4 -0.362 -0.855  1.676 -2.881
5  1.283  1.027  0.085 -1.282
6  0.583 -1.406  0.327 -0.146
7 -0.518 -0.480  0.139  0.851
8 -0.030 -0.630 -1.534  0.534
9  0.246 -1.558 -1.885 -1.543

In [40]: df.xs('a', level='A', axis=1)
Out[40]:
B      d      d
0 -0.635  0.576
1  1.012 -1.377
2 -0.477 -1.230
3  0.949  0.545
4  1.676 -2.881
5  0.085 -1.282
6  0.327 -0.146
7  0.139  0.851
8 -1.534  0.534
9 -1.885 -1.543

If you want to keep the A level (the drop_level keyword argument is only available starting from v0.13.0):

In [42]: df.xs('a', level='A', axis=1, drop_level=False)
Out[42]:
A      a
B      d      d
0 -0.635  0.576
1  1.012 -1.377
2 -0.477 -1.230
3  0.949  0.545
4  1.676 -2.881
5  0.085 -1.282
6  0.327 -0.146
7  0.139  0.851
8 -1.534  0.534
9 -1.885 -1.543
robertspierre
  • 3,218
  • 2
  • 31
  • 46
Phillip Cloud
  • 24,919
  • 11
  • 68
  • 88
26

Understanding how to access multi-indexed pandas DataFrame can help you with all kinds of task like that.

Copy paste this in your code to generate example:

# hierarchical indices and columns
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])

# mock some data
data = np.round(np.random.randn(4, 6), 1)
data[:, ::2] *= 10
data += 37

# create the DataFrame
health_data = pd.DataFrame(data, index=index, columns=columns)
health_data

Will give you table like this:

enter image description here

Standard access by column

health_data['Bob']
type       HR   Temp
year visit      
2013    1   22.0    38.6
        2   52.0    38.3
2014    1   30.0    38.9
        2   31.0    37.3


health_data['Bob']['HR']
year  visit
2013  1        22.0
      2        52.0
2014  1        30.0
      2        31.0
Name: HR, dtype: float64

# filtering by column/subcolumn - your case:
health_data['Bob']['HR']==22
year  visit
2013  1         True
      2        False
2014  1        False
      2        False

health_data['Bob']['HR'][2013]    
visit
1    22.0
2    52.0
Name: HR, dtype: float64

health_data['Bob']['HR'][2013][1]
22.0

Access by row

health_data.loc[2013]
subject Bob Guido   Sue
type    HR  Temp    HR  Temp    HR  Temp
visit                       
1   22.0    38.6    40.0    38.9    53.0    37.5
2   52.0    38.3    42.0    34.6    30.0    37.7

health_data.loc[2013,1] 
subject  type
Bob      HR      22.0
         Temp    38.6
Guido    HR      40.0
         Temp    38.9
Sue      HR      53.0
         Temp    37.5
Name: (2013, 1), dtype: float64

health_data.loc[2013,1]['Bob']
type
HR      22.0
Temp    38.6
Name: (2013, 1), dtype: float64

health_data.loc[2013,1]['Bob']['HR']
22.0

Slicing multi-index

idx=pd.IndexSlice
health_data.loc[idx[:,1], idx[:,'HR']]
    subject Bob Guido   Sue
type    HR  HR  HR
year    visit           
2013    1   22.0    40.0    53.0
2014    1   30.0    52.0    45.0
Hrvoje
  • 13,566
  • 7
  • 90
  • 104
  • 1
    this gives `ValueError: cannot handle a non-unique multi-index!` error – Coddy Aug 21 '20 at 20:15
  • in your access by column, how would you do say Bob&HR with Guido &HR in one go? – user3697498 Apr 21 '21 at 19:15
  • @user3697498 you can use pandas query with multiple conditions: https://kanoki.org/2020/01/21/pandas-dataframe-filter-with-multiple-conditions/ – Hrvoje Apr 21 '21 at 20:13
8

You can use DataFrame.loc:

>>> df.loc[1]

Example

>>> print(df)
       result
A B C        
1 1 1       6
    2       9
  2 1       8
    2      11
2 1 1       7
    2      10
  2 1       9
    2      12

>>> print(df.loc[1])
     result
B C        
1 1       6
  2       9
2 1       8
  2      11

>>> print(df.loc[2, 1])
   result
C        
1       7
2      10
a_guest
  • 34,165
  • 12
  • 64
  • 118
umn
  • 431
  • 6
  • 17
  • This is the best of the modern approaches IMO, where df.loc[2, 1]['result'] will handle multi-columns – M__ Sep 09 '19 at 17:11
  • this works with any number of integers for some reason. e.g. `df.loc[0], df.loc[1]....df.loc[n]` – Coddy Aug 21 '20 at 20:16
5

Another option is:

filter1 = df.index.get_level_values('A') == 1
filter2 = df.index.get_level_values('B') == 4

df.iloc[filter1 & filter2]
Out[11]:
     0
A B
1 4  1
Gonzalo Goral
  • 51
  • 1
  • 1
4

You can use MultiIndex slicing. For example:

arrays = [["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
          ["one", "two", "one", "two", "one", "two", "one", "two"]]   
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=["A", "B"])
df = pd.DataFrame(np.random.randint(9, size=(8, 2)), index=index, columns=["col1", "col2"])

         col1  col2
A   B              
bar one     0     8
    two     4     8
baz one     6     0
    two     7     3
foo one     6     8
    two     2     6
qux one     7     0
    two     6     4

To select all from A and two from B:

df.loc[(slice(None), 'two'), :]

Output:

         col1  col2
A   B              
bar two     4     8
baz two     7     3
foo two     2     6
qux two     6     4

To select bar and baz from A and two from B:

df.loc[(['bar', 'baz'], 'two'), :]

Output:

         col1  col2
A   B              
bar two     4     8
baz two     7     3
Mykola Zotko
  • 15,583
  • 3
  • 71
  • 73