103

I have the following table:

NSRCODE  PBL_AWI          Area           
CM       BONS             44705.492941
         BTNN            253854.591990
         FONG             41625.590370
         FONS             16814.159680
         Lake             57124.819333
         River             1603.906642
         SONS            583958.444751
         STNN             45603.837177
         clearcut        106139.013930
         disturbed       127719.865675
         lowland         118795.578059
         upland         2701289.270193
LBH      BFNN            289207.169650
         BONS           9140084.716743
         BTNI             33713.160390
         BTNN          19748004.789040
         FONG           1687122.469691
         FONS           5169959.591270
         FTNI            317251.976160
         FTNN           6536472.869395
         Lake            258046.508310
         River            44262.807900
         SONS           4379097.677405
         burn regen      744773.210860
         clearcut         54066.756790
         disturbed       597561.471686
         lowland       12591619.141842
         upland        23843453.638117

Note: Both NSRCODE and PBL_AWI are indices.

How do I search for values in column PBL_AWI? For example, I want to keep the values ['Lake', 'River', 'Upland'].

rachwa
Tristan Forward
  • if your table name is df, then df[df['PBL_AWI'] in ['Lake', 'River', 'Upland']], or df[df['PBL_AWI'] == 'Lake' or df['PBL_AWI'] == 'River' or df['PBL_AWI'] == 'Upland'] should do the trick. Filtering is very basic stuff, you should see http://pandas.pydata.org/pandas-docs/stable/tutorials.html – Inox Aug 10 '14 at 01:07
  • @Inox I realize filtering is very basic stuff; I have done it before on single-index dataframes. However, with the multi-index dataframe, trying what you have above produces "KeyError: u'no item named PBL_AWI'" – Tristan Forward Aug 10 '14 at 02:01
  • Though @CTZhu has the right answer, note that you could use a solution like @Inox's by just doing `df = df.reset_index()` first (see the sketch after these comments). I often resort to this with a MultiIndex if I can't find a better solution. But I'm finding there usually is a better solution, it's just not very discoverable. – ontologist Aug 10 '14 at 05:22
  • If you are interested in learning more about slicing and filtering multiindex DataFrames, please take a look at my post: [How do I slice or filter MultiIndex DataFrame levels?](https://stackoverflow.com/questions/53927460/how-do-i-slice-or-filter-multiindex-dataframe-levels). Thanks! – cs95 Jan 05 '19 at 07:04
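
As @ontologist notes above, resetting the index turns both levels back into ordinary columns, after which plain Boolean filtering works. A minimal sketch, assuming df is the frame from the question:

df_flat = df.reset_index()  # NSRCODE and PBL_AWI become regular columns
df_flat[df_flat['PBL_AWI'].isin(['Lake', 'River', 'Upland'])]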

7 Answers

141

You can use get_level_values in conjunction with Boolean slicing.

In [50]: print(df[np.in1d(df.index.get_level_values(1), ['Lake', 'River', 'Upland'])])
                          Area
NSRCODE PBL_AWI               
CM      Lake      57124.819333
        River      1603.906642
LBH     Lake     258046.508310
        River     44262.807900

The same idea can be expressed in many different ways, for example:

df[df.index.get_level_values('PBL_AWI').isin(['Lake', 'River', 'Upland'])]

Note that your data contains 'upland' (lower case) rather than 'Upland'.
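
For reference, a fully self-contained sketch of the isin variant; the frame below is just a small stand-in for the question's data (note the lower-case 'upland', which the mask deliberately misses):

import pandas as pd

# Small stand-in for the question's table.
df = pd.DataFrame(
    {'Area': [44705.49, 57124.82, 1603.91, 2701289.27, 258046.51, 44262.81]},
    index=pd.MultiIndex.from_tuples(
        [('CM', 'BONS'), ('CM', 'Lake'), ('CM', 'River'), ('CM', 'upland'),
         ('LBH', 'Lake'), ('LBH', 'River')],
        names=['NSRCODE', 'PBL_AWI'],
    ),
)

# Boolean mask over the second index level, then ordinary slicing.
mask = df.index.get_level_values('PBL_AWI').isin(['Lake', 'River', 'Upland'])
print(df[mask])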

CT Zhu
  • I find the second solution useful for a similar problem, selecting rows where multi-index values are greater than a given number: `df[df.index.get_level_values('level_name') > 1]`. – Dmitriy Work May 20 '20 at 08:50
  • 2nd solution should be the answer: clearer, and does not require np. – jmoz Mar 03 '21 at 09:29
  • Would be interesting to see what's faster, though. np is already a dependency of pandas, so hardly a problem. – MrR Apr 26 '21 at 13:54
72

Another (maybe cleaner) way might be this one:

print(df[df.index.isin(['Lake', 'River', 'Upland'], level=1)])

The level parameter specifies the index level, either by position (starting at 0) or by name (here: level='PBL_AWI').
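
For example, selecting by level name rather than by position (assuming df is the frame from the question):

df[df.index.isin(['Lake', 'River', 'Upland'], level='PBL_AWI')]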

dspencer
Nate
  • While this code may solve the question, [including an explanation](https://meta.stackexchange.com/q/114762) of how and why this solves the problem would really help to improve the quality of your post, and probably result in more up-votes. Remember that you are answering the question for readers in the future, not just the person asking now. Please [edit] your answer to add explanations and give an indication of what limitations and assumptions apply. – Brian61354270 Apr 13 '20 at 18:27
  • In this context, where there is already rich discussion of the issues, and if the respondent didn't feel like writing a textbook chapter on the subject, I still find this answer useful. – Michael Tuchman Apr 29 '20 at 00:21
  • Awesome! Using this code, is it possible to apply it on multiple levels? I want "X" on level 1 and "Y" on level 2. – FábioRB Feb 07 '23 at 19:51
  • I believe you'll need to apply the same code for each level separately (there is no option to write `level=[1,2]`, for instance). More details can be found in the [pandas documentation](https://pandas.pydata.org/docs/reference/api/pandas.Index.isin.html) – Nate Feb 24 '23 at 19:27
  • Yep, @FábioRB, to do this with multiple indices, you'd repeat the pattern, but could still do it in one "line". `df[df.index.isin(['Lake', 'River', 'Upland'], level='PBL_AWI') & df.index.isin(['CM'], level='NSRCODE')]` – Kaleb Coberly Feb 26 '23 at 23:29
23

A simpler approach using .loc would be

df.loc[(slice(None),['Lake', 'River', 'Upland']),:]

or for Series

df.loc[(slice(None),['Lake', 'River', 'Upland'])]

slice(None) means no filtering on the first index level; the second level is filtered with the list of values ['Lake', 'River', 'Upland'].
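
pandas also offers pd.IndexSlice as a more readable spelling of the same lookup. A sketch, assuming df is the frame from the question; you may need df = df.sort_index() first, since label-based slicing on a MultiIndex expects a lexsorted index:

idx = pd.IndexSlice
df.loc[idx[:, ['Lake', 'River', 'Upland']], :]  # ':' on the first level, a list on the second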

Skippy le Grand Gourou
  • This should be the accepted answer: indices are by definition intended for doing this kind of direct lookup, without needing to create intermediate boolean arrays. – Gord Stephen Sep 08 '21 at 20:24
  • Did not work for me at first, here is what did: `df.loc[(slice(None), ["Lake", "River", "Upland"]), :]` – Sebastian Hätälä Jan 19 '22 at 09:38
  • Note that if you provide a single value without a list (i.e. `df.loc[(slice(None), "Lake")]`) it will provide the values without the requested index, which is exactly what I needed to convert the last index values into separate columns! – Skippy le Grand Gourou Apr 04 '22 at 15:23
  • What if I wanted to filter by the first level? or by both? – Kaleb Coberly Feb 26 '23 at 23:31
  • @GordStephen, yeah, the intermediate boolean array may add a little more overhead (assuming this solution doesn't also do something similar under the hood), but the real advantage of an index is the sorting that allows for rapid search/filter. Some other solutions still take advantage of that and are more intuitive to use, like the one above using `df.index.isin([...], level=...)`. Whereas others remove the index altogether and completely defeat its purpose. – Kaleb Coberly Feb 26 '23 at 23:38
10

df.filter(regex=..., axis=...) is even more succinct, because it works on both the index (axis=0) and the columns (axis=1). You do not need to worry about levels, and you can be lazy with the regex. A complete example, filtering on the index:

df.filter(regex='Lake|River|Upland', axis=0)


If you transpose it and filter on the columns instead (axis=1 is the default), it works as well:

df.T.filter(regex='Lake|River|Upland')


Now, with a regex you can also easily handle the upper/lower-case issue with 'Upland':

import re

upland = re.compile('Upland', re.IGNORECASE)
df.filter(regex=upland, axis=0)
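
Alternatively, the inline (?i) flag makes the pattern case-insensitive without compiling it separately:

df.filter(regex='(?i)upland', axis=0)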


This is the command used to read the input table above (where input_table holds the table text; sep is a regular expression, so the Python parser engine is used):

import io
import pandas as pd

df = pd.read_csv(io.StringIO(input_table), sep=r"\s{2,}", engine="python").set_index(['NSRCODE', 'PBL_AWI'])

jedi
9

Also (from here):

def filter_by(df, constraints):
    """Filter MultiIndex by sublevels."""
    # One indexer per index level: the requested labels where constrained, slice(None) otherwise.
    indexer = [constraints[name] if name in constraints else slice(None)
               for name in df.index.names]
    # A Series (1-D) takes the bare tuple; a DataFrame needs the trailing comma for row indexing.
    return df.loc[tuple(indexer)] if len(df.shape) == 1 else df.loc[tuple(indexer),]

pd.Series.filter_by = filter_by
pd.DataFrame.filter_by = filter_by

... to be used as

df.filter_by({'PBL_AWI' : ['Lake', 'River', 'Upland']})

(untested with Panels and higher dimension elements, but I do expect it to work)
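
Because the helper walks df.index.names, it also accepts constraints on several levels at once, for example:

df.filter_by({'NSRCODE': ['CM'], 'PBL_AWI': ['Lake', 'River']})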

Pietro Battiston
2

You can also use query:

In [9]: df.query("PBL_AWI == ['Lake', 'River', 'Upland']")
Out[9]: 
                     Area
NSRCODE PBL_AWI          
CM      Lake     57124.82
        River     1603.91
LBH     Lake    258046.51
        River    44262.81

However, due to case sensitivity, 'upland' (lower case) won't be found. Therefore I recommend using str.fullmatch with case=False:

In [10]: df.query("PBL_AWI.str.fullmatch('Lake|River|Upland', case=False).values")
Out[10]: 
                       Area
NSRCODE PBL_AWI            
CM      Lake       57124.82
        River       1603.91
        upland   2701289.27
LBH     Lake      258046.51
        River      44262.81
        upland  23843453.64
rachwa
0

This is an answer to a slight variant of the question that might save someone else a little time. If you are looking for a wildcard-type match to a label whose exact value you don't know, you can use something like this:

q_labels = [label for label in df.index.levels[1] if label.startswith('Q')]
new_df = df[df.index.isin(q_labels, level=1)]
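
One caveat: MultiIndex.levels can retain label values that no longer appear in the frame (e.g. after earlier filtering), so building the candidate list from the live values is slightly safer:

q_labels = [label for label in df.index.get_level_values(1).unique() if label.startswith('Q')]
new_df = df[df.index.isin(q_labels, level=1)]
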
Daniel Cotter