filter pandas dataframe on one level of a multi level index

Question

If I have a pandas dataframe with a multi level index, how can I filter by one of the levels of that index. For example:

df = pd.DataFrame({"id": [1,2,1,2], "time": [1, 1, 2, 2], "val": [1,2,3,4]})
df.set_index(keys=["id", "time"], inplace=True)

I would like to do something like:

df[df["time"] > 1]

but time is no longer a column. I could keep it as a column but I don't want to drag around copies of data.

Are you taking about something along the lines of iterating through a pandas dataframe? — Shodmoth, May 23 '18 at 19:53
http://pandas.pydata.org/pandas-docs/stable/advanced.html#basic-indexing-on-axis-with-multiindex — wwii, May 23 '18 at 20:01

score 25 · Answer 1 · answered May 23 '18 at 19:55

25

In [17]: df[df.index.get_level_values('time') > 1]
Out[17]:
         val
id time
1  2       3
2  2       4

@piRSquared's solution is more idiomatic though...

answered May 23 '18 at 19:55

1

interesting, this is along the lines of what i would have guessed the answer should look like. will keep in mind that query is standard practice – Alex May 23 '18 at 20:12
^ exactly what I thought. The idiomatic approach gets complicated for a beginner to that method (but who knows the usual way of filtering a df), especially when you'd wanna compare with `.isin()` or something similar. I imagined it wouldn't work out of the box since there aren't enough illustrations on https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html but then I found https://stackoverflow.com/a/33991869/1332401 (so `df[df.A.isin(list_ids)]` vs `df.query('A in @list_ids)`) – arcolife Oct 22 '20 at 08:55

piRSquared · Accepted Answer · 2018-05-23T19:58:22.200

19

`query`

df.query('time > 1')

         val
id time     
1  2       3
2  2       4

DataFrame index must be lexsorted

df.sort_index().loc[pd.IndexSlice[:, 2:], :]

         val
id time     
1  2       3
2  2       4

edited May 23 '18 at 19:58

answered May 23 '18 at 19:52

piRSquared

1

really? you have to pass strings around instead of having it in code? – Alex May 23 '18 at 19:54
This is one approach. Happens to also be a fast one at scale. https://stackoverflow.com/a/46165056/2336654 – piRSquared May 23 '18 at 19:56
5

@Alex Note; `level = 'time', df.query("@level > 1")` works with variables. – cs95 May 23 '18 at 20:05
@piRSquared: alright, i'm sold on the query thing. looks like the the best way to go for me. do people in pandas world generally just build up strings dynamically in code and pass those around? ironically that kind of thing tends to be discouraged in R – Alex May 23 '18 at 20:08
@coldspeed: good tip, was just looking at that in the docs, thanks! – Alex May 23 '18 at 20:08
@coldspeed, I was searching for hooking variables in `.query()`. Thanks. – harvpan May 23 '18 at 20:16
@Alex not necessarily. You can take coldspeed's tip. But honestly, if you're looking for more dynamic components, reconsidered MaxU's approach. – piRSquared May 23 '18 at 20:16