
I have a fairly large DataFrame, say 600 indexes, and want to use filter criteria to produce a reduced version of the DataFrame where the criteria are true. From the research I've done, filtering works well when you're applying expressions to the data and already know the index you're operating on. What I want to do, however, is apply the filtering criteria to the index itself. See the example below.

The MultiIndex is bold; the MultiIndex level names are italic.

[Image: a DataFrame with a row MultiIndex (season, cont) and a column MultiIndex (stn, base)]

I'd like to apply the criteria as follows, or something along these lines:

df = df[MultiIndex.query('base == 115 & Al.isin(stn)')]

Then maybe do something like this:

df = df.transpose()[MultiIndex.query('Fault.isin(cont)')].transpose()

To result in:

[Image: the reduced DataFrame, keeping only the Fault rows and the (Alpha, 115) column]

I think fundamentally I'm trying to produce a boolean list to mask the MultiIndex. If there is a quick way to apply a pandas query to a 2D list, that would be acceptable. As it stands, one option seems to be to take the MultiIndex, convert it to a DataFrame, and then apply the filtering I want to get the True/False array. I'm concerned that this will be slow, though.
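For what it's worth, the index-to-DataFrame idea described above can be sketched like this (the level name `cont`, the labels, and the sample values are assumptions based on the screenshots, not taken verbatim):

```python
import pandas as pd

# Hypothetical row MultiIndex matching the screenshot: (season, cont)
idx = pd.MultiIndex.from_product(
    [["Summer", "Winter"], ["Fault", "Trip"]], names=["season", "cont"]
)
df = pd.DataFrame({"x": range(4)}, index=idx)

# Convert the index to a DataFrame, build a boolean mask from it,
# then use the mask to filter the original DataFrame's rows
idx_df = df.index.to_frame()
mask = idx_df["cont"].str.contains("Fault").values
filtered = df[mask]
```

Here `filtered` keeps only the two `Fault` rows, one per season.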

likethevegetable
  • share a dataframe, both input dataframe and expected output, and not pics : https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples – sammywemmy Feb 08 '20 at 15:09

2 Answers


As you noticed, indexes aren't great for querying using filter expressions. There's df.filter() but it doesn't really seem to work well on a MultiIndex.

You can still filter the MultiIndex values as an iterable of Python tuples, and then use .loc to access the filtered results.

This works:

# Keep only rows whose 'cont' level contains 'Fault'
rows = [(season, cont)
        for (season, cont) in df.index
        if 'Fault' in cont]
# Keep only columns with base == 115 and 'Al' in the station name
cols = [(stn, base)
        for (stn, base) in df.columns
        if base == 115 and 'Al' in stn]
df.loc[rows, cols]
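To make the snippet self-contained, here is a minimal reproduction; the index/column labels are assumed from the other answer's sample output, and the cell values are placeholders:

```python
import pandas as pd

# Sample data assumed from the question's screenshots (values illustrative)
rows_idx = pd.MultiIndex.from_product(
    [["Summer", "Winter"], ["Fault", "Trip"]], names=["season", "cont"]
)
cols_idx = pd.MultiIndex.from_tuples(
    [("Alpha", 115), ("Beta", 115), ("Gamma", 230)], names=["stn", "base"]
)
df = pd.DataFrame(1.0, index=rows_idx, columns=cols_idx)

# Filter the MultiIndex values as plain tuples, then slice with .loc
row_sel = [(season, cont) for (season, cont) in df.index if "Fault" in cont]
col_sel = [(stn, base) for (stn, base) in df.columns
           if base == 115 and "Al" in stn]
result = df.loc[row_sel, col_sel]
```

This leaves two rows (Summer/Fault and Winter/Fault) and one column, (Alpha, 115).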
filbranden
    Thanks for getting back. My solution was to convert the MultiIndex to a DataFrame; something like this: `df_MI = index.to_frame()` then `bool_array = df_MI['seas'].str.contains('winter').values`. I would love to use the cute syntax like `df_MI.query('base < 200')` as it would make chaining operations nice and easy. – likethevegetable Feb 08 '20 at 22:16
    @likethevegetable I added a separate answer regarding unpivoting the data in a way that you can easily use `query()` on it. You can also use `str.contains` inside a `query()` for the substring matches. – filbranden Feb 08 '20 at 22:47

If what you're after is using the df.query() nifty syntax to slice your data, then you're better off "unpivoting" your DataFrame, turning all indices and column labels into regular fields.

You can create an "unpivot" DataFrame with:

df_unpivot = df.stack(level=[0, 1]).rename('value').reset_index()

Which will produce a DataFrame that looks like this:

  season cont  stn   base value
0 Summer Fault Alpha  115   1.0
1 Summer Fault Beta   115   0.8
2 Summer Fault Gamma  230   0.7
3 Summer Trip  Alpha  115   1.2
4 Summer Trip  Beta   115   0.9
...

Which you can then query with:

df_unpivot.query(
    'cont.str.contains("Fault") and '
    'stn.str.contains("Al") and '
    'base == 115',
    engine='python'  # needed for .str methods inside query()
)

Which produces:

  season cont  stn   base value
0 Summer Fault Alpha  115   1.0
6 Winter Fault Alpha  115   0.7

Which are the two values you were expecting.

filbranden
    That is a very cool way to do it, thanks for sharing. I like it; the reason why I have my data with the multiindex format to begin with is that it translates to my Excel sheets very easily. If I need to report on faults with a base of 115, the unpivoted format will translate nicely to a report ready table. Do you have an idea of the performance of unpivoting? I'm toying with some functions that convert the columns and indexes to data frames and it's working quite nicely. – likethevegetable Feb 08 '20 at 22:58
  • For performance measurements, `%timeit` is your friend! But Pandas is supposedly pretty optimized for these kinds of operations, so I'd expect it to be pretty fast. You can always pivot your data back once you've queried it, to get it back to your expected presentation format. – filbranden Feb 08 '20 at 23:21
  • Just wanted to add that a general way to stack the df as I need it would be: `df1 = df.stack(level=list(range(df.columns.nlevels))).rename('value').reset_index()` (the levels passed to `stack` are column levels), then you can pivot (unstack) after the query to get the same format with: `df2 = df1.pivot_table(index=df.index.names, columns=df.columns.names)` then `df2.columns = df2.columns.droplevel(0)` – likethevegetable Feb 09 '20 at 16:36
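Putting that round trip together, a self-contained sketch might look like the following (labels assumed from the answer's sample output; cell values are placeholders):

```python
import pandas as pd

rows_idx = pd.MultiIndex.from_product(
    [["Summer", "Winter"], ["Fault", "Trip"]], names=["season", "cont"]
)
cols_idx = pd.MultiIndex.from_tuples(
    [("Alpha", 115), ("Beta", 115), ("Gamma", 230)], names=["stn", "base"]
)
df = pd.DataFrame(1.0, index=rows_idx, columns=cols_idx)

# Unpivot: stack all column levels into regular fields
unpivoted = (df.stack(level=list(range(df.columns.nlevels)))
               .rename("value")
               .reset_index())

# Query with the nifty syntax (engine="python" for .str methods)
hits = unpivoted.query('cont.str.contains("Fault") and base == 115',
                       engine="python")

# Pivot back to the original wide layout
wide = hits.pivot_table(index=df.index.names, columns=df.columns.names)
wide.columns = wide.columns.droplevel(0)  # drop the 'value' level
```

With this sample, `hits` has four long-format rows (two seasons × two base-115 stations) and `wide` is a 2×2 DataFrame with the original MultiIndex layout restored.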