10

The latest version of Pandas supports multi-index slicers. However, one needs to know the integer location of the different levels to use them properly.

E.g. the following:

idx = pd.IndexSlice
dfmi.loc[idx[:,:,['C1','C3']],idx[:,'foo']]

assumes that we know that the third row level is the one we want to index with C1 and C3, and that the second column level is the one we want to index with foo.

Sometimes I know the names of the levels but not their location in the multi-index. Is there a way to use multi-index slices in this case?

For example, say that I know what slices I want to apply on each level name, e.g. as a dictionary:

'level_name_1' -> ':' 
'level_name_2' -> ':'
'level_name_3' -> ['C1', 'C3']

but that I don't know the position (depth) of these levels in the multi-index. Does Pandas a built-in indexing mechanism for this?

Can I still use pd.IndexSlice objects somehow if I know level names, but not their position?

PD: I know I could could use reset_index() and then just work with flat columns, but I would like to avoid resetting the index (even if temporarily). I could also use query, but query requires index names to be compatible with Python identifiers (e.g. no spaces, etc).


The closest I have seen for the above is:

df.xs('C1', level='foo')

where foo is the name of the level and C1 is the value of interest.

I know that xs supports multiple keys, e.g.:

df.xs(('one', 'bar'), level=('second', 'first'), axis=1)

but it does not support slices or ranges (like pd.IndexSlice does).

Amelio Vazquez-Reina
  • 91,494
  • 132
  • 359
  • 564
  • Also see [this question](https://stackoverflow.com/questions/22987015/slice-pandas-dataframe-by-multiindex-level-or-sublevel) and [this question](https://stackoverflow.com/questions/25224545/filtering-muliple-items-in-a-multi-index-python-panda-dataframe/) – Pietro Battiston Dec 04 '17 at 15:34

3 Answers3

7

This is still an open issue for enhancement, see here. Its pretty straightforward to support this. pull-requests are welcome!

You can easily do this as a work-around:

In [11]: midx = pd.MultiIndex.from_product([list(range(3)),['a','b','c'],pd.date_range('20130101',periods=3)],names=['numbers','letters','dates'])

In [12]: midx.names.index('letters')
Out[12]: 1

In [13]: midx.names.index('dates')
Out[13]: 2

Here's a complete example

In [18]: df = DataFrame(np.random.randn(len(midx),1),index=midx)

In [19]: df
Out[19]: 
                                   0
numbers letters dates               
0       a       2013-01-01  0.261092
                2013-01-02 -1.267770
                2013-01-03  0.008230
        b       2013-01-01 -1.515866
                2013-01-02  0.351942
                2013-01-03 -0.245463
        c       2013-01-01 -0.253103
                2013-01-02 -0.385411
                2013-01-03 -1.740821
1       a       2013-01-01 -0.108325
                2013-01-02 -0.212350
                2013-01-03  0.021097
        b       2013-01-01 -1.922214
                2013-01-02 -1.769003
                2013-01-03 -0.594216
        c       2013-01-01 -0.419775
                2013-01-02  1.511700
                2013-01-03  0.994332
2       a       2013-01-01 -0.020299
                2013-01-02 -0.749474
                2013-01-03 -1.478558
        b       2013-01-01 -1.357671
                2013-01-02  0.161185
                2013-01-03 -0.658246
        c       2013-01-01 -0.564796
                2013-01-02 -0.333106
                2013-01-03 -2.814611

This is your dict of level names -> slices

In [20]: slicers = { 'numbers' : slice(0,1), 'dates' : slice('20130102','20130103') }

This creates an indexer that is empty (selects everything)

In [21]: indexer = [ slice(None) ] * len(df.index.levels)

Add in your slicers

In [22]: for n, idx in slicers.items():
              indexer[df.index.names.index(n)] = idx

And select (this has to be a tuple, but was a list to start as we had to modify it)

In [23]: df.loc[tuple(indexer),:]
Out[23]: 
                                   0
numbers letters dates               
0       a       2013-01-02 -1.267770
                2013-01-03  0.008230
        b       2013-01-02  0.351942
                2013-01-03 -0.245463
        c       2013-01-02 -0.385411
                2013-01-03 -1.740821
1       a       2013-01-02 -0.212350
                2013-01-03  0.021097
        b       2013-01-02 -1.769003
                2013-01-03 -0.594216
        c       2013-01-02  1.511700
                2013-01-03  0.994332
Jeff
  • 125,376
  • 21
  • 220
  • 187
  • Thank you @Jeff. This s effectively what I was looking for. Regarding the discussion in the GitHub thread, why is what you have written here not enough? Wouldn't a PR just encapsulate what you wrote above? Or is there anything else that remains to be done? – Amelio Vazquez-Reina Jun 09 '14 at 19:57
  • 2
    the PR would basically do what I have here (with some adds tl error checking tests and docs) – Jeff Jun 09 '14 at 20:13
1

I use a custom function for this. The name is sel, inspired by the xarray method of the same name.

def sel(df, /, **kwargs):
    """
    Select into a DataFrame by MultiIndex name and value

    This function is similar in functionality to pandas .xs() and even more similar (in interface) to xarray's .sel().

    Example:

    >>> index = pd.MultiIndex.from_product([['TX', 'FL', 'CA'],
    ...                                     ['North', 'South']],
    ...                                    names=['State', 'Direction'])
    >>> df = pd.DataFrame(index=index,
    ...                   data=np.random.randint(0, 10, (6,4)),
    ...                   columns=list('abcd'))
    >>> sel(df, State='TX')
                        a  b  c  d
       State Direction
       TX    North      5  5  9  5
             South      0  6  8  2
    >>> sel(df, State=['TX', 'FL'], Direction='South')
                        a  b  c  d
       State Direction
       TX    South      0  6  8  2
       FL    South      6  7  5  2

    indexing syntax is index_name=indexer where the indexer can be:

    - single index value
    - slice by using the slice() function
    - a list of index values
    - other indexing modes supported by indivdual axes in .loc[]

    Unnamed index levels can be selected using names _0, _1 etc where the number is the index level.

    raises KeyError if an invalid index level name is used.
    """
    # argument checking
    available_names = [name or f'_{i}' for i, name in enumerate(df.index.names)]
    extra_args = set(kwargs.keys()) - set(available_names)
    if extra_args:
        raise KeyError(f"Invalid keyword arguments, no index(es) {extra_args} in dataframe. Available indexes: {available_names}.")
    # compute indexers per index level
    index_sel = tuple(kwargs.get(name or f'_{i}', slice(None)) for i, name in enumerate(df.index.names))
    if not index_sel:
        index_sel = slice(None)
    # Fixup for single level indexes
    if len(df.index.names) == 1 and index_sel:
        index_sel = index_sel[0]
    return df.loc[index_sel, :]
creanion
  • 2,319
  • 2
  • 13
  • 17
0

The .query() method doesn't support the regular slices in the same way unfortunately but it does support selecting on index levels by name and with intervals! Thus it qualifies as another answer to your question.

Query supports quoting index names by using backticks, as shown below.

# Get an example dataset from seaborn
import pandas as pd
import seaborn as sns
df = sns.load_dataset("penguins")
df = df.rename_axis("numerical index / ħ")  # strange name to show escaping.
df = df.set_index(['species', 'island'], append=True)


# Working examples
# less than
df.query("`numerical index / ħ` < 100")

# range
slc = range(9, 90)
df.query("`numerical index / ħ` in @slc")

# Subsets
islands = ['Dream', 'Biscoe']
df.query("island in @islands and species == 'Adelie'")

penguins example table

creanion
  • 2,319
  • 2
  • 13
  • 17