1

I am lost in a sea of ix, xs, MultiIndex, get_level_values and other Pandas.

I have a Series with a 3-level multi-index. What is an efficient way to slice my Series based on values at the different levels?

My Series looks like this:

days  id                      start_date
0     S0036-4665(00)04200108  2013-05-18      1
3     S0036-4665(00)04200108  2013-05-18      1
5     S0036-4665(00)04200108  2013-05-18      3
13    S0036-4665(00)04200108  2013-05-18      1
19    S0036-4665(00)04200108  2013-05-18      1
39    S0036-4665(00)04200108  2013-05-18      1
...

Obviously the values of id and start_date vary as you go down the fame

I would like to be able to slice based on the following: - days within numeric range - id within a specific set - start_date within a specific date range

So far, I found this solution, which suggests using df[df.index.get_level_values('a').isin([5, 7, 10, 13])], and I figured out that I can do:

s.select(lambda x: x[0] < 20 and (x[1] in set('some id', 'other id') ))

Are either of those the best solution? I felt that I should be able to do something with xs, or ix, but the former seems to only let you filter by a specific value, and the latter only indexes on the position in the series?

Community
  • 1
  • 1
pocketfullofcheese
  • 8,427
  • 9
  • 41
  • 57
  • I just found this question, that may provide an answer. But still would like input on what is an efficient way to do this (it is a very large dataset, about 1M rows) http://stackoverflow.com/questions/17921010/how-to-query-multiindex-index-columns-values-in-pandas – pocketfullofcheese Mar 20 '14 at 21:39
  • try using master, this is already in: http://pandas-docs.github.io/pandas-docs-travis/whatsnew.html#multiindexing-using-slicers; interested to hear some feedback – Jeff Mar 20 '14 at 22:31
  • nice. Is it possible to pass a range of values? The examples use lists of values. – pocketfullofcheese Mar 20 '14 at 22:39

1 Answers1

4

Here's an example; this requires current master and will be available in 0.14. Docs are here: http://pandas-docs.github.io/pandas-docs-travis/indexing.html#multiindexing-using-slicers

Create a multi-index (this happens to be a cartesian product of the inputs, but that is not necessary)

In [28]: s = Series(np.arange(27),
               index=MultiIndex.from_product(
                     [[1,2,3],
                      ['foo','bar','bah'],
                      date_range('20130101',periods=3)])
                    ).sortlevel()

Always make sure that you are fully sorted

In [29]: s.index.lexsort_depth
Out[29]: 3

In [30]: s
Out[30]: 
1  bah  2013-01-01     6
        2013-01-02     7
        2013-01-03     8
   bar  2013-01-01     3
        2013-01-02     4
        2013-01-03     5
   foo  2013-01-01     0
        2013-01-02     1
        2013-01-03     2
2  bah  2013-01-01    15
        2013-01-02    16
        2013-01-03    17
   bar  2013-01-01    12
        2013-01-02    13
        2013-01-03    14
   foo  2013-01-01     9
        2013-01-02    10
        2013-01-03    11
3  bah  2013-01-01    24
        2013-01-02    25
        2013-01-03    26
   bar  2013-01-01    21
        2013-01-02    22
        2013-01-03    23
   foo  2013-01-01    18
        2013-01-02    19
        2013-01-03    20
dtype: int64

This is helpful to define to reduce verbiage (this groups levels together for a single axis)

In [33]: idx = pd.IndexSlice

Select me where level 0 is 2 and level 1 is either bar or foo

In [31]: s.loc[idx[[2],['bar','foo']]]
Out[31]: 
2  bar  2013-01-01    12
        2013-01-02    13
        2013-01-03    14
   foo  2013-01-01     9
        2013-01-02    10
        2013-01-03    11
dtype: int64

Same as above, but also level 2 is equal to 20130102

In [32]: s.loc[idx[[2,3],['bar','foo'],'20130102']]
Out[32]: 
2  bar  2013-01-02    13
   foo  2013-01-02    10
3  bar  2013-01-02    22
   foo  2013-01-02    19
dtype: int64

Here is an example of using a boolean indexer instead of a level indexer.

In [43]: s.loc[idx[[2,3],['bar','foo'],s<20]]
Out[43]: 
2  bar  2013-01-01    12
        2013-01-02    13
        2013-01-03    14
   foo  2013-01-01     9
        2013-01-02    10
        2013-01-03    11
3  foo  2013-01-01    18
        2013-01-02    19
dtype: int64

Here is an example of omitting some levels (note that am not using idx here, as they are essentially equivalent with a Series; more useful when indexing a DataFrame)

In [47]: s.loc[:,['bar','foo'],'20130102']
Out[47]: 
1  bar  2013-01-02     4
   foo  2013-01-02     1
2  bar  2013-01-02    13
   foo  2013-01-02    10
3  bar  2013-01-02    22
   foo  2013-01-02    19
dtype: int64
Jeff
  • 125,376
  • 21
  • 220
  • 187
  • Thanks. I haven't had a chance to test this because I haven't updated pandas from Master yet. But it certainly seems to address my needs. – pocketfullofcheese Mar 20 '14 at 23:27
  • @Jeff, didn't you 'misuse' the IndexSlice a little bit? Because I thought it was to get slices for different levels for either columns or indices. But not for within one axis (columns/indices) between different levels (I mean: all slices/selections for one axis should be within one `idx`). In any case, you can just leave out the `idx` in your examples, like `s.loc[[2],['bar','foo']]`, or otherwise put all levels in the idx like `s.loc[idx[[2],['bar','foo']]]` to prevent suprises that the first one does not work anymore when using a dataframe instead of a series. – joris Mar 21 '14 at 08:14
  • @joris yep. I fixed it up; using ``pd.IndexSlice`` is not strictly necessary when indexing a Series, but a useful idiom – Jeff Mar 21 '14 at 12:34