35

I'm suspicious that this is trivial, but I yet to discover the incantation that will let me select rows from a Pandas dataframe based on the values of a hierarchical key. So, for example, imagine we have the following dataframe:

import pandas
df = pandas.DataFrame({'group1': ['a','a','a','b','b','b'],
                       'group2': ['c','c','d','d','d','e'],
                       'value1': [1.1,2,3,4,5,6],
                       'value2': [7.1,8,9,10,11,12]
})
df = df.set_index(['group1', 'group2'])

df looks as we would expect:

enter image description here

If df were not indexed on group1 I could do the following:

df['group1' == 'a']

But that fails on this dataframe with an index. So maybe I should think of this like a Pandas series with a hierarchical index:

df['a','c']

Nope. That fails as well.

So how do I select out all the rows where:

  1. group1 == 'a'
  2. group1 == 'a' & group2 == 'c'
  3. group2 == 'c'
  4. group1 in ['a','b','c']
cs95
  • 379,657
  • 97
  • 704
  • 746
JD Long
  • 59,675
  • 58
  • 202
  • 294

4 Answers4

50

Try using xs to be very precise:

In [5]: df.xs('a', level=0)
Out[5]: 
        value1  value2
group2                
c          1.1     7.1
c          2.0     8.0
d          3.0     9.0

In [6]: df.xs('c', level='group2')
Out[6]: 
        value1  value2
group1                
a          1.1     7.1
a          2.0     8.0
Wes McKinney
  • 101,437
  • 32
  • 142
  • 108
  • I *knew* there had to be a more simple idiom. Thanks! – JD Long Aug 13 '12 at 21:40
  • 1
    what about `group1 in ['a','b','c']` – Daniel Aug 14 '12 at 04:01
  • 3
    df[[group1 in ['a', 'b', 'c'] for group1, group2 in df.index]] – Wouter Overmeire Aug 14 '12 at 07:14
  • @lodagro that absolutely works, but I have no idea why. Can you explain how `[group1 in ['a', 'b', 'c'] for group1, group2 in df.index]` works? – JD Long Aug 14 '12 at 13:54
  • df.index behaves like a tuple list here. So you iterate through the tuples, discarding group 2 (you assign the first element of each tuple to "group1" and the second to "group2" but only use "group1") and checking whether the first element of the tuple is in the list ['a', 'b', 'c']. This creates a boolean mask which is then used for the subsetting. – Arthur G Aug 14 '12 at 15:14
  • I suspect though that np.in1d(df.index.labels[0], [label_of_a,label_of_b,label_of_c]) will be much faster and does the same job. Please correct me if I am wrong. – Arthur G Aug 14 '12 at 15:17
  • np.in1d(df.index.labels[0], match(['a','b', 'c'], df.index.levels[0])) seems to do the trick the way I understood the MultiIndex code (labels are created using Categorical.from_arrays(values) ) – Arthur G Aug 14 '12 at 16:48
  • 1
    Technique used is boolean indexing (see also http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing). Arthur explained how i created the boolean index, as he indicated there are other ways to create the True/False iterable. But in the end it comes down to boolean indexing for part #4 of your question. Note that Arthur opened an issue on GitHub with a request to simplify this use case (see https://github.com/pydata/pandas/issues/1766). – Wouter Overmeire Aug 14 '12 at 19:06
  • Thank you all for both explaining things to me and also spending the time and energy to make Pandas better. – JD Long Aug 15 '12 at 13:03
9

Syntax like the following will work:

df.ix['a']
df.ix['a'].ix['c']

since group1 and group2 are indices. Please forgive my previous attempt!

To get at the second index only, I think you have to swap indices:

df.swaplevel(0,1).ix['c']

But I'm sure Wes will correct me if I'm wrong.

Chris Fonnesbeck
  • 4,143
  • 4
  • 29
  • 30
  • 1
    That won't work, because `'group1' == 'a'` is simply False, so this is just `df.ix[False]`, or `df.ix[0]`. As a result, `df.ix['group1'=='a']` will be exactly the same as `df.ix['group2'=='d']`. – DSM Aug 13 '12 at 20:28
  • We're close... but something's not quite right. When I do `df.ix['group2'=='d']` I only get one record back... and it's the first record, not one where group2==d. – JD Long Aug 13 '12 at 20:32
  • @JDLong: it's the first record because 'group2' == 'd' is False, which is 0. Every one of these comparisons is evaluated *first*, and so it's `.ix[0]`, `.ix[0,0]`, and `.ix[0]` again. – DSM Aug 13 '12 at 20:33
  • so `df.ix['a'].ix['d']` works... so that covers the first two examples. But how would I select only on the second part of the index? – JD Long Aug 13 '12 at 20:36
  • I like the swaplevel... but not sure how to use it for indexes with more than 2 levels. Naturally my real life application uses a deep index :( – JD Long Aug 13 '12 at 20:51
  • Well, as long as you knew the depth of the index, you can always `swaplevel(0,x)`, but I assume you do not necessarily know the index of the index! – Chris Fonnesbeck Aug 13 '12 at 20:54
  • swaplevel also takes names as parameter, so if you know at least the name of it, that could help. – K.-Michael Aye Jan 06 '13 at 19:19
1

In Python 0.19.0 there is a new suggested approach, which is explained here1. I believe the clearest example they give is the following, in which they slice from a four-level indexing. This is how the dataframe is made:

In [46]: def mklbl(prefix,n):
   ....:     return ["%s%s" % (prefix,i)  for i in range(n)]
   ....: 

In [47]: miindex = pd.MultiIndex.from_product([mklbl('A',4),
   ....:                                       mklbl('B',2),
   ....:                                       mklbl('C',4),
   ....:                                       mklbl('D',2)])
   ....: 

In [48]: micolumns = pd.MultiIndex.from_tuples([('a','foo'),('a','bar'),
   ....:                                        ('b','foo'),('b','bah')],
   ....:                                       names=['lvl0', 'lvl1'])
   ....: 

In [49]: dfmi = pd.DataFrame(np.arange(len(miindex)*len(micolumns)).reshape((len(miindex),len(micolumns))),
   ....:                     index=miindex,
   ....:                     columns=micolumns).sort_index().sort_index(axis=1)
   ....: 

In [50]: dfmi
Out[50]: 
lvl0           a         b     
lvl1         bar  foo  bah  foo
A0 B0 C0 D0    1    0    3    2
         D1    5    4    7    6
      C1 D0    9    8   11   10
         D1   13   12   15   14
      C2 D0   17   16   19   18
         D1   21   20   23   22
      C3 D0   25   24   27   26
...          ...  ...  ...  ...
A3 B1 C0 D1  229  228  231  230
      C1 D0  233  232  235  234
         D1  237  236  239  238
      C2 D0  241  240  243  242
         D1  245  244  247  246
      C3 D0  249  248  251  250
         D1  253  252  255  254

And this is how they select the different rows:

In [51]: dfmi.loc[(slice('A1','A3'),slice(None), ['C1','C3']),:]
Out[51]: 
lvl0           a         b     
lvl1         bar  foo  bah  foo
A1 B0 C1 D0   73   72   75   74
         D1   77   76   79   78
      C3 D0   89   88   91   90
         D1   93   92   95   94
   B1 C1 D0  105  104  107  106
         D1  109  108  111  110
      C3 D0  121  120  123  122
...          ...  ...  ...  ...
A3 B0 C1 D1  205  204  207  206
      C3 D0  217  216  219  218
         D1  221  220  223  222
   B1 C1 D0  233  232  235  234
         D1  237  236  239  238
      C3 D0  249  248  251  250
         D1  253  252  255  254

So quite simply, in df.loc[(indices),:], you specify the indices that you want to select per level, from highest level to lowest. If you do not want to make a selection of the lowest level(s) of indices, you can omit specifying them. If you don't want to make a slice between other specified levels, you add slice(None). Both cases are shown in the example, where level D is omitted and level B is specified between A and C.

Robin Kramer-ten Have
  • 818
  • 2
  • 13
  • 34
1

Alternatively you can use query:

1. group1 == 'a'

In [11]: df.query('group1 == "a"')
Out[11]: 
               value1  value2
group1 group2                
a      c          1.1     7.1
       c          2.0     8.0
       d          3.0     9.0

2. group1 == 'a' & group2 == 'c'

In [12]: df.query('group1 == "a" & group2 == "c"')
Out[12]: 
               value1  value2
group1 group2                
a      c          1.1     7.1
       c          2.0     8.0

3. group2 == 'c'

In [13]: df.query('group2 == "c"')
Out[13]: 
               value1  value2
group1 group2                
a      c          1.1     7.1
       c          2.0     8.0

4. group1 in ['a','b','c']

In [14]: df.query('group1 in  ["a", "b", "c"]')
Out[14]: 
               value1  value2
group1 group2                
a      c          1.1     7.1
       c          2.0     8.0
       d          3.0     9.0
b      d          4.0    10.0
       d          5.0    11.0
       e          6.0    12.0
rachwa
  • 1,805
  • 1
  • 14
  • 17