4

I admit that I am not a Python guru, but still I find dealing with Pandas DataFrameGroupBy and SeriesGroupBy objects exceptionally counter-intuitive. ( I have an R background.)

I have the dataframe below:

import pandas as pd
import numpy as np
df = pd.DataFrame({'id' : range(1,9),
                   'code' : ['one', 'one', 'two', 'three',
                             'two', 'three', 'one', 'two'],
                   'colour': ['black', 'white','white','white',
                           'black', 'black', 'white', 'white'],
                   'irrelevant1': ['foo', 'foo', 'foo','bar','bar',
                                     'foo','bar','bar'],
                   'irrelevant2': ['foo', 'foo', 'foo','bar','bar',
                                     'foo','bar','bar'],
                   'irrelevant3': ['foo', 'foo', 'foo','bar','bar',
                                     'foo','bar','bar'],
                   'amount' : np.random.randn(8)},  columns= ['id','code','colour', 'irrelevant1', 'irrelevant2', 'irrelevant3', 'amount'])

I want to be able to get the id's grouped by code and colour. The code below does the grouping but keeps all columns.

gb = df.groupby(['code','colour'])
gb.head(5)
                id   code colour irrelevant1 irrelevant2 irrelevant3    amount
code  colour                                                                  
one   black  0   1    one  black         foo         foo         foo -0.644170
      white  1   2    one  white         foo         foo         foo  0.912372
             6   7    one  white         bar         bar         bar  0.530575
three black  5   6  three  black         foo         foo         foo -0.123806
      white  3   4  three  white         bar         bar         bar -0.387080
two   black  4   5    two  black         bar         bar         bar -0.578107
      white  2   3    two  white         foo         foo         foo  0.768637
             7   8    two  white         bar         bar         bar -0.282577

Questions:

1) In gb, how do I only store the id column (and not even any index) and get rid of the rest?

2) Once I have the desired DataFrameGroupBy gb, how do I access the ids of cases where {code = one and colour=white} ? I tried gb.get_group('one','white') and gb.get_group(['one','white']) but they do not work.

3) How do I access entries where {colour=white}, i.e. lacking the code index ?

4) Finally, the manual is not very helpful, do you know of any sources where there are examples of how to create and access these grouped objects?

Community
  • 1
  • 1
Zhubarb
  • 11,432
  • 18
  • 75
  • 114

1 Answers1

7

For your problem, you don't even need to perform a groupby (but you should read more about it in the prose docs.

A better solution would be a MultiIndex:

In [36]: df = df.set_index(['code', 'colour']).sort_index()

In [37]: df
Out[37]: 
              id irrelevant1 irrelevant2 irrelevant3    amount
code  colour                                                  
one   black    1         foo         foo         foo  0.103045
      white    2         foo         foo         foo  0.751824
      white    7         bar         bar         bar -1.275114
three black    6         foo         foo         foo  0.311305
      white    4         bar         bar         bar -0.416722
two   black    5         bar         bar         bar  1.534859
      white    3         foo         foo         foo -1.068399
      white    8         bar         bar         bar -0.243893

[8 rows x 5 columns]

That takes care of 1.

2: Use the familiar slicing syntax:

In [38]: df.loc['one', 'white']
Out[38]: 
             id irrelevant1 irrelevant2 irrelevant3    amount
code colour                                                  
one  white    2         foo         foo         foo  0.751824
     white    7         bar         bar         bar -1.275114

[2 rows x 5 columns]

3: This is a cross-section, use .xs:

In [39]: df.xs('white', level='colour')
Out[39]: 
       id irrelevant1 irrelevant2 irrelevant3    amount
code                                                   
one     2         foo         foo         foo  0.751824
one     7         bar         bar         bar -1.275114
three   4         bar         bar         bar -0.416722
two     3         foo         foo         foo -1.068399
two     8         bar         bar         bar -0.243893

[5 rows x 5 columns]

4: Examples are all over the place. Check the pandas / groupby tag here, this section of the docs is being worked on right now, the prose docs linked above.

TomAugspurger
  • 28,234
  • 8
  • 86
  • 69
  • Thank you Tom, I do not want to store the entire df in memory (it is huge) but just the `id` column. Is this the best way to achieve that: `df = df.set_index(['code', 'colour']).sort_index(); df = df['id']` ? Also, is the reason you recommend this index resetting approach is because it is faster than `groupby()` ? – Zhubarb Jan 15 '14 at 14:32
  • For large datasets have a look at [this question](http://stackoverflow.com/questions/14262433/large-data-work-flows-using-pandas/14268804#14268804) for an HDF / pandas workflow that is great for doing these cross sections. I'm not sure about your `id` question, but if you just need `id` there's no need to use `set_index`. This will be faster, but it's also more appropriate. Typically with `groupby`, you're doing a `split-apply-combine` operation. In this case it looks like you just have the `split` part, which the `.loc` and `.xs` indexing handle. – TomAugspurger Jan 15 '14 at 14:46
  • Great, finally can you briefly clarify this: "I'm not sure about your id question, but if you just need id there's no need to use `set_index`." I only want to see the `id` column, I am not interested in the rest of the columns. What would the code be then? – Zhubarb Jan 15 '14 at 14:49
  • If you only want the `id` column, it's just `df = df['id']`, or better yet only read that column in the first place. I'm not entirely sure why you would want just the `id` column though. – TomAugspurger Jan 15 '14 at 14:55
  • I want to use `df` later in my code (in a separate function) as a lookup table to retrieve the list of `id`s that match a given code/colour combination. Other columns are necessary for other actions but not for this look-up object. – Zhubarb Jan 15 '14 at 14:57