I admit that I am not a Python guru, but I still find dealing with Pandas DataFrameGroupBy and SeriesGroupBy objects exceptionally counter-intuitive. (I have an R background.)
I have the dataframe below:
import pandas as pd
import numpy as np

df = pd.DataFrame({'id': range(1, 9),
                   'code': ['one', 'one', 'two', 'three',
                            'two', 'three', 'one', 'two'],
                   'colour': ['black', 'white', 'white', 'white',
                              'black', 'black', 'white', 'white'],
                   'irrelevant1': ['foo', 'foo', 'foo', 'bar', 'bar',
                                   'foo', 'bar', 'bar'],
                   'irrelevant2': ['foo', 'foo', 'foo', 'bar', 'bar',
                                   'foo', 'bar', 'bar'],
                   'irrelevant3': ['foo', 'foo', 'foo', 'bar', 'bar',
                                   'foo', 'bar', 'bar'],
                   'amount': np.random.randn(8)},
                  columns=['id', 'code', 'colour', 'irrelevant1',
                           'irrelevant2', 'irrelevant3', 'amount'])
I want to be able to get the ids grouped by code and colour. The code below does the grouping but keeps all columns.
gb = df.groupby(['code','colour'])
gb.head(5)
                 id   code colour irrelevant1 irrelevant2 irrelevant3    amount
code  colour
one   black  0    1    one  black         foo         foo         foo -0.644170
      white  1    2    one  white         foo         foo         foo  0.912372
             6    7    one  white         bar         bar         bar  0.530575
three black  5    6  three  black         foo         foo         foo -0.123806
      white  3    4  three  white         bar         bar         bar -0.387080
two   black  4    5    two  black         bar         bar         bar -0.578107
      white  2    3    two  white         foo         foo         foo  0.768637
             7    8    two  white         bar         bar         bar -0.282577
Questions:
1) In gb, how do I store only the id column (and not even any index) and get rid of the rest?
2) Once I have the desired DataFrameGroupBy gb, how do I access the ids of cases where {code = one and colour = white}? I tried gb.get_group('one','white') and gb.get_group(['one','white']) but they do not work.
3) How do I access entries where {colour = white}, i.e. without specifying the code index?
4) Finally, the manual is not very helpful. Do you know of any sources with examples of how to create and access these grouped objects?
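
To make questions 1 to 3 concrete, here is a minimal sketch of one way these lookups could be written. It is an illustration under assumptions, not necessarily the recommended approach: the irrelevant columns are dropped for brevity, and the names gb_id, one_white, white_ids and ids_by_group are made up for this example.

```python
import numpy as np
import pandas as pd

# Same layout as the frame in the question, minus the irrelevant columns.
df = pd.DataFrame({
    'id': range(1, 9),
    'code': ['one', 'one', 'two', 'three', 'two', 'three', 'one', 'two'],
    'colour': ['black', 'white', 'white', 'white',
               'black', 'black', 'white', 'white'],
    'amount': np.random.randn(8),
})

# 1) Selecting a column on the grouped object gives a SeriesGroupBy over id only.
gb_id = df.groupby(['code', 'colour'])['id']

# 2) With several grouping columns, get_group takes the key as a single tuple.
one_white = gb_id.get_group(('one', 'white')).tolist()

# 3) Filtering on colour alone does not need the groupby: a boolean mask suffices.
white_ids = df.loc[df['colour'] == 'white', 'id'].tolist()

# Hypothetical convenience: every (code, colour) pair mapped to its list of ids.
ids_by_group = {key: ids.tolist() for key, ids in gb_id}
```

With this data, one_white should hold the ids 2 and 7, and white_ids should hold 2, 3, 4, 7 and 8. The dict at the end is only one possible way to get the ids out of the grouped object without carrying an index along.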