
I am experimenting with the groupby features of pandas, in particular:

gb = df.groupby('model')
gb.hist()

Since gb has 50 groups the result is quite cluttered; I would like to explore the result for only the first 5 groups.

I found how to select a single group with groups or get_group (How to access pandas groupby dataframe by key), but not how to select multiple groups directly. The best I could do is:

groups = dict(list(gb))
subgroup = pd.concat(list(groups.values())[:4])  # list() needed on Python 3, where dict views aren't sliceable
subgroup.groupby('model').hist()

Is there a more direct way?

– lib
  • Selecting the first n groups is a bit vague; perhaps you mean **how can you join the first n groups into a single dataframe**, or something along those lines? Also, how would you like to select the groups: randomly, according to the population of the group, etc.? – dermen Jul 21 '15 at 10:30
  • For now I would just select them by their order, a bit like using head() or tail(), just to get an idea of what the data looks like. I think my method already joins the first groups into a single dataframe, but a more efficient solution would also be nice – lib Jul 21 '15 at 10:37
  • You can get the groups by just calling `gb.groups`, see: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.GroupBy.groups.html. You may be better off just filtering your df first, so `df_filt = df[df['model'].isin(df['model'].unique()[:5])]` then `gb = df_filt.groupby('model')` # rest of code is the same as before – EdChum Jul 21 '15 at 10:40

5 Answers


It'd be easier to just filter your df first and then perform the groupby:

In [155]:

import pandas as pd
import numpy as np

df = pd.DataFrame({'model': np.random.randint(1, 10, 100), 'value': np.random.randn(100)})
# Series.sort() was removed from pandas; sort_values() is the replacement
first_five = df['model'].sort_values().unique()[:5]
gp = df[df['model'].isin(first_five)].groupby('model')
gp.first()
Out[155]:
          value
model          
1     -0.505677
2      1.217027
3     -0.641583
4      0.778104
5     -1.037858
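
With the groupby restricted to those five models, gp.hist() then plots histograms for just those groups, as the question asked.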
– EdChum

You can do something like

new_gb = pandas.concat([gb.get_group(group) for i, group in enumerate(gb.groups) if i < 5]).groupby('model')
new_gb.hist()

That said, I would approach it differently. You can use the collections.Counter object to get groups fast:

import collections

import numpy
import pandas

# note: pandas.np was removed in pandas 1.0; import numpy directly
df = pandas.DataFrame.from_dict({'model': numpy.random.randint(0, 3, 10),
                                 'param1': numpy.random.random(10),
                                 'param2': numpy.random.random(10)})
#   model    param1    param2
#0      2  0.252379  0.985290
#1      1  0.059338  0.225166
#2      0  0.187259  0.808899
#3      2  0.773946  0.696001
#4      1  0.680231  0.271874
#5      2  0.054969  0.328743
#6      0  0.734828  0.273234
#7      0  0.776684  0.661741
#8      2  0.098836  0.013047
#9      1  0.228801  0.827378
model_groups = collections.Counter(df.model)
print(model_groups) #Counter({2: 4, 0: 3, 1: 3})

Now you can iterate over the Counter object like a dictionary, and query the groups you want:

# for example; you can select the models however you like
new_df = pandas.concat([df.query('model==%d' % key) for key, val in model_groups.items() if val < 4])
#   model    param1    param2
#2      0  0.187259  0.808899
#6      0  0.734828  0.273234
#7      0  0.776684  0.661741
#1      1  0.059338  0.225166
#4      1  0.680231  0.271874
#9      1  0.228801  0.827378

Now you can use the regular pandas.DataFrame.groupby method:

gb = new_df.groupby('model')
gb.hist() 

Since model_groups contains all of the groups, you can just pick from it as you wish.
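
For instance, a minimal sketch that keeps the most populous groups, using Counter.most_common (assuming the integer model labels from above):

top_models = [key for key, count in model_groups.most_common(5)]
new_df = pandas.concat([df.query('model==%d' % key) for key in top_models])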

note

If your model column contains string values (names or something) instead of integers, it all works the same; just change the query argument from 'model==%d'%key to 'model=="%s"'%key.
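
A minimal sketch of the string case, with a hypothetical model column of names:

df_str = pandas.DataFrame({'model': ['a', 'b', 'a', 'c'], 'value': numpy.random.random(4)})
counts = collections.Counter(df_str.model)
# keeps models 'b' and 'c', each of which appears only once
new_df = pandas.concat([df_str.query('model=="%s"' % key) for key, val in counts.items() if val < 2])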

– dermen

I don't know of a way to use the .get_group() method with more than one group.

You can, however, iterate through the groups.

It is still a bit ugly, but here is one solution using iteration:

limit = 5
i = 0
for key, group in gb:  # gb is the groupby object from the question
    print(key, group)
    i += 1
    if i >= limit:
        break

You could also do a loop with .get_group(), which, imho, is a little prettier, but still quite ugly.

for key in list(gb.groups.keys())[:2]:
    print(gb.get_group(key))
– firelynx
  • To use the .get_group() method with a groupby over more than one column, you need to pass a tuple with a value for key1, a value for key2, ... – user2265478 Oct 27 '16 at 13:29 (see the sketch below)
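
A minimal sketch of that tuple form, assuming a hypothetical two-column groupby:

import pandas as pd

df2 = pd.DataFrame({'model': ['a', 'a', 'b'], 'year': [1, 2, 1], 'value': [0.1, 0.2, 0.3]})
gb2 = df2.groupby(['model', 'year'])
print(gb2.get_group(('a', 2)))  # one value per grouping column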
gbidx = list(gb.indices.keys())[:4]
dfidx = np.sort(np.concatenate([gb.indices[x] for x in gbidx]))
# gb.indices maps each key to positional row indices, so use iloc
# (df.loc only coincides when the index is the default RangeIndex)
df.iloc[dfidx].groupby('model').hist()

gb.indices is faster than gb.groups or list(gb), and I believe concatenating index arrays is faster than concatenating DataFrames.

I tried this on a big CSV file of ~416M rows, 13 columns (incl. strings) and 720MB in size, grouping by more than one column, then changed the column names to match those in the question.
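
A minimal sketch for checking the speed claim on your own data, using timeit (illustrative only; the frame and group sizes here are assumptions):

import timeit

setup = (
    "import pandas as pd, numpy as np\n"
    "np.random.seed(0)\n"
    "df = pd.DataFrame({'model': np.random.randint(0, 50, 10**6),\n"
    "                   'value': np.random.randn(10**6)})\n"
)

# concatenate positional index arrays, then slice the frame once
via_indices = (
    "gb = df.groupby('model')\n"
    "keys = list(gb.indices)[:4]\n"
    "idx = np.sort(np.concatenate([gb.indices[k] for k in keys]))\n"
    "sub = df.iloc[idx]\n"
)

# concatenate one DataFrame per group
via_concat = (
    "gb = df.groupby('model')\n"
    "groups = dict(list(gb))\n"
    "sub = pd.concat(list(groups.values())[:4])\n"
)

print('indices:', timeit.timeit(via_indices, setup=setup, number=10))
print('concat: ', timeit.timeit(via_concat, setup=setup, number=10))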

– cdarlint
from IPython.display import display  # display() only exists in IPython/Jupyter


def get_groups(group_object):
    # print every group in the groupby object, one after the other
    for key in group_object.groups.keys():
        print(f"____{key}____")
        display(group_object.get_group(key))


# show all groups by calling this function
get_groups(any_group_which_you_made)
  • Hi, thanks for your reply. Posting a code snippet is fine, but it's better if you explain how it solves the OP's question. Welcome to Stack Overflow. – WaLinke Jan 02 '20 at 08:23
  • This does NOT get multiple groups at once, so it is irrelevant to the question. – misantroop Nov 20 '22 at 02:40