Groupby matching pattern of different groups

Question

I have the following dataframe:

import pandas as pd
df = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
                   'Info': ['info1', 'info2', 'info3', 'info4', 'info5', 'info6', 
                            'info7', 'info8', 'info9', 'info10', 'info11', 'info12'],
                   'Category': ['157/120/RGB', '112/54/RGB', '14/280/CMYK', '50/100/RGB',
                                '150/88/CMYK', '160/100/G', '200/450/CMYK', '65/90/RGB',
                                '111/111/G', '244/250/RGB', '100/100/CMYK', '144/100/G']})

I need one dataframe per distinct trailing pattern in the Category strings, that is, one each for RGB, CMYK and G. Is there a way, maybe using regular expressions, to pass just this piece of the string to the get_group method in order to create these groups? For instance:

df_RGB = df.groupby('Category').get_group('...RGB')

What should I replace the dots with?

score 3 · Accepted Answer · answered Nov 26 '20 at 09:40

You can do this with GroupBy.get_group, after extracting the trailing label into a grouping key.

g = df['Category'].str.extract(r"/*(\w+)$", expand=False)  # trailing label as a Series
keys = g.unique()  # if you want to see all the keys
grouped = df.groupby(g)

df_RGB = grouped.get_group('RGB')

   ID    Info     Category
0   1   info1  157/120/RGB
1   2   info2   112/54/RGB
3   4   info4   50/100/RGB
7   8   info8    65/90/RGB
9  10  info10  244/250/RGB

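The regex pattern captures the run of word characters at the end of each string. The original answer linked to regex101 for a breakdown of the pattern.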
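A minimal sketch of what the pattern extracts, run on a three-row sample rather than the full dataframe. The names `categories` and `labels` are illustrative only; the sample values are taken from the question.

```python
import pandas as pd

# Three sample values in the same 'number/number/label' shape as the question.
categories = pd.Series(['157/120/RGB', '14/280/CMYK', '160/100/G'])

# A raw string keeps \w from being treated as a string escape.
# expand=False returns a Series when the pattern has a single capture group.
labels = categories.str.extract(r'/*(\w+)$', expand=False)

print(labels.tolist())  # ['RGB', 'CMYK', 'G']
```

The resulting Series can be passed straight to `df.groupby(...)` as the grouping key, as long as it shares the dataframe's index.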

Mayank Porwal · Answer 2 · 2020-11-26T09:32:27.617

You can use Series.str.split with df.groupby:

In [3747]: df['actual_category'] = df.Category.str.split('/').str[-1]

In [3765]: d = {k:v.iloc[:, :-1] for k,v in df.groupby('actual_category')}

In [3766]: d
Out[3766]: 
{'CMYK':     ID    Info      Category
 2    3   info3   14/280/CMYK
 4    5   info5   150/88/CMYK
 6    7   info7  200/450/CMYK
 10  11  info11  100/100/CMYK,
 'G':     ID    Info   Category
 5    6   info6  160/100/G
 8    9   info9  111/111/G
 11  12  info12  144/100/G,
 'RGB':    ID    Info     Category
 0   1   info1  157/120/RGB
 1   2   info2   112/54/RGB
 3   4   info4   50/100/RGB
 7   8   info8    65/90/RGB
 9  10  info10  244/250/RGB}

This gives you a dict whose keys are the category names and whose values are the individual dataframes for each category. The iloc[:, :-1] drops the helper column actual_category, which is the last column.

In [3753]: df_RGB = d['RGB']

In [3754]: df_RGB
Out[3754]: 
   ID    Info     Category
0   1   info1  157/120/RGB
1   2   info2   112/54/RGB
3   4   info4   50/100/RGB
7   8   info8    65/90/RGB
9  10  info10  244/250/RGB
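The positional slice `iloc[:, :-1]` relies on the helper column being the last one. As a hedged variation on the same idea (not part of the original answer), the helper column can be dropped by name instead, which keeps working if other columns are added later. The three-row dataframe below is a cut-down version of the question's data.

```python
import pandas as pd

# Cut-down version of the question's dataframe.
df = pd.DataFrame({'ID': [1, 2, 3],
                   'Info': ['info1', 'info2', 'info3'],
                   'Category': ['157/120/RGB', '112/54/RGB', '14/280/CMYK']})

# Helper column holding the label after the last slash.
df['actual_category'] = df['Category'].str.split('/').str[-1]

# Drop the helper column by name rather than by position.
d = {key: group.drop(columns='actual_category')
     for key, group in df.groupby('actual_category')}

print(d['RGB'])
```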

jezrael · Answer 3 · 2020-11-26T09:52:32.003

2

You can create a dictionary of DataFrames by converting the groupby object to a dict, grouping by the value after the last /:

d = dict(iter(df.groupby(df['Category'].str.split('/').str[-1])))
print (d)
{'CMYK':     ID    Info      Category
2    3   info3   14/280/CMYK
4    5   info5   150/88/CMYK
6    7   info7  200/450/CMYK
10  11  info11  100/100/CMYK, 'G':     ID    Info   Category
5    6   info6  160/100/G
8    9   info9  111/111/G
11  12  info12  144/100/G, 'RGB':    ID    Info     Category
0   1   info1  157/120/RGB
1   2   info2   112/54/RGB
3   4   info4   50/100/RGB
7   8   info8    65/90/RGB
9  10  info10  244/250/RGB}

print (d['CMYK'])
    ID    Info      Category
2    3   info3   14/280/CMYK
4    5   info5   150/88/CMYK
6    7   info7  200/450/CMYK
10  11  info11  100/100/CMYK

It is not recommended, but it is possible to create DataFrames named after the groups:

for i, g in df.groupby(df['Category'].str.split('/').str[-1]):
    globals()['df_' + str(i)] = g

print (df_CMYK)

    ID    Info      Category
2    3   info3   14/280/CMYK
4    5   info5   150/88/CMYK
6    7   info7  200/450/CMYK
10  11  info11  100/100/CMYK
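For comparison, a small sketch of the dictionary route, which avoids writing into `globals()`: every group stays reachable by its label, and all groups can be processed in one loop. The name `groups` and the three-row sample are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

# Cut-down version of the question's dataframe, one row per label.
df = pd.DataFrame({'ID': [1, 3, 6],
                   'Info': ['info1', 'info3', 'info6'],
                   'Category': ['157/120/RGB', '14/280/CMYK', '160/100/G']})

# One DataFrame per trailing label, keyed by that label.
groups = dict(iter(df.groupby(df['Category'].str.split('/').str[-1])))

# Look up a single group by name, or loop over all of them.
df_cmyk = groups['CMYK']
for name, frame in groups.items():
    print(name, len(frame))
```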

edited Nov 26 '20 at 09:52

answered Nov 26 '20 at 09:27

I have a question. Which is better to use, the `dict()` constructor or `{}`? Or does it not really matter? – Mayank Porwal Nov 26 '20 at 09:40
@MayankPorwal - Whichever you like more ;) It is the same – jezrael Nov 26 '20 at 09:40
But dict is not the same as your dict comprehension. – jezrael Nov 26 '20 at 09:41
`dict(df.groupby(df['Category'].str.split('/').str[-1]).__iter__())` would eliminate the `tuple` conversion. All that converting to `tuple` does is call `__iter__`, so we can eliminate `tuple`; it's redundant. – Ch3steR Nov 26 '20 at 09:41
@Ch3steR - It is an internal method ;) So rather tuple ;) – jezrael Nov 26 '20 at 09:42
Yes, agreed, but converting to `tuple` is unnecessary. Maybe `dict(iter(df.groupby(...)))`? – Ch3steR Nov 26 '20 at 09:48
@Ch3steR - ya, it is good ;) much better than `.__iter__()` – jezrael Nov 26 '20 at 09:49
@jezrael Yes. For large dfs, making a tuple would be costly on memory. Since `iter` evaluates on demand (lazy evaluation), `iter` is a nice replacement for `tuple` here. – Ch3steR Nov 26 '20 at 09:51
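The two spellings discussed in the comments can be checked side by side. This is only a sketch on a four-row sample (the names `key`, `via_tuple`, `via_iter` and `same` are illustrative); it confirms that both produce the same mapping and makes no claim about memory use.

```python
import pandas as pd

# Cut-down version of the question's dataframe.
df = pd.DataFrame({'ID': [1, 2, 3, 6],
                   'Info': ['info1', 'info2', 'info3', 'info6'],
                   'Category': ['157/120/RGB', '112/54/RGB',
                                '14/280/CMYK', '160/100/G']})

key = df['Category'].str.split('/').str[-1]

# Materialise all (label, frame) pairs first, then build the dict.
via_tuple = dict(tuple(df.groupby(key)))

# Feed the pairs to dict() one at a time through an iterator.
via_iter = dict(iter(df.groupby(key)))

# Same labels, and the same rows under each label.
same = (via_tuple.keys() == via_iter.keys()
        and all(via_tuple[k].equals(via_iter[k]) for k in via_tuple))
print(same)
```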
