Groupby and keep only groups that contain values within a list

Question

hello I have a df such as

list_vales_regex=['ABC',DEF']
Groups Names
G1 ABC_9
G1 ZTY_2
G1 SGG56
G1 BBCHU
G1 DEFE_8
G2 ABC_6
G2 GGDY
G3 ABC_6
G3 DEF98
G3 DEF89
G4 DEF_09
G4 DGE7
G5 DGGE22
G5 DGGE23

and I would like to keep only groups that do contain both (ABC and DEF Names)

Here in the exemple only G1 and G3 have ot be kept. Does someone have an idea ?

jezrael · Accepted Answer · 2020-06-02T09:59:22.587

First get values by list to helper Series by Series.str.extract with | for join values for regex or and then compare values converted to sets per groups in GroupBy.transform:

list_vales_regex=['ABC','DEF']

s = df['Names'].str.extract(f'({"|".join(list_vales_regex)})', expand=False)
df = df[s.groupby(df['Groups']).transform(lambda x: set(x) >= set(list_vales_regex))]
print (df)
  Groups   Names
0     G1   ABC_9
1     G1   ZTY_2
2     G1   SGG56
3     G1   BBCHU
4     G1  DEFE_8
7     G3   ABC_6
8     G3   DEF98
9     G3   DEF89

Or filter by GroupBy.transform with DataFrameGroupBy.nunique (if not large data, because slowier):

df = df[s.groupby(df['Groups']).transform('nunique').ge(2)]

Another approach is find groups with matched all values of list in list comprehension with np.logical_and.reduce for test if match at least one value by Series.any, then filter groups by first value of list by L[0] and pass to Series.isin:

list_vales_regex=['ABC','DEF']

L = [df.set_index('Groups')['Names'].str.contains(x).any(level=0) for x in list_vales_regex]

df = df[df['Groups'].isin(L[0].index[np.logical_and.reduce(L)])]
print (df)
  Groups   Names
0     G1   ABC_9
1     G1   ZTY_2
2     G1   SGG56
3     G1   BBCHU
4     G1  DEFE_8
7     G3   ABC_6
8     G3   DEF98
9     G3   DEF89

Details:

print (np.logical_and.reduce(L))
[ True False  True False False]

print (L[0].index[np.logical_and.reduce(L)])
Index(['G1', 'G3'], dtype='object', name='Groups')

We could do : `df.loc[s.groupby(df['Groups']).transform('nunique').ge(2)]` — ansev, Jun 02 '20 at 09:45
@ansev - Super, or similar `df = df[s.groupby(df['Groups']).transform('nunique').ge(2)]` — jezrael, Jun 02 '20 at 09:47

Groupby and keep only groups that contain values within a list

1 Answers1