1

Hello, I have a dataframe such as:

col1 col2
G1 OP2
G1 OP0
G1 OPP
G1 OPL_Lh
G2 OII
G2 OIP
G2 IOP
G3 TYU
G4 TUI
G4 TYUI
G4 TR_Lh

I would like to group by col1 and remove from the df the groups that do not contain at least one row in col2 that contains '_Lh'.

Here I should keep only G1 and G4, and get:

col1 col2
G1 OP2
G1 OP0
G1 OPP
G1 OPL_Lh
G4 TUI
G4 TYUI
G4 TR_Lh

Does anyone have an idea? Thank you.

Grendel

3 Answers

1

IIUC,

you can build a boolean mask with str.contains, collect the col1 values of the matching rows, and then use isin to keep only the groups that contain _Lh:

m = df[df['col2'].str.contains('_Lh')]['col1']

df[df['col1'].isin(m)].groupby('col1')...

print(df[df['col1'].isin(m)])

   col1    col2
0    G1     OP2
1    G1     OP0
2    G1     OPP
3    G1  OPL_Lh
8    G4     TUI
9    G4    TYUI
10   G4   TR_Lh
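
As a self-contained sketch of the same idea, the mask can also be broadcast per group with groupby(...).transform("any"), which skips the intermediate list of group labels. The sample frame is rebuilt from the question; the names has_lh, keep and result are illustrative, and regex=False is an assumption that '_Lh' should be matched as a literal substring.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "col1": ["G1", "G1", "G1", "G1", "G2", "G2", "G2", "G3", "G4", "G4", "G4"],
    "col2": ["OP2", "OP0", "OPP", "OPL_Lh", "OII", "OIP", "IOP", "TYU",
             "TUI", "TYUI", "TR_Lh"],
})

# Flag each row whose col2 contains the literal substring "_Lh".
has_lh = df["col2"].str.contains("_Lh", regex=False)

# For every row, True if any row in the same col1 group is flagged.
keep = has_lh.groupby(df["col1"]).transform("any")

# Boolean indexing keeps the original index labels (0-3 and 8-10 here).
result = df[keep]
print(result)
```

This keeps everything as vectorised Series operations, so no Python-level function is called per group.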
Umar.H
1

You can do:

filter_=df.loc[df["col2"].str.contains("_Lh"), "col1"].drop_duplicates()

df=df.merge(filter_, on="col1")

Outputs:

  col1    col2
0   G1     OP2
1   G1     OP0
2   G1     OPP
3   G1  OPL_Lh
4   G4     TUI
5   G4    TYUI
6   G4   TR_Lh
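
One practical difference from a boolean mask is visible in the output above: the inner merge renumbers the index from 0, whereas masking keeps the original labels. A minimal sketch comparing the two, assuming the sample data from the question (keys, merged and masked are illustrative names):

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "col1": ["G1", "G1", "G1", "G1", "G2", "G2", "G2", "G3", "G4", "G4", "G4"],
    "col2": ["OP2", "OP0", "OPP", "OPL_Lh", "OII", "OIP", "IOP", "TYU",
             "TUI", "TYUI", "TR_Lh"],
})

# Unique col1 labels of the groups that have at least one "_Lh" row.
keys = df.loc[df["col2"].str.contains("_Lh", regex=False), "col1"].drop_duplicates()

# Inner merge on col1: same rows, but with a fresh 0..n-1 index.
merged = df.merge(keys, on="col1")

# Boolean mask with isin: same rows, original index labels preserved.
masked = df[df["col1"].isin(keys)]

print(merged.index.tolist())
print(masked.index.tolist())
```

If later code relies on the original row labels, the isin form is the safer choice; otherwise the two give the same rows.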
Grzegorz Skibinski
0

Here's a long way to solve this, to illustrate how groupby works.

Begin by creating a function which tests for the string you want:

def contains_str(x, string='_Lh'):
    # True when `string` occurs anywhere in x
    return string in x

Next, iterate over your groups and apply this function:

keep_dict = {}

for label, group_df in df.groupby('col1'):
    keep = group_df['col2'].apply(contains_str).any()
    keep_dict[label] = keep

print(keep_dict)
# {'G1': True, 'G2': False, 'G3': False, 'G4': True}

Feel free to print individual items in the operation to understand their role.

Finally, map that dictionary to your current df:

df_final = df[df['col1'].map(keep_dict)].reset_index(drop=True)

    col1    col2
0   G1      OP2
1   G1      OP0
2   G1      OPP
3   G1      OPL_Lh
4   G4      TUI
5   G4      TYUI
6   G4      TR_Lh

You can condense these steps using the following code:

keep_dict = df.groupby('col1', as_index=True)['col2'].apply(lambda arr: any([contains_str(x) for x in arr])).to_dict()

print(keep_dict)
# {'G1': True, 'G2': False, 'G3': False, 'G4': True}
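
For comparison, here is a sketch of the same per-group lookup built without apply or a helper function, using str.contains followed by a grouped any(). It assumes the sample data from the question and treats '_Lh' as a literal substring (regex=False); it is one possible vectorised variant, not the only one.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "col1": ["G1", "G1", "G1", "G1", "G2", "G2", "G2", "G3", "G4", "G4", "G4"],
    "col2": ["OP2", "OP0", "OPP", "OPL_Lh", "OII", "OIP", "IOP", "TYU",
             "TUI", "TYUI", "TR_Lh"],
})

# Row-level test, then reduce to one boolean per col1 group.
keep_dict = (
    df["col2"].str.contains("_Lh", regex=False)
    .groupby(df["col1"])
    .any()
    .to_dict()
)
print(keep_dict)

# Map the per-group flags back onto the rows, exactly as in the long version.
df_final = df[df["col1"].map(keep_dict)].reset_index(drop=True)
print(df_final)
```

The resulting dictionary has one entry per group, so it can be dropped into the map step shown earlier unchanged.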

I hope this both answers your question and explains what's taking place "behind the scenes" in groupby operations.

Yaakov Bressler
  • @Grendel this should not be the accepted answer: the pandas API is built to explicitly avoid looping, and `apply` should only be used as a last resort. Use vectorised operations, which take advantage of the core pandas code written in `C`. – Umar.H Mar 22 '20 at 22:34
  • See the final step, @Datanovice: `"...condense..."`. There are plenty of reasons to accept this answer: the code is resilient to dynamic input, can handle groupby conditions, etc. – Yaakov Bressler Mar 22 '20 at 23:00
  • Your final step is using apply, there is no need for it. – Umar.H Mar 22 '20 at 23:43
  • There are plenty of reasons to use `apply` – Yaakov Bressler Mar 22 '20 at 23:48
  • Have a read of [this](https://stackoverflow.com/questions/54432583/when-should-i-ever-want-to-use-pandas-apply-in-my-code) – Umar.H Mar 23 '20 at 10:35
  • Thanks for pointing this out @Datanovice – I'll keep in mind in the future. – Yaakov Bressler Mar 23 '20 at 18:56