How to delete rows that contain the same string again and again

Question

I have a DataFrame like below

  Name                                    Mail-Body
 Oliver                         I am recently doing AAA, BBB and BBB....
  Jack                          Here is my report. It seemed AAA was.. so AAA is..   
 Jacob                          How are you doing? Next week we launch our AAA porject...

And with this DataFrame, I would like perform some data analysis. But I found out emails containing names such as "AAA" and "BBB"so many times tend to be just a scheduling notification and something like that so it is pretty much meaningless. So I would like to drop all the rows that contain the same string such as "AAA" and "BBB"more than 5 times in Mail-Body column.

Is there any pythonic way to drop all rows?

Is this what you are looking for: https://stackoverflow.com/questions/13851535/delete-rows-from-a-pandas-dataframe-based-on-a-conditional-expression-involving ? — Cleared, Mar 24 '20 at 09:06

jezrael · Answer 1 · 2020-03-24T11:31:52.957

Sample:

print (df)
     Name                                        Mail-Body
0  Oliver     I AAA BBB am recently doing AAA, BBB and BBB
1    Jack   AAA AAA. AAA BBB It seemed AAA was.. so AAA is
2   Jacob      AAA AAA BBB BBB AAA BBB AAA AAA BBB BBB BBB
3     Bal                                          AAA BBB

If want remove rows with AAA more like 5 times it means need filter rows with less like 5 values of AAA with Series.str.count, Series.lt and boolean indexing:

df0 = df[df['Mail-Body'].str.count('AAA').lt(5)]
print (df0)
     Name                                     Mail-Body
0  Oliver  I AAA BBB am recently doing AAA, BBB and BBB
3     Bal                                       AAA BBB

If want filter like AAA or BBB values sum together per row, not important how much AAA and how much BBB use AAA|BBB pattern:

df1 = df[df['Mail-Body'].str.count('AAA|BBB').lt(5)]
print (df1)
  Name Mail-Body
3  Bal   AAA BBB

If want test separately AAA and BBB - chain masks by | for bitwise OR - so it means filter less like 5 values of AAA or less like 5 values B:

df2 = df[df['Mail-Body'].str.count('AAA').lt(5) | df['Mail-Body'].str.count('BBB').lt(5)]
print (df2)
     Name                                        Mail-Body
0  Oliver     I AAA BBB am recently doing AAA, BBB and BBB
1    Jack   AAA AAA. AAA BBB It seemed AAA was.. so AAA is
3     Bal                                          AAA BBB

And if want filter with and by & for bitwise AND solution is:

df3 = df[df['Mail-Body'].str.count('AAA').lt(5) & df['Mail-Body'].str.count('BBB').lt(5)]
print (df3)
     Name                                     Mail-Body
0  Oliver  I AAA BBB am recently doing AAA, BBB and BBB
3     Bal                                       AAA BBB

Thank you for your comment. But I would like to delete rows containing "the same particular phrase more than 5 times" so it is not necessarily "AAA" only. I would like to delete rows containing "BBB" more than five times as well as "CCC" and such. Is there a way to recognize the same string in python? — , Mar 24 '20 at 10:54
@wannabeAIengineer - Not 100% sure how need filtering, so answer was edited with all solutions, please check it. — jezrael, Mar 24 '20 at 11:32

How to delete rows that contain the same string again and again

1 Answers1