delete some rows from a dataframe according to conditions

Question

I have a pandas dataframe that contains reviews. And for each review, I have the different words with a specific score as below:

import pandas as pd
df = pd.DataFrame({
    "review_num": [2,2,2,1,1,1,1,1,3,3],
    "review": ["The second review","The second review","The second review",
               "This is the first review","This is the first review",
               "This is the first review","This is the first review",
               "This is the first review",'No Noo', 'No Noo'],
    "token_num":[1,2,3,1,2,3,4,5,1,2],
    "token":["The","second","review","This","is","the","first","review","No","Noo"],
    "score":[0.3,-0.6,0.4,0.5,0.6,0.7,-0.6,0.4,0.5,0.6]
})
print(df)

I need to delete the lines that present the max score for each review. They are defined by the code below:

token_max_score = df.groupby('review_num',sort=False)['score'].idxmax()

Any idea, please ?

I realized that, I changed the duplicate. You need `df.drop(token_max_score)` ;) — mozway, Mar 02 '23 at 10:48

jezrael · Accepted Answer · 2023-03-02T10:46:44.887

If need remove first maximum values per groups use:

df1 = df.drop(df.groupby('review_num',sort=False)['score'].idxmax())
print (df1)
   review_num                    review  token_num   token  score
0           2         The second review          1     The    0.3
1           2         The second review          2  second   -0.6
3           1  This is the first review          1    This    0.5
4           1  This is the first review          2      is    0.6
6           1  This is the first review          4   first   -0.6
7           1  This is the first review          5  review    0.4
8           3                    No Noo          1      No    0.5

If need remove all maximus per groups here is ouput same:

df2 = df[df['score'].ne(df.groupby('review_num',sort=False)['score'].transform('max'))]
print (df2)
   review_num                    review  token_num   token  score
0           2         The second review          1     The    0.3
1           2         The second review          2  second   -0.6
3           1  This is the first review          1    This    0.5
4           1  This is the first review          2      is    0.6
6           1  This is the first review          4   first   -0.6
7           1  This is the first review          5  review    0.4
8           3                    No Noo          1      No    0.5

Difference is if duplicated values:

df = pd.DataFrame({
    "review_num": [2,2,2,1,1,1,1,1,3,3],
    "review": ["The second review","The second review","The second review",
               "This is the first review","This is the first review",
               "This is the first review","This is the first review",
               "This is the first review",'No Noo', 'No Noo'],
    "token_num":[1,2,3,1,2,3,4,5,1,2],
    "token":["The","second","review","This","is","the","first","review","No","Noo"],
    "score":[0.4,0.4,0.4,0.5,0.6,0.7,-0.6,0.4,0.5,0.6]
})
print(df)
   review_num                    review  token_num   token  score
0           2         The second review          1     The    0.4 <-multiple max per group 2
1           2         The second review          2  second    0.4 <-multiple max per group 2
2           2         The second review          3  review    0.4 <-multiple max per group 2
3           1  This is the first review          1    This    0.5
4           1  This is the first review          2      is    0.6
5           1  This is the first review          3     the    0.7
6           1  This is the first review          4   first   -0.6
7           1  This is the first review          5  review    0.4
8           3                    No Noo          1      No    0.5
9           3                    No Noo          2     Noo    0.6

df1 = df.drop(df.groupby('review_num',sort=False)['score'].idxmax())
print (df1)
   review_num                    review  token_num   token  score
1           2         The second review          2  second    0.4
2           2         The second review          3  review    0.4
3           1  This is the first review          1    This    0.5
4           1  This is the first review          2      is    0.6
6           1  This is the first review          4   first   -0.6
7           1  This is the first review          5  review    0.4
8           3                    No Noo          1      No    0.5

df2 = df[df['score'].ne(df.groupby('review_num',sort=False)['score'].transform('max'))]
print (df2)
   review_num                    review  token_num   token  score
3           1  This is the first review          1    This    0.5
4           1  This is the first review          2      is    0.6
6           1  This is the first review          4   first   -0.6
7           1  This is the first review          5  review    0.4
8           3                    No Noo          1      No    0.5

Thank you @jezrael. In my case I need to drop just one line – SLA Mar 02 '23 at 11:56 — SLA, Mar 02 '23 at 11:56

delete some rows from a dataframe according to conditions

1 Answers1