
I have a pandas df like the one below.

In the df below, duplicate values in the x & y columns appear at indexes 0,1 and 2,3 ... and 500,501,502. Then a second round starts with the same duplicate values in x & y at indexes 1000,1001 and 1002,1003 ... and 1200,1201 ... and so on, but with different values in the weight column.

index     x         y         weight
0         59.644    10.72     0.69
1         59.644    10.72     0.82
2         57.822    10.13     0.75
3         57.822    10.13     0.68
4         57.822    10.13     0.20
.
.
500       53.252    10.85     0.15
501       53.252    10.85     0.95
502       53.252    10.85     0.69
.
.
1000      59.644    10.72     0.85
1001      59.644    10.72     0.73
1002      57.822    10.13     0.92
1003      57.822    10.13     0.15
.
.
.
1200       53.252    10.85     0.78
1201       53.252    10.85     1.098        

My requirement

I would like my df to:
1) Avoid repeated/duplicate row values in x & y that have a weight value less than 0.60.

2) Where duplicates in the x & y columns still remain, compare the weight values between the duplicate rows and remove the rows with the lesser weight.

3) If I use the code below, it removes all the duplicates in x & y, treating every repeat as one group:

df_2.groupby(['X', 'Y'], as_index=False,sort=False)['weight'].max()

But I want to compare the duplicates within the first run and remove rows there, then within the 2nd run, then the 3rd, and so on, so that the duplicate values still reappear some rows later. For better understanding, please refer to the required df below.

How the df should look:

index     x         y         weight
1         59.644    10.72     0.82
2         57.822    10.13     0.75
.
.
501      53.252    10.85      0.95
.
.
1000      59.644    10.72     0.85
.
1002      57.822    10.13     0.92
.
.
1201       53.252    10.85     1.098   
.
.

I have tried using if statements, but the number of lines of code grows quickly; a sketch of the kind of loop I mean is below. I believe there should be a more Pythonic alternative that makes this easier (a built-in function or numpy). Any help would be appreciated.
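
For reference, a hypothetical sketch of the verbose if/loop approach I mean (it assumes df already holds the data above, with columns x, y and weight):

# walk the rows, detect where the (x, y) pair changes, and keep only the
# heaviest row of each consecutive run
keep = []
best_idx, best_w = None, None
prev_xy = None
for idx, row in df.iterrows():
    xy = (row['x'], row['y'])
    if xy != prev_xy:                       # a new consecutive run starts
        if best_idx is not None:
            keep.append(best_idx)           # flush the previous run's best row
        best_idx, best_w, prev_xy = idx, row['weight'], xy
    elif row['weight'] > best_w:            # same run, heavier row found
        best_idx, best_w = idx, row['weight']
if best_idx is not None:
    keep.append(best_idx)                   # flush the last run

df_result = df.loc[keep]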

Mari
  • Use `df.groupby(['x', 'y'], as_index=False, sort=False)['weight'].max()` – Erfan May 23 '19 at 15:07
  • Please find the edited question, I forgot to mention some required parameters @Erfan – Mari May 24 '19 at 08:21
  • @Mari - What happens if all values per `x` and `y` group are `<0.6`? – jezrael May 24 '19 at 08:29
  • There is no group of `x` & `y` where all values are less than 0.6. Most values fall above 0.6; maybe 1 or 2 per group are less than 0.6. @jezrael – Mari May 24 '19 at 08:37
  • @Mari - OK, so it means there is at least one value `>=0.6` per group? Then `df.groupby(['x', 'y'], as_index=False, sort=False)['weight'].max()` works nicely, because the max value is always `>=0.6`. Or is something missing? – jezrael May 24 '19 at 08:40
  • 1
  • After his edit I understand it won't work: look at index 500 and 1200, both have the same values in x and y, so they will be treated as one group in the `groupby`. He wants to see them as two different groups. @jezrael – Erfan May 24 '19 at 08:58
  • @Erfan - OK, so reopened. – jezrael May 24 '19 at 09:03
  • Some groups have only values not less than 0.6 (i.e. all their values are greater than 0.6), but some groups do have values below 0.6. @jezrael – Mari May 24 '19 at 09:06
  • @Mari - Can you add/change the values in `weight` for these groups, together with the expected output? – jezrael May 24 '19 at 09:07
  • Do you mean that I should manually change the group weights? Is that what you mean? @jezrael – Mari May 24 '19 at 09:13
  • @Mari - I mean in the sample data in the question. Is that possible? – jezrael May 24 '19 at 09:22
  • OK, I will do that @jezrael – Mari May 24 '19 at 09:26

1 Answer


Like @Erfan mentioned in the comments, it is necessary to group by helper Series here in order to distinguish the consecutive groups:

# helper Series: counters that increase whenever x (or y) changes,
# so each consecutive run of identical values gets its own label
x1 = df['x'].ne(df['x'].shift()).cumsum()
y1 = df['y'].ne(df['y'].shift()).cumsum()

# keep only the rows holding the maximum weight within each consecutive (x, y) run
df = df[df.groupby([x1, y1])['weight'].transform('max') == df['weight']]
print(df)
    index       x      y  weight
1       1  59.644  10.72   0.820
2       2  57.822  10.13   0.750
6     501  53.252  10.85   0.950
8    1000  59.644  10.72   0.850
10   1002  57.822  10.13   0.920
13   1201  53.252  10.85   1.098
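
For reference, a minimal, self-contained sketch of the same idea, using a hypothetical small DataFrame built from the sample values in the question; the `weight < 0.60` filter from requirement 1 is applied first (one reasonable reading of how the two requirements combine):

import pandas as pd

# hypothetical sample built from the values shown in the question
df = pd.DataFrame({
    'x': [59.644, 59.644, 57.822, 57.822, 57.822, 53.252, 53.252, 53.252,
          59.644, 59.644, 57.822, 57.822, 53.252, 53.252],
    'y': [10.72, 10.72, 10.13, 10.13, 10.13, 10.85, 10.85, 10.85,
          10.72, 10.72, 10.13, 10.13, 10.85, 10.85],
    'weight': [0.69, 0.82, 0.75, 0.68, 0.20, 0.15, 0.95, 0.69,
               0.85, 0.73, 0.92, 0.15, 0.78, 1.098],
})

# requirement 1: drop rows whose weight is below 0.60
df = df[df['weight'] >= 0.60]

# label consecutive runs: the counter increases whenever x (or y) changes
x1 = df['x'].ne(df['x'].shift()).cumsum()
y1 = df['y'].ne(df['y'].shift()).cumsum()

# requirement 2: within each consecutive (x, y) run keep only the heaviest row
out = df[df.groupby([x1, y1])['weight'].transform('max') == df['weight']]
print(out)

The surviving weights are 0.82, 0.75, 0.95, 0.85, 0.92 and 1.098, matching the required df in the question.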
jezrael
  • I see these types of questions more and more: people want to groupby, but handle the groups as separate groups when they have a distance in the dataframe. – Erfan May 24 '19 at 10:28