
I have a pandas df like the one below.

In the df below, duplicate values in the x & y columns appear at indexes 0,1 and 2,3 ... and 500,501,502. Then a second round starts with the same duplicate values in x & y at indexes 1000,1001 and 1002,1003 ... and 1200,1201 ... and so on, but with different values in the weight column.

index     x         y         weight
0         59.644    10.72     0.69
1         59.644    10.72     0.82
2         57.822    10.13     0.75
3         57.822    10.13     0.68
4         57.822    10.13     0.20
.
.
500       53.252    10.85     0.15
501       53.252    10.85     0.95
502       53.252    10.85     0.69
.
.
1000      59.644    10.72     0.85
1001      59.644    10.72     0.73
1002      57.822    10.13     0.92
1003      57.822    10.13     0.15
.
.
.
1200       53.252    10.85     0.78
1201       53.252    10.85     1.098        

My requirement

I would like my df to:
1) Avoid repeated/duplicate row values in x & y that have a weight value less than 0.60.

2) Where duplicates in the x & y columns still remain, compare the weight values between the duplicate rows and remove the rows with the lesser weight.

3) If I use the code below, it removes all the duplicates in x & y, treating every repeat as one group:

df_2.groupby(['X', 'Y'], as_index=False,sort=False)['weight'].max()

But I want to compare the duplicates within the first run and remove rows there, then within the 2nd run, then the 3rd, and so on, so that the duplicate values still reappear some rows later. For better understanding, please refer to the required df below.

How the df should look:

index     x         y         weight
1         59.644    10.72     0.82
2         57.822    10.13     0.75
.
.
501      53.252    10.85      0.95
.
.
1000      59.644    10.72     0.85
.
1002      57.822    10.13     0.92
.
.
1201       53.252    10.85     1.098   
.
.

I have tried using if statements, but the number of lines of code grows quickly; a sketch of the kind of loop I mean is below. I believe there should be a more Pythonic alternative that makes this easier (a built-in function or numpy). Any help would be appreciated.
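
For reference, a hypothetical sketch of the verbose if/loop approach I mean (it assumes df already holds the data above, with columns x, y and weight):

# walk the rows, detect where the (x, y) pair changes, and keep only the
# heaviest row of each consecutive run
keep = []
best_idx, best_w = None, None
prev_xy = None
for idx, row in df.iterrows():
    xy = (row['x'], row['y'])
    if xy != prev_xy:                       # a new consecutive run starts
        if best_idx is not None:
            keep.append(best_idx)           # flush the previous run's best row
        best_idx, best_w, prev_xy = idx, row['weight'], xy
    elif row['weight'] > best_w:            # same run, heavier row found
        best_idx, best_w = idx, row['weight']
if best_idx is not None:
    keep.append(best_idx)                   # flush the last run

df_result = df.loc[keep]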

Mari
  • Use `df.groupby(['x', 'y'], as_index=False, sort=False)['weight'].max()` – Erfan May 23 '19 at 15:07
  • Please find the edited question, I forgot to mention some required parameters @Erfan – Mari May 24 '19 at 08:21
  • @Mari - What happens if all values per `x` and `y` group are `<0.6`? – jezrael May 24 '19 at 08:29
  • There is no group of `x` & `y` where all values are less than 0.6. Most values fall above 0.6; maybe 1 or 2 per group are less than 0.6. @jezrael – Mari May 24 '19 at 08:37
  • @Mari - OK, so it means there is at least one value `>=0.6` per group? Then `df.groupby(['x', 'y'], as_index=False, sort=False)['weight'].max()` works nicely, because the max value is always `>=0.6`. Or is something missing? – jezrael May 24 '19 at 08:40
  • 1
  • After his edit I understand it won't work: look at index 500 and 1200, both have the same values in x and y, so they will be treated as one group in the `groupby`. He wants to see them as two different groups. @jezrael – Erfan May 24 '19 at 08:58
  • @Erfan - OK, so reopened. – jezrael May 24 '19 at 09:03
  • Some groups have only values not less than 0.6 (i.e. all their values are greater than 0.6), but some groups do have values below 0.6. @jezrael – Mari May 24 '19 at 09:06
  • @Mari - Can you add/change the values in `weight` for these groups, together with the expected output? – jezrael May 24 '19 at 09:07
  • Do you mean that I should manually change the group weights? Is that what you mean? @jezrael – Mari May 24 '19 at 09:13
  • @Mari - I mean in the sample data in the question. Is that possible? – jezrael May 24 '19 at 09:22
  • OK, I will do that @jezrael – Mari May 24 '19 at 09:26

1 Answer


Like @Erfan mentioned in the comments, it is necessary to group by helper Series here in order to distinguish the consecutive groups:

# helper Series: counters that increase whenever x (or y) changes,
# so each consecutive run of identical values gets its own label
x1 = df['x'].ne(df['x'].shift()).cumsum()
y1 = df['y'].ne(df['y'].shift()).cumsum()

# keep only the rows holding the maximum weight within each consecutive (x, y) run
df = df[df.groupby([x1, y1])['weight'].transform('max') == df['weight']]
print(df)
    index       x      y  weight
1       1  59.644  10.72   0.820
2       2  57.822  10.13   0.750
6     501  53.252  10.85   0.950
8    1000  59.644  10.72   0.850
10   1002  57.822  10.13   0.920
13   1201  53.252  10.85   1.098
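
For reference, a minimal, self-contained sketch of the same idea, using a hypothetical small DataFrame built from the sample values in the question; the `weight < 0.60` filter from requirement 1 is applied first (one reasonable reading of how the two requirements combine):

import pandas as pd

# hypothetical sample built from the values shown in the question
df = pd.DataFrame({
    'x': [59.644, 59.644, 57.822, 57.822, 57.822, 53.252, 53.252, 53.252,
          59.644, 59.644, 57.822, 57.822, 53.252, 53.252],
    'y': [10.72, 10.72, 10.13, 10.13, 10.13, 10.85, 10.85, 10.85,
          10.72, 10.72, 10.13, 10.13, 10.85, 10.85],
    'weight': [0.69, 0.82, 0.75, 0.68, 0.20, 0.15, 0.95, 0.69,
               0.85, 0.73, 0.92, 0.15, 0.78, 1.098],
})

# requirement 1: drop rows whose weight is below 0.60
df = df[df['weight'] >= 0.60]

# label consecutive runs: the counter increases whenever x (or y) changes
x1 = df['x'].ne(df['x'].shift()).cumsum()
y1 = df['y'].ne(df['y'].shift()).cumsum()

# requirement 2: within each consecutive (x, y) run keep only the heaviest row
out = df[df.groupby([x1, y1])['weight'].transform('max') == df['weight']]
print(out)

The surviving weights are 0.82, 0.75, 0.95, 0.85, 0.92 and 1.098, matching the required df in the question.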
jezrael
  • I see these types of questions more and more: people want to groupby, but handle the groups as separate groups when they have a distance in the dataframe. – Erfan May 24 '19 at 10:28