How to keep a specific duplicate from among duplicates?

Question

I have a .csv file.

time,open,high,low,close,Extremum,Fib 1,Fib 2,Fib 3,l100,LS3,SS3,Volume,Volume MA

lots of rows like:

2022-04-08T02:00:00+02:00,43.431,43.44,43.431,43.44,44.669,43.58332033414956,43.28818411430672,43.11250779297169,42.91223678664976,,,78.07,

And there are duplicates, like 4 of them, with differences at "Extremum" column Like this:

2022-04-07 17:10:25,41.622,41.625,41.622,41.625,43.6,42.38191401399852,42.05078384304666,41.85368255081341,41.6289870776675,41.007714285714286,,6.99,571.0029999999954
2022-04-07 17:10:25,41.622,41.625,41.622,41.625,41.589,42.64812186602502,42.93603848979882,43.10741743252131,43.30278942722496,,,6.99,571.0029999999954
2022-04-07 17:10:25,41.622,41.625,41.622,41.625,43.6,42.38191401399852,42.05078384304666,41.85368255081341,41.6289870776675,41.007714285714286,,6.99,571.0029999999954
2022-04-07 17:10:25,41.622,41.625,41.622,41.625,43.6,42.38191401399852,42.05078384304666,41.85368255081341,41.6289870776675,41.007714285714286,,6.99,571.0029999999954

Its sorted by 'time', axis=0 (its Column A, column 0 in calc sheet)

csvData.sort_values(by=["time"],axis=0,ascending=True,inplace=True,na_position='first')

Its 4 dupes at time 17:10:25, how to throw out the not matching one?

Here we have: 41.589, 43.6, 43.6, 43.6. The 41.589 is wrong, and need to be out, and need to keep only 1 copy of remaining 3 dupes (that drop.duplicated can do, but it can't give me 4 dupes to process, it can be set only in 3 ways: keep='first', keep='last' or keep=False, and I need not existent keep=True.. I need return of all 4 dupes, to check which one is that 1 of 4 that's bad to out, before I unique_seen them all, to reduce only to 1, with correct 43.6 in this case. Does someone got any ideas how to achieve that? Saw some ideas here at stack, but can't understand them enough to apply to my case, so I'm kindly asking for help.

you need to keep the 'mode'. https://stackoverflow.com/questions/30216921/how-to-group-by-mode-in-python — Raymond Kwok, Apr 15 '22 at 03:37

score 1 · Accepted Answer · answered Apr 15 '22 at 04:24

You can use twice duplicated with two different modes: keep=False and another of your choice. Then compute a boolean mask from those two for slicing.

Assuming this example dataset:

  date col  other
0    a   a      0
1    a   a      1
2    a   X      2   # unique
3    a   a      3
4    b   Y      4   # unique
5    b   b      5
6    b   b      6
7    b   b      7

You can use:

m1 = df.duplicated(subset=['date','col'])
m2 = df.duplicated(subset=['date','col'], keep=False)

df2 = df[m1!=m2]

Output:

  date col  other
0    a   a      0
5    b   b      5

Intermediates:

  date col  other     m1     m2  m1!=m2
0    a   a      0  False   True    True
1    a   a      1   True   True   False
2    a   X      2  False  False   False
3    a   a      3   True   True   False
4    b   Y      4  False  False   False
5    b   b      5  False   True    True
6    b   b      6   True   True   False
7    b   b      7   True   True   False

YES! That was it! im so happy You answered! Altho got warning about "A value is trying to be a set on a copy of a slice from DataFrame" duno if it need more attention, because it works! — ruskiem, Apr 15 '22 at 17:30
@ruskiem this warning come from a previous step in your code, you created a view of your dataframe during slicing. This can be an issue, or not... There are Q/A on this topic on the site. — mozway, Apr 15 '22 at 17:38
Oh, can't see about what the warning is talking about, its realy only this code: https://pastebin.com/6acxC4jb — ruskiem, Apr 16 '22 at 01:29

How to keep a specific duplicate from among duplicates?

1 Answers1