Drop non-unique values in a range of columns based on a condition from a different range of columns

Question

This is a small part of a df.

In this case, I have 3 y-values I need to map: 0.933883, 97.658330 and 1.650013

I have this df

      x  y1  y2         y3         y4          d1  d2         d3         d4
23  5.3 NaN NaN   0.933883        NaN         NaN NaN   0.174866        NaN
25  5.3 NaN NaN        NaN  97.658330         NaN NaN        NaN   0.038670
26  5.3 NaN NaN   1.650013        NaN         NaN NaN   0.541264        NaN
29  5.3 NaN NaN  97.658330        NaN         NaN NaN  96.549581        NaN
30  5.3 NaN NaN        NaN   1.650013         NaN NaN        NaN  96.046987

Each of these values appears at most once per column; I have already dropped duplicates within each column.

What I need:

I cannot have the same value in more than one column.

The condition to choose which row to remove is as shown in this example:

97.658330 appears in both y3 and y4. Since, for that value, d3 (96.549581) is bigger than d4 (0.038670), row 29 is removed.

1.650013 appears in both y3 and y4. Since d4 (96.046987) is bigger than d3 (0.541264), row 30 is removed.

Output:

      x  y1  y2         y3         y4          d1  d2         d3         d4
23  5.3 NaN NaN   0.933883        NaN         NaN NaN   0.174866        NaN
25  5.3 NaN NaN        NaN  97.658330         NaN NaN        NaN   0.038670
26  5.3 NaN NaN   1.650013        NaN         NaN NaN   0.541264        NaN

P.S. There are a lot more values to map inside the complete data frame.

mozway · Accepted Answer · 2022-11-20T22:34:16.717


You can use:

y = df.filter(regex=r'y\d+')
d = df.filter(regex=r'd\d+')

# target = [0.933883, 97.658330, 1.650013]

# define the target values automatically
s = y.stack()
target = set(s[s.duplicated()])
# {1.650013, 97.65833}

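# collect the index labels of the rows to remove
drop = set()
for x in target:
    # d values at the positions where y equals x, indexed by row label
    s = d.where(y.eq(x).to_numpy()).stack().droplevel(1)
    # keep the row with the smallest d, mark the others for removal
    drop.update(s.index.difference([s.idxmin()]))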

# drop is {29, 30}

out = df.drop(drop)

Output:

      x  y1  y2        y3        y4  d1  d2        d3       d4
23  5.3 NaN NaN  0.933883       NaN NaN NaN  0.174866      NaN
25  5.3 NaN NaN       NaN  97.65833 NaN NaN       NaN  0.03867
26  5.3 NaN NaN  1.650013       NaN NaN NaN  0.541264      NaN
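
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same approach. The sample frame is rebuilt from the question. The function name `drop_shared_values`, the anchored regexes and the extra `.dropna()` calls are additions for illustration; the `.dropna()` calls are a guard, on the assumption that some pandas versions keep NaN entries when stacking.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Sample frame rebuilt from the question (index labels 23, 25, 26, 29, 30).
df = pd.DataFrame(
    {
        "x": [5.3] * 5,
        "y1": [nan] * 5,
        "y2": [nan] * 5,
        "y3": [0.933883, nan, 1.650013, 97.658330, nan],
        "y4": [nan, 97.658330, nan, nan, 1.650013],
        "d1": [nan] * 5,
        "d2": [nan] * 5,
        "d3": [0.174866, nan, 0.541264, 96.549581, nan],
        "d4": [nan, 0.038670, nan, nan, 96.046987],
    },
    index=[23, 25, 26, 29, 30],
)


def drop_shared_values(frame):
    """Hypothetical helper: for every y value found more than once,
    keep only the row whose matching d value is smallest."""
    y = frame.filter(regex=r"^y\d+$")
    d = frame.filter(regex=r"^d\d+$")

    # Values that occur more than once anywhere in the y columns.
    s = y.stack().dropna()
    targets = set(s[s.duplicated()])

    drop = set()
    for value in targets:
        # d values at the positions where y equals this value,
        # indexed by row label only.
        dist = d.where(y.eq(value).to_numpy()).stack().dropna().droplevel(1)
        # Every row except the one with the smallest d gets dropped.
        drop.update(dist.index.difference([dist.idxmin()]))

    return frame.drop(index=list(drop))


out = drop_shared_values(df)
print(out)
```

This relies on the y and d columns appearing in the same order (y1..y4 next to d1..d4), since the boolean mask is applied to `d` by position.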

edited Nov 20 '22 at 22:34

answered Nov 20 '22 at 21:54


I'm sorry, I now realize I didn't formulate my question correctly. This is just part of a df with around 40 rows, so there are a lot more values to map than those 3. – Peter M Nov 20 '22 at 22:25
You can add as many values as you like in `target`. Or do you want to define those automatically? – mozway Nov 20 '22 at 22:26
Yeah, the program is supposed to do everything automatically without ever inserting numeric values in the code. – Peter M Nov 20 '22 at 22:29
Maybe there is a way to get a variable with all unique values in a df and then use target = [variable]? – Peter M Nov 20 '22 at 22:32

Bushmaster · Answer 2 · 2022-11-20T21:49:37.340

There may be a more efficient solution, but this works. First, collect the values that columns y3 and y4 have in common in a list. Then, for each common value, look up the d3 and d4 values of the rows that hold it (v1, v2). Finally, drop the row with the larger of the two by its index label.

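# sorted non-NaN values of y3 and y4; a value present in both columns appears twice in a row
vals = sorted(df[['y3', 'y4']].stack().dropna())
dupes = list(set(vals[::2]) & set(vals[1::2]))  # https://stackoverflow.com/a/64956890/15415267
# dupes = [1.650013, 97.65833]

for i in dupes:
    v1 = df.loc[df['y3'] == i, 'd3'].iloc[0]  # d3 of the row holding i in y3
    v2 = df.loc[df['y4'] == i, 'd4'].iloc[0]  # d4 of the row holding i in y4
    # drop the row with the larger d value
    if v1 > v2:
        df = df.drop(df[df['y3'] == i].index)
    else:
        df = df.drop(df[df['y4'] == i].index)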
print(df)
'''
      x  y1  y2        y3        y4  d1  d2        d3       d4
23  5.3 NaN NaN  0.933883       NaN NaN NaN  0.174866      NaN
25  5.3 NaN NaN       NaN  97.65833 NaN NaN       NaN  0.03867
26  5.3 NaN NaN  1.650013       NaN NaN NaN  0.541264      NaN
'''

Thank you! I do have one problem though. On the part of the df I extracted there are only common values between `y3` and `y4`, but on the full df there can be common values between all of the 4 columns (`y1` to `y4`). I have no clue how to adapt your answer to the 4 columns. — Peter M, Nov 20 '22 at 21:51
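
Regarding the four-column case: one possible way to cover `y1` to `y4` at once is to reshape the data to long form and compare each d value with the smallest d recorded for the same y value. This is only a sketch, assuming that each `y{n}` column pairs with `d{n}` and that both sets of columns appear in the same order. The function name `drop_shared_values_long` is hypothetical, and the sample frame is rebuilt from the question.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Sample frame rebuilt from the question.
df = pd.DataFrame(
    {
        "x": [5.3] * 5,
        "y1": [nan] * 5,
        "y2": [nan] * 5,
        "y3": [0.933883, nan, 1.650013, 97.658330, nan],
        "y4": [nan, 97.658330, nan, nan, 1.650013],
        "d1": [nan] * 5,
        "d2": [nan] * 5,
        "d3": [0.174866, nan, 0.541264, 96.549581, nan],
        "d4": [nan, 0.038670, nan, nan, 96.046987],
    },
    index=[23, 25, 26, 29, 30],
)


def drop_shared_values_long(frame):
    """Hypothetical helper working on every y/d column pair at once."""
    y = frame.filter(regex=r"^y\d+$")
    d = frame.filter(regex=r"^d\d+$")

    # One record per (row, column pair): row label, y value, d value.
    # ravel() walks the arrays row by row, matching np.repeat on the index.
    long = pd.DataFrame(
        {
            "row": np.repeat(frame.index.to_numpy(), y.shape[1]),
            "y": y.to_numpy().ravel(),
            "d": d.to_numpy().ravel(),
        }
    ).dropna(subset=["y", "d"])

    # Smallest d seen for each y value, aligned with the long table.
    best = long.groupby("y")["d"].transform("min")

    # Rows whose d is strictly larger than the best one lose.
    losers = long.loc[long["d"] > best, "row"].unique()
    return frame.drop(index=losers)


out = drop_shared_values_long(df)
print(out)
```

Note that with the strict `>` comparison, rows that tie on the smallest d are all kept; a tie-breaking rule would have to be added if that matters for the full data.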
