Dropping rows in pandas if column values are duplicated in more than 2 columns

Question

This question is almost what I need, but I cannot adapt it for my needs.

I have a df with a lot of columns, where the last 8 columns hold mean scores.

Example

  Column1 Column2  Mean1  Mean2  Mean3  Mean4  Mean5  Mean6  Mean7  Mean8
0       A       X     50     50     50     50     50     50     50     50
1       B       Y     20     21     22     23     24     25     26     27
2       C       Z     50     50     50     63     99     54     24     12
3       D       F     40     41     42     43     44     45     46     47

Reprex

import pandas as pd
df = pd.DataFrame({'Column1': {0: 'A', 1: 'B', 2: 'C', 3: 'D'}, 'Column2': {0: 'X', 1: 'Y', 2: 'Z', 3: 'F'}, 'Mean1': {0: 50, 1: 20, 2: 50, 3: 40}, 'Mean2': {0: 50, 1: 21, 2: 50, 3: 41}, 'Mean3': {0: 50, 1: 22, 2: 50, 3: 42}, 'Mean4': {0: 50, 1: 23, 2: 63, 3: 43}, 'Mean5': {0: 50, 1: 24, 2: 99, 3: 44}, 'Mean6': {0: 50, 1: 25, 2: 54, 3: 45}, 'Mean7': {0: 50, 1: 26, 2: 24, 3: 46}, 'Mean8': {0: 50, 1: 27, 2: 12, 3: 47}})

I want to drop every row in which 3 or more of the 8 mean columns share the same value.

Expected output (the first and third rows are dropped because the value 50 appears three or more times in each)

  Column1 Column2  Mean1  Mean2  Mean3  Mean4  Mean5  Mean6  Mean7  Mean8
1       B       Y     20     21     22     23     24     25     26     27
3       D       F     40     41     42     43     44     45     46     47

score 1 · Accepted Answer · answered Aug 28 '21 at 20:42

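# Collect the index labels of rows in which some value occurs 3 or more times.
n = []
for number in df.T.columns.tolist():
    # After transposing, each original row is a column: group by its values and count them.
    if df.T.groupby(number).size().max() >= 3:
        n.append(number)
# drop() returns a new DataFrame; assign the result if you want to keep it.
df.drop(n)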
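
The same check can also be written without an explicit loop. This is a sketch, not part of the original answer: it counts how often each value occurs within a row of the mean columns and keeps the rows whose highest count is below 3. Selecting the columns by the `Mean` prefix, and the names `mean_cols`, `max_repeats`, and `result`, are assumptions made for this example.

```python
import pandas as pd

# Example data from the question.
df = pd.DataFrame({
    "Column1": ["A", "B", "C", "D"],
    "Column2": ["X", "Y", "Z", "F"],
    "Mean1": [50, 20, 50, 40],
    "Mean2": [50, 21, 50, 41],
    "Mean3": [50, 22, 50, 42],
    "Mean4": [50, 23, 63, 43],
    "Mean5": [50, 24, 99, 44],
    "Mean6": [50, 25, 54, 45],
    "Mean7": [50, 26, 24, 46],
    "Mean8": [50, 27, 12, 47],
})

# Assumption: the score columns are the ones whose names start with "Mean".
mean_cols = [c for c in df.columns if c.startswith("Mean")]

# For each row, count the occurrences of every value and take the largest count.
max_repeats = df[mean_cols].apply(lambda row: row.value_counts().max(), axis=1)

# Keep only the rows in which no value appears 3 or more times.
result = df[max_repeats < 3]
print(result)
```

Because only `mean_cols` is inspected, repeated values in any other column do not affect which rows are kept.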

versatile_programmer

Perfect, now I need to apply it to my huge dataframe. Thanks for your time and efforts. – Anakin Skywalker Aug 28 '21 at 20:57
Do I understand correctly that, to check for duplicates only in these 8 columns and leave the other 200 columns untouched (even if they contain duplicates), I should loop over those 8 columns only and then drop the rows from the whole dataframe, like this?
n = list()
for number in df.iloc[:, 220:228].T.columns.tolist():
    if df.iloc[:, 220:228].T.groupby(number).size().max() >= 3:
        n.append(number)
# Drop
foo = df.drop(n)
– Anakin Skywalker Aug 28 '21 at 21:03
First create a new dataframe, df_new = df.iloc[:, 220:228], apply the code to df_new to get n, and finally run df.drop(n). – versatile_programmer Aug 29 '21 at 08:46
Works perfectly, you are the best, thank you so much! – Anakin Skywalker Aug 29 '21 at 13:26
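
The workflow from these comments can be sketched as follows: compute the labels to drop from the mean-column slice only, then drop them from the full dataframe. The toy frame is an assumption for illustration. It has three hypothetical `Other` columns in front of the 8 mean columns, so the slice `3:11` stands in for `220:228`, and row 1 deliberately repeats a value three times outside the slice.

```python
import pandas as pd

# Toy data: three "Other" columns followed by the 8 mean columns.
# Row 1 repeats the value 7 three times, but only outside the mean columns.
df = pd.DataFrame({
    "Other1": [1, 7, 3, 4],
    "Other2": [2, 7, 5, 6],
    "Other3": [3, 7, 8, 9],
    "Mean1": [50, 20, 50, 40],
    "Mean2": [50, 21, 50, 41],
    "Mean3": [50, 22, 50, 42],
    "Mean4": [50, 23, 63, 43],
    "Mean5": [50, 24, 99, 44],
    "Mean6": [50, 25, 54, 45],
    "Mean7": [50, 26, 24, 46],
    "Mean8": [50, 27, 12, 47],
})

# Stand-in for df.iloc[:, 220:228]: here the mean columns sit at positions 3..10.
df_new = df.iloc[:, 3:11]

# Transpose once so that each original row becomes a column.
transposed = df_new.T

n = []
for number in transposed.columns.tolist():
    # Flag the row if any value occurs 3 or more times among its mean columns.
    if transposed.groupby(number).size().max() >= 3:
        n.append(number)

# Drop the flagged rows from the full dataframe, keeping all of its columns.
result = df.drop(n)
print(result)
```

Row 1 survives even though 7 appears three times in its `Other` columns, because only the slice is inspected when building `n`.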
