I hope you're doing well. I want to drop duplicate rows based on some conditions.

For example:

    A   B   C   D   E
0   foo 2   3   4   100
1   foo 2   3   1   3
2   foo 2   3   5   nan
3   bar 1   2   8   nan
4   bar 1   2   1   nan

The result should be

    A   B   C   D   E
0   foo 2   3   4   100
1   foo 2   3   1   3
2   bar 1   2   nan nan

So we have duplicated rows (based on columns A, B, and C). First we check the value in column E: if it's nan, we drop the row. But if all values in column E are nan within a group (like rows 3 and 4 for the name 'bar'), we should keep one row and set its value in column D to nan.
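
For reference, here is the sample above built directly as a DataFrame (nan meaning a missing value):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'foo', 'foo', 'bar', 'bar'],
    'B': [2, 2, 2, 1, 1],
    'C': [3, 3, 3, 2, 2],
    'D': [4, 1, 5, 8, 1],
    'E': [100, 3, np.nan, np.nan, np.nan],
})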

Thanks in advance.

1 Answer

This works: after parsing the sample, collect the index of rows duplicated on A, B, C and E (the groups where E is entirely nan) and the index of A/B/C duplicates whose E is nan, set D to nan for the first set, drop the second set minus the first, and finally drop the remaining A/B/C/D duplicates.

import pandas as pd
import numpy as np
import io

table = """
    A   B   C   D   E
0   foo 2   3   4   100
1   foo 2   3   1   3
2   foo 2   3   5   nan
3   bar 1   2   8   nan
4   bar 1   2   1   nan
"""
df = pd.read_table(io.StringIO(table), index_col=0, sep=' ', skipinitialspace=True)

# Rows duplicated on A, B, C and E (nan compares equal here,
# so these are the A/B/C groups in which E is entirely nan)
index_1 = set(df[df.duplicated(['A', 'B', 'C', 'E'], keep=False)].index)

# Rows duplicated on A, B, C whose E is nan
index_2 = set(df[df.duplicated(['A', 'B', 'C'], keep=False) & df['E'].isna()].index)

# In the all-nan groups, blank out D so the duplicates collapse to one row
df.loc[list(index_1), 'D'] = np.nan

# Drop the nan-E rows among the A/B/C duplicates, except those in the all-nan groups
df.drop(index_2 - index_1, inplace=True)

# Drop other duplicates
df.drop_duplicates(['A','B','C','D'], inplace=True)

print(df)

This is what was required:

     A  B  C    D      E
0  foo  2  3  4.0  100.0
1  foo  2  3  1.0    3.0
3  bar  1  2  NaN    NaN
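
For comparison, the same result can be produced with a groupby mask. This is just a sketch that assumes df is re-read fresh from table exactly as above (the steps above modify df in place) and that the column names are the same:

import io
import numpy as np
import pandas as pd

df = pd.read_table(io.StringIO(table), index_col=0, sep=' ', skipinitialspace=True)

# True for every row whose (A, B, C) group contains only nan in column E
all_nan = df.groupby(['A', 'B', 'C'])['E'].transform(lambda s: s.isna().all())

# In those groups, blank out D so only one representative row survives
df.loc[all_nan, 'D'] = np.nan

# Keep rows with a non-nan E (or belonging to an all-nan group),
# dropping the extra duplicates of the all-nan groups
keep = (df['E'].notna() | all_nan) & ~(all_nan & df.duplicated(['A', 'B', 'C']))
print(df[keep])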