I hope you're doing well. I want to drop duplicate rows based on some conditions.

For example:

    A   B   C   D   E
0   foo 2   3   4   100
1   foo 2   3   1   3
2   foo 2   3   5   nan
3   bar 1   2   8   nan
4   bar 1   2   1   nan

The result should be

    A   B   C   D   E
0   foo 2   3   4   100
1   foo 2   3   1   3
2   bar 1   2   nan nan

So we have duplicated rows (based on columns A, B, and C). First we check the value in column E: if it's nan, we drop the row. But if all values in column E are nan within a group (like rows 3 and 4 for the name 'bar'), we should keep one row and set its value in column D to nan.
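
For reference, here is the sample above built directly as a DataFrame (nan meaning a missing value):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'foo', 'foo', 'bar', 'bar'],
    'B': [2, 2, 2, 1, 1],
    'C': [3, 3, 3, 2, 2],
    'D': [4, 1, 5, 8, 1],
    'E': [100, 3, np.nan, np.nan, np.nan],
})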

Thanks in advance.

1 Answer

This works: after parsing the sample, collect the index of rows duplicated on A, B, C and E (the groups where E is entirely nan) and the index of A/B/C duplicates whose E is nan, set D to nan for the first set, drop the second set minus the first, and finally drop the remaining A/B/C/D duplicates.

import pandas as pd
import numpy as np
import io

table = """
    A   B   C   D   E
0   foo 2   3   4   100
1   foo 2   3   1   3
2   foo 2   3   5   nan
3   bar 1   2   8   nan
4   bar 1   2   1   nan
"""
df = pd.read_table(io.StringIO(table), index_col=0, sep=' ', skipinitialspace=True)

# Rows duplicated on A, B, C and E (nan compares equal here,
# so these are the A/B/C groups in which E is entirely nan)
index_1 = set(df[df.duplicated(['A', 'B', 'C', 'E'], keep=False)].index)

# Rows duplicated on A, B, C whose E is nan
index_2 = set(df[df.duplicated(['A', 'B', 'C'], keep=False) & df['E'].isna()].index)

# In the all-nan groups, blank out D so the duplicates collapse to one row
df.loc[list(index_1), 'D'] = np.nan

# Drop the nan-E rows among the A/B/C duplicates, except those in the all-nan groups
df.drop(index_2 - index_1, inplace=True)

# Drop other duplicates
df.drop_duplicates(['A','B','C','D'], inplace=True)

print(df)

This is what was required:

     A  B  C    D      E
0  foo  2  3  4.0  100.0
1  foo  2  3  1.0    3.0
3  bar  1  2  NaN    NaN
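
For comparison, the same result can be produced with a groupby mask. This is just a sketch that assumes df is re-read fresh from table exactly as above (the steps above modify df in place) and that the column names are the same:

import io
import numpy as np
import pandas as pd

df = pd.read_table(io.StringIO(table), index_col=0, sep=' ', skipinitialspace=True)

# True for every row whose (A, B, C) group contains only nan in column E
all_nan = df.groupby(['A', 'B', 'C'])['E'].transform(lambda s: s.isna().all())

# In those groups, blank out D so only one representative row survives
df.loc[all_nan, 'D'] = np.nan

# Keep rows with a non-nan E (or belonging to an all-nan group),
# dropping the extra duplicates of the all-nan groups
keep = (df['E'].notna() | all_nan) & ~(all_nan & df.duplicated(['A', 'B', 'C']))
print(df[keep])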