Pandas / Python remove duplicates based on specific row values

Question

I am trying to remove duplicates based on multiple criteria:

Find duplicates in column df['A'].
Check column df['Status'] and prioritize OK over Open, and Open over Close.
If duplicates have the same status, pick the latest one based on df['Col_1'].

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': ['11', '11', '12', np.nan, '13', '13', '14', '14', '15'],
    'Status': ['OK', 'Close', 'Close', 'OK', 'OK', 'Open', 'Open', 'Open', np.nan],
    'Col_1': [2000, 2001, 2000, 2000, 2000, 2002, 2000, 2004, 2000],
})
df

Expected output:

I have tried different solutions, like the one linked below (using map or loc), but I am unable to find the correct way:

Pandas : remove SOME duplicate values based on conditions

jezrael · Accepted Answer · 2020-11-16T10:03:07.687

1

Create an ordered categorical to prioritize Status, sort by all columns, remove duplicates by column A (keeping the first row of each group), and finally restore the original order by sorting the index:

c = ['OK','Open','Close']
df['Status'] = pd.Categorical(df['Status'], ordered=True, categories=c)

df = df.sort_values(['A','Status','Col_1']).drop_duplicates('A').sort_index()
print(df)
     A Status  Col_1
0   11     OK   2000
2   12  Close   2000
3  NaN     OK   2000
4   13     OK   2000
6   14   Open   2000
8   15    NaN   2000
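
Note that the sort above is ascending on every column, so within the same status the earliest Col_1 is kept (row 6 for A = 14), whereas the question asks for the latest. A minimal sketch of one possible adjustment, assuming Col_1 is numeric as in the sample data, is to sort Col_1 in descending order through the ascending parameter of sort_values. The variable name latest is only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': ['11', '11', '12', np.nan, '13', '13', '14', '14', '15'],
    'Status': ['OK', 'Close', 'Close', 'OK', 'OK', 'Open', 'Open', 'Open', np.nan],
    'Col_1': [2000, 2001, 2000, 2000, 2000, 2002, 2000, 2004, 2000],
})

# Ordered categorical: OK < Open < Close, so OK sorts first
df['Status'] = pd.Categorical(df['Status'],
                              categories=['OK', 'Open', 'Close'],
                              ordered=True)

# A and Status ascending, Col_1 descending, so the first row
# of each A group is the best status with the latest Col_1
latest = (df.sort_values(['A', 'Status', 'Col_1'],
                         ascending=[True, True, False])
            .drop_duplicates('A')
            .sort_index())
print(latest)
```

On the sample data this keeps row 7 (14, Open, 2004) instead of row 6; the other kept rows are unchanged. If Col_1 holds date strings, converting it with pd.to_datetime before sorting would likely be needed for the ordering to be meaningful.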

EDIT: If you need to keep all rows where A is NaN (instead of having them collapsed as duplicates of each other), add a helper column:

df['test'] = df['A'].isna().cumsum()

c = ['OK','Open','Close']
df['Status'] = pd.Categorical(df['Status'], ordered=True, categories=c)

df = (df.sort_values(['A','Status','Col_1', 'test'])
        .drop_duplicates(['A', 'test'])
        .sort_index())
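
One caveat with a plain cumulative count as the helper: rows sharing the same A value that sit on opposite sides of a NaN row get different helper values, so they would not be treated as duplicates. A hedged variant, sketched below on a small made-up frame (not the question's data), sets the helper to 0 for every non-missing A and to a unique running number only for the missing ones. The column name na_id is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up sample: the two '11' rows are separated by a NaN row
df = pd.DataFrame({
    'A': ['11', np.nan, '11', np.nan, '12'],
    'Status': ['Close', 'OK', 'OK', 'Open', 'Open'],
    'Col_1': [2001, 2000, 2000, 2003, 2002],
})
df['Status'] = pd.Categorical(df['Status'],
                              categories=['OK', 'Open', 'Close'],
                              ordered=True)

is_na = df['A'].isna()
# Hypothetical helper: 0 for real keys, 1, 2, ... for each missing key
df['na_id'] = is_na.cumsum().where(is_na, 0)

out = (df.sort_values(['A', 'Status', 'Col_1'])
         .drop_duplicates(['A', 'na_id'])
         .sort_index()
         .drop(columns='na_id'))
print(out)
```

Here both NaN rows survive, while the two '11' rows are still reduced to the one with status OK.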

edited Nov 16 '20 at 10:03

answered Nov 16 '20 at 08:57

jezrael



Thank you very much and is there a way to keep all NaNs in column A if we have multiple NaNs? – Caiotru Nov 16 '20 at 09:41
I just realized that the code works fine apart from the dates: the latest one is not being selected – Caiotru Nov 16 '20 at 10:30
@Caiotru - Can you explain more? – jezrael Nov 16 '20 at 10:31
@Caiotru - One idea - are dates strings? Or datetimes? Or numbers? – jezrael Nov 16 '20 at 10:34
For example, if I have 2 OKs with the same code, I would like the latest date to be picked. I would like to use keep='last' if they are in order – Caiotru Nov 16 '20 at 10:34
@Caiotru - Is it possible to change the data sample so the problem is visible? And please also test whether my solution works with the new data. – jezrael Nov 16 '20 at 10:35
