
What I have:

df

Name |Vehicle
Dave |Car
Mark |Bike
Steve|Car
Dave |
Steve|
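
For reference, a minimal sketch to reproduce this frame (assuming the blank Vehicle cells are real missing values, not empty strings):

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Name": ["Dave", "Mark", "Steve", "Dave", "Steve"],
    "Vehicle": ["Car", "Bike", "Car", np.nan, np.nan],
})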

I want to drop duplicates from the Name column, but only if the corresponding value in the Vehicle column is null. I know I can use

df.drop_duplicates(subset=['Name'])

with keep='first' or keep='last', but what I am looking for is a way to drop duplicates from the Name column where the corresponding value of the Vehicle column is null. So basically, keep the Name if the Vehicle column is NOT null and drop the rest. If a name does not have a duplicate, then keep that row even if the corresponding value in Vehicle is null.
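
For example, with a hypothetical extra unique name ('John', my addition) and Dave's rows reordered so the blank comes first, keep='first' keeps the wrong Dave row, even though it keeps John correctly:

probe = pd.DataFrame({
    "Name": ["Dave", "Dave", "John"],
    "Vehicle": [np.nan, "Car", np.nan],
})

# keeps Dave's NaN row (first occurrence) and drops Dave's Car row;
# John is kept, but only because he happens to be unique
print(probe.drop_duplicates(subset=['Name'], keep='first'))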

Many Thanks

ah bon

2 Answers


I think you need to chain 2 masks with bitwise OR (|), using Series.notna and Series.duplicated:

# m1: rows where Vehicle is not missing
m1 = df['Vehicle'].notna()
# m2: rows that are the first occurrence of their Name
m2 = ~df['Name'].duplicated()

# keep a row if it has a Vehicle or is the first occurrence of its Name
df1 = df[m1 | m2]
print(df1)
    Name Vehicle
0   Dave     Car
1   Mark    Bike
2  Steve     Car
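
As a quick check (the 'John' row is my hypothetical addition, not part of the original answer), a name without a duplicate is kept even when its Vehicle is null:

chk = df.copy()
chk.loc[5] = ['John', None]

m1 = chk['Vehicle'].notna()
m2 = ~chk['Name'].duplicated()
# John survives: his Vehicle is null, but his Name is not duplicated
print(chk[m1 | m2])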

If you want these operations separately, first remove all NaN rows and then remove duplicates, to avoid testing for duplicates among the NaN rows (if necessary):

df2 = df.dropna(subset=['Vehicle']).drop_duplicates('Name')
print(df2)
    Name Vehicle
0   Dave     Car
1   Mark    Bike
2  Steve     Car
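
One caveat (my note, not part of the original answer): this two-step variant also drops a name whose only row has a null Vehicle, because dropna removes every null row before duplicates are even considered. With the hypothetical chk frame from above, John disappears:

print(chk.dropna(subset=['Vehicle']).drop_duplicates('Name'))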
jezrael

This will filter out both None and empty values (provided the name has at least one non-None, non-empty value, that is), keeping just the first encountered Vehicle value per Name:

import pandas as pd

df = pd.DataFrame({"Name": ["Dave", "Mark", "Steve", "Dave", "Steve"], "Vehicle": ["Car", "Bike", "Car", None, ""]})

# sorting descending pushes "" behind real values, and NaN always sorts last;
# first() then skips NaN, so each Name gets its first real Vehicle if one exists
res = df.sort_values("Vehicle", ascending=False).groupby("Name")["Vehicle"].first().reset_index()
print(res)

Output:

    Name Vehicle
0   Dave     Car
1   Mark    Bike
2  Steve     Car
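
A quick check with a hypothetical extra row (my addition, not part of the answer): a name whose only Vehicle entry is missing is still retained, just with a missing value:

chk = df.copy()
chk.loc[5] = ["John", None]
print(chk.sort_values("Vehicle", ascending=False).groupby("Name")["Vehicle"].first().reset_index())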
Grzegorz Skibinski