
I'm working with a pandas DataFrame and I want to remove duplicate rows based on the ID column, but among the duplicate rows, I want to keep the one whose Value column is non-empty.

I know of `.drop_duplicates(subset="ID", keep="first")`, but that would keep duplicate rows whenever the Value cells differ.

Input Table:

| ID | Value |
|----|-------|
| A  | qwer  |
| B  | asdf  |
| A  |       |
| C  |       |

Output Table:

| ID | Value |
|----|-------|
| A  | qwer  |
| B  | asdf  |
| C  |       |

Thanks

Anon Li
  • `df.drop_duplicates(subset=['A','qwer'])`? – wwnde Feb 24 '21 at 21:19
  • @wwnde I can't do that for every row though...:/ – Anon Li Feb 24 '21 at 21:22
  • `@wwnde I can't do that for every row though...:/` why can't you do it that way? Is there something I missed? – wwnde Feb 24 '21 at 21:24
  • @wwnde wouldn't that only apply to the first row of my example Input Table? What about other rows where there's duplicate ID but not duplicate Value? – Anon Li Feb 24 '21 at 21:24
  • Edited after your question edit: `df.drop_duplicates(subset=['ID','Value'])` – wwnde Feb 24 '21 at 21:25
  • Does this answer your question? [Drop all duplicate rows across multiple columns in Python Pandas](https://stackoverflow.com/questions/23667369/drop-all-duplicate-rows-across-multiple-columns-in-python-pandas) – wwnde Feb 24 '21 at 21:26
  • @wwnde no, `df.drop_duplicates(subset=['ID','Value'])` would keep both the first row and the last row of the Input Table, no? – Anon Li Feb 24 '21 at 21:27
  • `df1=df.drop_duplicates(subset=['ID','Value'])` or `df.drop_duplicates(subset=['ID','Value'], inplace=True)` – wwnde Feb 24 '21 at 21:29
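As the asker points out in the comments, `drop_duplicates(subset=['ID','Value'])` does not solve this. A quick sketch (assuming the empty Value cells are NaN) shows why: `('A', 'qwer')` and `('A', NaN)` are distinct (ID, Value) pairs, so neither row is dropped.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'ID': ['A', 'B', 'A', 'C'],
    'Value': ['qwer', 'asdf', np.nan, np.nan],
})

# every (ID, Value) pair is unique, so all four rows survive
print(df.drop_duplicates(subset=['ID', 'Value']))
```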

1 Answer


I believe this follows from your logic:

```python
# mark the duplicated rows (all occurrences after the first)
duplicated = df['ID'].duplicated()

# non-NaN rows
# consider `.ne('')` instead if your empty cells are blank strings
not_empty = df['Value'].notna()

# keep rows that are either not duplicated or non-NaN,
# i.e. remove rows that are **both** duplicates and empty
df[(~duplicated) | not_empty]
```
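A quick runnable check of this answer against the example table (assuming the empty Value cells are NaN):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'ID': ['A', 'B', 'A', 'C'],
    'Value': ['qwer', 'asdf', np.nan, np.nan],
})

duplicated = df['ID'].duplicated()   # [False, False, True, False]
not_empty = df['Value'].notna()      # [True, True, False, False]

# only the second 'A' row is both a duplicate and empty, so it is dropped
result = df[(~duplicated) | not_empty]
print(result)
```

This matches the desired output table: `A qwer`, `B asdf`, and `C` with an empty Value.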
Quang Hoang