
I'm working with a pandas DataFrame and I want to remove duplicate rows based on the ID column, but among the duplicate rows, I want to keep the one whose Value column is non-empty.

I know of `.drop_duplicates(subset="ID", keep="first")`, but that would keep duplicate rows whenever the Value cells differ.

Input Table:

| ID | Value |
|----|-------|
| A  | qwer  |
| B  | asdf  |
| A  |       |
| C  |       |

Output Table:

| ID | Value |
|----|-------|
| A  | qwer  |
| B  | asdf  |
| C  |       |

Thanks

Anon Li
  • `df.drop_duplicates(subset=['A','qwer'])`? – wwnde Feb 24 '21 at 21:19
  • @wwnde I can't do that for every row though...:/ – Anon Li Feb 24 '21 at 21:22
  • `@wwnde I can't do that for every row though...:/` why can't you do it that way? Is there something I missed? – wwnde Feb 24 '21 at 21:24
  • @wwnde wouldn't that only apply to the first row of my example Input Table? What about other rows where there's duplicate ID but not duplicate Value? – Anon Li Feb 24 '21 at 21:24
  • Edited after your question edit: `df.drop_duplicates(subset=['ID','Value'])` – wwnde Feb 24 '21 at 21:25
  • Does this answer your question? [Drop all duplicate rows across multiple columns in Python Pandas](https://stackoverflow.com/questions/23667369/drop-all-duplicate-rows-across-multiple-columns-in-python-pandas) – wwnde Feb 24 '21 at 21:26
  • @wwnde no, `df.drop_duplicates(subset=['ID','Value'])` would keep both the first row and the last row of the Input Table, no? – Anon Li Feb 24 '21 at 21:27
  • `df1=df.drop_duplicates(subset=['ID','Value'])` or `df.drop_duplicates(subset=['ID','Value'], inplace=True)` – wwnde Feb 24 '21 at 21:29
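As the asker points out in the comments, `drop_duplicates(subset=['ID','Value'])` does not solve this. A quick sketch (assuming the empty Value cells are NaN) shows why: `('A', 'qwer')` and `('A', NaN)` are distinct (ID, Value) pairs, so neither row is dropped.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'ID': ['A', 'B', 'A', 'C'],
    'Value': ['qwer', 'asdf', np.nan, np.nan],
})

# every (ID, Value) pair is unique, so all four rows survive
print(df.drop_duplicates(subset=['ID', 'Value']))
```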

1 Answer


I believe this follows from your logic:

```python
# mark the duplicated rows (all occurrences after the first)
duplicated = df['ID'].duplicated()

# non-NaN rows
# consider `.ne('')` instead if your empty cells are blank strings
not_empty = df['Value'].notna()

# keep rows that are either not duplicated or non-NaN,
# i.e. remove rows that are **both** duplicates and empty
df[(~duplicated) | not_empty]
```
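A quick runnable check of this answer against the example table (assuming the empty Value cells are NaN):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'ID': ['A', 'B', 'A', 'C'],
    'Value': ['qwer', 'asdf', np.nan, np.nan],
})

duplicated = df['ID'].duplicated()   # [False, False, True, False]
not_empty = df['Value'].notna()      # [True, True, False, False]

# only the second 'A' row is both a duplicate and empty, so it is dropped
result = df[(~duplicated) | not_empty]
print(result)
```

This matches the desired output table: `A qwer`, `B asdf`, and `C` with an empty Value.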
Quang Hoang