I have a dataframe with duplicate entries coming from two sources, all the values should be unique, but one column was not formatted the same, hence I should remove duplicate with different names in one column, but only if the names are within a list.
Technically, I would like to remove a row in a pandas dataframe if there exist another row with the same A
and B
values, but only if this row’s Z
value is 'bar'
and the other’s 'Z' is 'foo'
.
An example might be clearer:
I have the given dataframe df
A B Z
'a' 'a' 'foo'
'a' 'a' 'bar'
'b' 'a' 'bar'
'c' 'c' 'foo'
'd' 'd' 'blb'
And I would like to get
A B Z
'a' 'a' 'foo'
'b' 'a' 'bar'
'c' 'c' 'foo'
'd' 'd' 'blb'
Note that:
- The rows with other values than
'foo'
and'bar'
in theZ
column should not be touched. - It’s not important if
'foo'
and'bar'
stay the same because they will get changed to the same value afterwards. - It would be great to generalize the duo
'foo'
and'bar'
as a list.
Attempts so far: Here is my best guess, it doesn’t work though… I don’t understand much what groupby returns. Also I’m sure there is some magical pandas one-liner I just can’t find.
new_df = []
for row in df.groupby('A'):
if rowloc['Z'].isin('foo'):
if not row['Z'].isin('bar'):
new_df.append(row)
Thanks !