
I have a dataframe with duplicate entries coming from two sources. All the values should be unique, but one column was not formatted the same in both sources, so I need to remove duplicates that differ only in that column, and only when its values are within a given list.

Technically, I would like to remove a row from a pandas dataframe if there exists another row with the same A and B values, but only if this row's Z value is 'bar' and the other row's Z value is 'foo'.

An example might be clearer:

I have the following dataframe df:

 A     B     Z

'a'   'a'   'foo'
'a'   'a'   'bar'
'b'   'a'   'bar'
'c'   'c'   'foo'
'd'   'd'   'blb'

And I would like to get

 A     B     Z

'a'   'a'   'foo'
'b'   'a'   'bar'
'c'   'c'   'foo'
'd'   'd'   'blb'

Note that:

  • Rows with values other than 'foo' and 'bar' in the Z column should not be touched.
  • It doesn't matter whether 'foo' or 'bar' is kept, because they will both be changed to the same value afterwards.
  • It would be great to generalize the pair 'foo' and 'bar' to an arbitrary list.

Attempts so far: here is my best guess, though it doesn't work… I don't really understand what groupby returns. Also, I'm sure there is some magical pandas one-liner I just can't find.

new_df = []
for row in df.groupby('A'):
    if row['Z'].isin('foo'):
        if not row['Z'].isin('bar'):
            new_df.append(row)

Thanks!

Silver Duck
  • Possible duplicate of [Removing duplicates in lists](http://stackoverflow.com/questions/7961363/removing-duplicates-in-lists) – Adonis Apr 07 '17 at 09:48

2 Answers


I think you can get the expected result by concatenating two subsets of the original dataframe:

  • one where Z values are neither foo nor bar
  • and the other one where duplicates according to A and B are dropped

Here's an example that gives me the expected output:

data = """ A     B     Z
a   a   foo
a   a   bar
b   a   bar
c   c   foo
d   d   blb"""
df = pd.read_csv(StringIO(data),sep='\s+')

ls = ['foo', 'bar']
df1 = pd.concat((df.loc[~df.Z.isin(ls)],  # no foos or bars here
                 df.loc[df.Z.isin(ls)].drop_duplicates(subset=['A', 'B'])  # one row per (A, B)
                 )).sort_index()
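
Running this on the sample frame should reproduce the desired output (row indices are those of the original frame):

df1
#    A  B    Z
# 0  a  a  foo
# 2  b  a  bar
# 3  c  c  foo
# 4  d  d  blb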

An even simpler option might be to replace 'foo' with 'bar' everywhere in Z and then simply drop duplicates:

df1 = df.replace({'Z':{'foo':'bar'}}).drop_duplicates()

You could even replace both 'foo' and 'bar' with the value you're actually going to use afterwards:

df1 = df.replace({'Z':{'foo':'xyz', 'bar':'xyz'}}).drop_duplicates()
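
On the sample frame this should collapse the two duplicate rows into one; here 'xyz' is just a stand-in for whatever value you actually intend to use:

df1
#    A  B    Z
# 0  a  a  xyz
# 2  b  a  xyz
# 3  c  c  xyz
# 4  d  d  blb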
gereleth

I'd do this using groupby to simulate checking for duplicates (as you had thought). Instead of grouping over A alone, group over A and B, then work with each individual grouped DataFrame to check whether both foo and bar appear in Z.

import pandas as pd

df = pd.DataFrame()
df['A'] = ['a', 'a', 'b', 'c', 'd']
df['B'] = ['a', 'a', 'a', 'c', 'd']
df['Z'] = ['foo', 'bar', 'bar', 'foo', 'blib']

VALUES_PRESENT_TO_DROP = ['foo', 'bar']

# Simulate `df.duplicated(subset=['A', 'B'], keep=False)` by grouping on A and B
grouped = df.groupby(['A', 'B'])

# Start a list of the final DFs to keep, will append to this
dfs_to_keep = []

# PLEASE SEE EDIT BELOW
# We're not interested in the values of the group, just each df
for _, grouped_df in grouped:
    values_in_col = grouped_df['Z'].unique()
    # Check that all the required values to drop are present
    if all((val in values_in_col for val in VALUES_PRESENT_TO_DROP)):
        # Append just the first row
        dfs_to_keep.append(grouped_df.iloc[[0]])
    else:
        dfs_to_keep.append(grouped_df)

# Combine all into final, deduped DF
df_final = pd.concat(dfs_to_keep).sort_index()

df_final
   A  B     Z
0  a  a   foo
2  b  a   bar
3  c  c   foo
4  d  d  blib

EDIT: Just realised this line: "The rows with other values than 'foo' and 'bar' in the Z column should not be touched."

This requirement needs a bit of a change in the logic; the revised loop below slots into the section above:

# We're not interested in the values of the group, just each df
for _, grouped_df in grouped:
    mask_possibly_edit = grouped_df['Z'].isin(VALUES_PRESENT_TO_DROP)
    # Always keep rows that do not have the specified values in `Z`
    dfs_to_keep.append(grouped_df[~mask_possibly_edit])

    df_possibly_dedupe = grouped_df[mask_possibly_edit]
    values_in_col = df_possibly_dedupe['Z'].unique()
    # Check that all the required values to drop are present
    if all((val in values_in_col for val in VALUES_PRESENT_TO_DROP)):
        # Append just the first row
        dfs_to_keep.append(df_possibly_dedupe.iloc[[0]])
    else:
        dfs_to_keep.append(df_possibly_dedupe)
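
The set-up and the final concat are unchanged from the first version, so on the sample frame the revised loop should again give:

df_final = pd.concat(dfs_to_keep).sort_index()

df_final
#    A  B     Z
# 0  a  a   foo
# 2  b  a   bar
# 3  c  c   foo
# 4  d  d  blib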
bastewart
  • The answer above is much better btw, no needless grouping! – bastewart Apr 07 '17 at 10:19
  • Thanks for taking the time though, I still learned some features with your example :) – Silver Duck Apr 07 '17 at 10:31
  • You're welcome, the [duplicated](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.duplicated.html) function (used by `drop_duplicates`) is quite useful generally. It can generate a boolean `Series` indicating whether each row is a duplicate. It can work on a subset of columns (as above) and can optionally leave the first/last instance unmarked, or mark all of them. – bastewart Apr 07 '17 at 10:52
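
For illustration, a minimal sketch of what that last comment describes, using a small frame like the one in the question (output shown in comments):

import pandas as pd

df = pd.DataFrame({'A': ['a', 'a', 'b'],
                   'B': ['a', 'a', 'a'],
                   'Z': ['foo', 'bar', 'bar']})

# keep=False marks every member of a duplicate group, not just the later copies
print(df.duplicated(subset=['A', 'B'], keep=False))
# 0     True
# 1     True
# 2    False
# dtype: bool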