
I have a dataframe with duplicate entries coming from two sources. All the values should be unique, but one column was not formatted the same in both sources, so I need to remove duplicates that differ only in that column, and only when its values are within a given list.

Technically, I would like to remove a row from a pandas dataframe if there exists another row with the same A and B values, but only if this row's Z value is 'bar' and the other row's Z value is 'foo'.

An example might be clearer:

I have the following dataframe df:

 A     B     Z

'a'   'a'   'foo'
'a'   'a'   'bar'
'b'   'a'   'bar'
'c'   'c'   'foo'
'd'   'd'   'blb'

And I would like to get

 A     B     Z

'a'   'a'   'foo'
'b'   'a'   'bar'
'c'   'c'   'foo'
'd'   'd'   'blb'

Note that:

  • Rows with values other than 'foo' and 'bar' in the Z column should not be touched.
  • It doesn't matter whether 'foo' or 'bar' is kept, because they will both be changed to the same value afterwards.
  • It would be great to generalize the pair 'foo' and 'bar' to an arbitrary list.

Attempts so far: here is my best guess, though it doesn't work… I don't really understand what groupby returns. Also, I'm sure there is some magical pandas one-liner I just can't find.

new_df = []
for row in df.groupby('A'):
    if row['Z'].isin('foo'):
        if not row['Z'].isin('bar'):
            new_df.append(row)

Thanks!

Silver Duck
  • Possible duplicate of [Removing duplicates in lists](http://stackoverflow.com/questions/7961363/removing-duplicates-in-lists) – Adonis Apr 07 '17 at 09:48

2 Answers


I think you can get the expected result by concatenating two subsets of the original dataframe:

  • one where Z values are neither foo nor bar
  • and the other one where duplicates according to A and B are dropped

Here's an example that gives me the expected output:

data = """ A     B     Z
a   a   foo
a   a   bar
b   a   bar
c   c   foo
d   d   blb"""
df = pd.read_csv(StringIO(data),sep='\s+')

ls = ['foo', 'bar']
df1 = pd.concat((df.loc[~df.Z.isin(ls)],  # no foos or bars here
                 df.loc[df.Z.isin(ls)].drop_duplicates(subset=['A', 'B'])  # one row per (A, B)
                 )).sort_index()
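
Running this on the sample frame should reproduce the desired output (row indices are those of the original frame):

df1
#    A  B    Z
# 0  a  a  foo
# 2  b  a  bar
# 3  c  c  foo
# 4  d  d  blb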

An even simpler option might be to replace 'foo' with 'bar' everywhere in Z and then simply drop duplicates:

df1 = df.replace({'Z':{'foo':'bar'}}).drop_duplicates()

You could even replace both 'foo' and 'bar' with the value you're actually going to use afterwards:

df1 = df.replace({'Z':{'foo':'xyz', 'bar':'xyz'}}).drop_duplicates()
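
On the sample frame this should collapse the two duplicate rows into one; here 'xyz' is just a stand-in for whatever value you actually intend to use:

df1
#    A  B    Z
# 0  a  a  xyz
# 2  b  a  xyz
# 3  c  c  xyz
# 4  d  d  blb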
gereleth

I'd do this using groupby to simulate checking for duplicates (as you had thought). Instead of grouping over A alone, group over A and B, then work with each individual grouped DataFrame to check whether both foo and bar appear in Z.

import pandas as pd

df = pd.DataFrame()
df['A'] = ['a', 'a', 'b', 'c', 'd']
df['B'] = ['a', 'a', 'a', 'c', 'd']
df['Z'] = ['foo', 'bar', 'bar', 'foo', 'blib']

VALUES_PRESENT_TO_DROP = ['foo', 'bar']

# Simulate `df.duplicated(subset=['A', 'B'], keep=False)` by grouping on A and B
grouped = df.groupby(['A', 'B'])

# Start a list of the final DFs to keep, will append to this
dfs_to_keep = []

# PLEASE SEE EDIT BELOW
# We're not interested in the values of the group, just each df
for _, grouped_df in grouped:
    values_in_col = grouped_df['Z'].unique()
    # Check that all the required values to drop are present
    if all((val in values_in_col for val in VALUES_PRESENT_TO_DROP)):
        # Append just the first row
        dfs_to_keep.append(grouped_df.iloc[[0]])
    else:
        dfs_to_keep.append(grouped_df)

# Combine all into final, deduped DF
df_final = pd.concat(dfs_to_keep).sort_index()

df_final
   A  B     Z
0  a  a   foo
2  b  a   bar
3  c  c   foo
4  d  d  blib

EDIT: Just realised this line: "The rows with other values than 'foo' and 'bar' in the Z column should not be touched."

This requirement needs a bit of a change in the logic; the revised loop below slots into the section above:

# We're not interested in the values of the group, just each df
for _, grouped_df in grouped:
    mask_possibly_edit = grouped_df['Z'].isin(VALUES_PRESENT_TO_DROP)
    # Always keep rows that do not have the specified values in `Z`
    dfs_to_keep.append(grouped_df[~mask_possibly_edit])

    df_possibly_dedupe = grouped_df[mask_possibly_edit]
    values_in_col = df_possibly_dedupe['Z'].unique()
    # Check that all the required values to drop are present
    if all((val in values_in_col for val in VALUES_PRESENT_TO_DROP)):
        # Append just the first row
        dfs_to_keep.append(df_possibly_dedupe.iloc[[0]])
    else:
        dfs_to_keep.append(df_possibly_dedupe)
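
The set-up and the final concat are unchanged from the first version, so on the sample frame the revised loop should again give:

df_final = pd.concat(dfs_to_keep).sort_index()

df_final
#    A  B     Z
# 0  a  a   foo
# 2  b  a   bar
# 3  c  c   foo
# 4  d  d  blib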
bastewart
  • The answer above is much better btw, no needless grouping! – bastewart Apr 07 '17 at 10:19
  • Thanks for taking the time though, I still learned some features with your example :) – Silver Duck Apr 07 '17 at 10:31
  • You're welcome, the [duplicated](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.duplicated.html) function (used by `drop_duplicates`) is quite useful generally. It can generate a boolean `Series` indicating whether each row is a duplicate. It can work on a subset of columns (as above) and can optionally leave the first/last instance unmarked, or mark all of them. – bastewart Apr 07 '17 at 10:52
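
For illustration, a minimal sketch of what that last comment describes, using a small frame like the one in the question (output shown in comments):

import pandas as pd

df = pd.DataFrame({'A': ['a', 'a', 'b'],
                   'B': ['a', 'a', 'a'],
                   'Z': ['foo', 'bar', 'bar']})

# keep=False marks every member of a duplicate group, not just the later copies
print(df.duplicated(subset=['A', 'B'], keep=False))
# 0     True
# 1     True
# 2    False
# dtype: bool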