
I want to add a condition to extract the duplicate rows in a DataFrame.

DF

KEY    STAT    NUM      ID
ab      L       3       1678
cd      D       4       23221
ab      D       8       1678
cd      L       0       38754

For duplicate KEY values I need to check the ID: if the ID is the same, keep both rows. If the ID is different, compare NUM and keep the row with the greater NUM value, dropping the other.

Expected OUTPUT

KEY    STAT    NUM      ID
ab      L       3       1678
cd      D       4       23221
ab      D       8       1678
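
For reference, the sample frame can be reconstructed in pandas like this (a minimal sketch; the values are copied from the tables above):

import pandas as pd

DF = pd.DataFrame({
    'KEY':  ['ab', 'cd', 'ab', 'cd'],
    'STAT': ['L', 'D', 'D', 'L'],
    'NUM':  [3, 4, 8, 0],
    'ID':   [1678, 23221, 1678, 38754],
})
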
  • I have edited the scope a bit because I needed to do a merge on the KEY column, so I am planning to pull out both of the duplicate values in the merged table. This is the code I wrote, using @piRSquared's answer as a reference, but I am not getting the desired dataframe.
def func(d):
    if d.KEY.nunique() >= 1:
        if d.ID.nunique() >= 1:
            return d
        else:
            return d.nlargest(1, columns=['NUM'])
    else:
        return d

pd.concat([func(d) for _, d in DF.groupby('KEY')])
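
Note that after DF.groupby('KEY'), every group d holds exactly one KEY value, and any non-empty group has at least one ID value, so both >= 1 checks are always true and the nlargest branch is never reached (the answer below tests == 1 instead). A quick check:

# Every per-KEY group has exactly one KEY value and at least one ID value,
# so both `>= 1` conditions in func() are always true
for key, d in DF.groupby('KEY'):
    print(key, d.KEY.nunique(), d.ID.nunique())
# ab 1 1
# cd 1 2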

table_101

3 Answers


Create a function and apply it to each group:

def f(d):
    if d.ID.nunique() == 1:
        # Every row in the group shares the same ID: keep them all,
        # tagging KEY with STAT so the rows stay distinguishable
        return d.assign(KEY=d.KEY.str.cat(d.STAT, sep='+'))
    else:
        # IDs differ within the group: keep only the row with the largest NUM
        return d.nlargest(1, columns=['NUM'])

pd.concat([f(d) for _, d in df.groupby('KEY')])

    KEY STAT  NUM  ID
0  ab+L    L    3   1
2  ab+D    D    8   1
1    cd    D    4   2
piRSquared
  • "d.ID.nunique() == 1" can you just explain this line because ID is a very complex number and there are 1000 entries I cannot add every id number. – table_101 May 02 '19 at 06:05

Try

# Sort values by NUM
df.sort_values(by='NUM', ascending=False, inplace=True)

# Mark rows whose (KEY, ID) pair occurs more than once
mask = df.groupby(['KEY', 'ID']).ID.transform('count') > 1

# Tag the KEY of those rows with STAT so they stay distinguishable
df.loc[mask, 'KEY'] = df[mask].KEY + '+' + df[mask].STAT

# Among the remaining rows (unique KEY/ID pairs), take the index of the last,
# i.e. lowest-NUM, row per KEY (the frame is sorted by NUM descending)
id_drop = df.loc[~mask].drop_duplicates('KEY', keep='last').index

# Drop those rows so only the largest-NUM row per KEY is kept
df.drop(index=id_drop, inplace=True)

df

    KEY STAT  NUM  ID
2  ab+D    D    8   1
1    cd    D    4   2
0  ab+L    L    3   1
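
As a side note, transform('count') broadcasts each (KEY, ID) group's size back onto its rows, which is what lets the boolean mask line up with the original frame. A quick check against the question's DF (a sketch; the answer above uses its own df with shortened IDs):

# One value per original row: the size of the (KEY, ID) group that row belongs to
sizes = DF.groupby(['KEY', 'ID']).ID.transform('count')
print(sizes.tolist())        # [2, 1, 2, 1] -> only the two ab/1678 rows repeat
print((sizes > 1).tolist())  # [True, False, True, False]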
ResidentSleeper

I found a good answer on counting the frequency of a value in a dataframe column here:

count the frequency that a value occurs in a dataframe column

Basically, to extract duplicates in a dataframe, try:

# create a count column
df['Count'] = df.groupby('your_col')['your_col'].transform('count')
# Filter out Count = 1
df = df[(df.Count > 1)]
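
For example, applied to the question's DF and keyed on KEY, this might look like the sketch below; note that it keeps every row whose KEY repeats and does not apply the NUM/ID tie-breaking from the other answers:

# Count how often each KEY occurs, then keep only the rows whose KEY repeats
DF['Count'] = DF.groupby('KEY')['KEY'].transform('count')
dups = DF[DF.Count > 1]
# In the sample data every KEY appears twice, so all four rows come back
print(dups)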
Arthur D. Howland