
I want to add a condition to extract the duplicate rows in a DataFrame.

DF

KEY    STAT    NUM      ID
ab      L       3       1678
cd      D       4       23221
ab      D       8       1678
cd      L       0       38754

For duplicate KEY values I need to check the ID: if the ID is the same, keep both rows. If the ID is different, compare NUM and keep the row with the greater NUM value, dropping the other.

Expected OUTPUT

KEY    STAT    NUM      ID
ab      L       3       1678
cd      D       4       23221
ab      D       8       1678
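
For reference, the sample frame can be reconstructed in pandas like this (a minimal sketch; the values are copied from the tables above):

import pandas as pd

DF = pd.DataFrame({
    'KEY':  ['ab', 'cd', 'ab', 'cd'],
    'STAT': ['L', 'D', 'D', 'L'],
    'NUM':  [3, 4, 8, 0],
    'ID':   [1678, 23221, 1678, 38754],
})
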
  • I have edited the scope a bit because I needed to do a merge on the KEY column, so I am planning to pull out both of the duplicate values in the merged table. This is the code I wrote, using @piRSquared's answer as a reference, but I am not getting the desired dataframe.
def func(d):
    if d.KEY.nunique() >= 1:
        if d.ID.nunique() >= 1:
            return d
        else:
            return d.nlargest(1, columns=['NUM'])
    else:
        return d

pd.concat([func(d) for _, d in DF.groupby('KEY')])
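
Note that after DF.groupby('KEY'), every group d holds exactly one KEY value, and any non-empty group has at least one ID value, so both >= 1 checks are always true and the nlargest branch is never reached (the answer below tests == 1 instead). A quick check:

# Every per-KEY group has exactly one KEY value and at least one ID value,
# so both `>= 1` conditions in func() are always true
for key, d in DF.groupby('KEY'):
    print(key, d.KEY.nunique(), d.ID.nunique())
# ab 1 1
# cd 1 2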

table_101

3 Answers


Create a function and apply it to each group:

def f(d):
    if d.ID.nunique() == 1:
        # Every row in the group shares the same ID: keep them all,
        # tagging KEY with STAT so the rows stay distinguishable
        return d.assign(KEY=d.KEY.str.cat(d.STAT, sep='+'))
    else:
        # IDs differ within the group: keep only the row with the largest NUM
        return d.nlargest(1, columns=['NUM'])

pd.concat([f(d) for _, d in df.groupby('KEY')])

    KEY STAT  NUM  ID
0  ab+L    L    3   1
2  ab+D    D    8   1
1    cd    D    4   2
piRSquared
  • "d.ID.nunique() == 1" can you just explain this line because ID is a very complex number and there are 1000 entries I cannot add every id number. – table_101 May 02 '19 at 06:05

Try

# Sort values by NUM
df.sort_values(by='NUM', ascending=False, inplace=True)

# Mark rows whose (KEY, ID) pair occurs more than once
mask = df.groupby(['KEY', 'ID']).ID.transform('count') > 1

# Tag the KEY of those rows with STAT so they stay distinguishable
df.loc[mask, 'KEY'] = df[mask].KEY + '+' + df[mask].STAT

# Among the remaining rows (unique KEY/ID pairs), take the index of the last,
# i.e. lowest-NUM, row per KEY (the frame is sorted by NUM descending)
id_drop = df.loc[~mask].drop_duplicates('KEY', keep='last').index

# Drop those rows so only the largest-NUM row per KEY is kept
df.drop(index=id_drop, inplace=True)

df

    KEY STAT  NUM  ID
2  ab+D    D    8   1
1    cd    D    4   2
0  ab+L    L    3   1
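
As a side note, transform('count') broadcasts each (KEY, ID) group's size back onto its rows, which is what lets the boolean mask line up with the original frame. A quick check against the question's DF (a sketch; the answer above uses its own df with shortened IDs):

# One value per original row: the size of the (KEY, ID) group that row belongs to
sizes = DF.groupby(['KEY', 'ID']).ID.transform('count')
print(sizes.tolist())        # [2, 1, 2, 1] -> only the two ab/1678 rows repeat
print((sizes > 1).tolist())  # [True, False, True, False]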
ResidentSleeper

I found a good answer on counting the frequency of a value in a dataframe column here:

count the frequency that a value occurs in a dataframe column

Basically, to extract duplicates in a dataframe, try:

# create a count column
df['Count'] = df.groupby('your_col')['your_col'].transform('count')
# Filter out Count = 1
df = df[(df.Count > 1)]
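
For example, applied to the question's DF and keyed on KEY, this might look like the sketch below; note that it keeps every row whose KEY repeats and does not apply the NUM/ID tie-breaking from the other answers:

# Count how often each KEY occurs, then keep only the rows whose KEY repeats
DF['Count'] = DF.groupby('KEY')['KEY'].transform('count')
dups = DF[DF.Count > 1]
# In the sample data every KEY appears twice, so all four rows come back
print(dups)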
Arthur D. Howland