4

I have a dataframe with two columns, "Agent" and "Client". Each row corresponds to an interaction between an agent and a client.

I want to keep only the rows where a client had interactions with at least 2 agents.

How can I do that?
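For concreteness, a small made-up example of the setup described above might look like this (the column names are taken from the question, the values are invented, and the nunique-based filter shown is just one way to express the requirement):

import pandas as pd

# Hypothetical toy data (made up for illustration): one row per
# Agent-Client interaction.
df = pd.DataFrame({
    "Agent":  ["A1", "A2", "A1", "A1", "A3"],
    "Client": ["C1", "C1", "C2", "C2", "C3"],
})

# Keep rows whose client interacted with at least 2 distinct agents.
keep = df.groupby("Client")["Agent"].transform("nunique") >= 2
print(df[keep])
#   Agent Client
# 0    A1     C1
# 1    A2     C1

Only client C1 is seen with two different agents here, so only its rows survive the filter.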

cs95
Hannah Mse
  • You should use group by in pandas to do this: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html – Menglong Li Sep 17 '17 at 12:06
  • Does this answer your question? [How do I get a list of all the duplicate items using pandas in python?](https://stackoverflow.com/questions/14657241/how-do-i-get-a-list-of-all-the-duplicate-items-using-pandas-in-python) – Ryan Feb 22 '22 at 16:27

2 Answers

7

Worth adding that you can now use df.duplicated():

df = df.loc[df.duplicated(subset='Agent', keep=False)]
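As written, this keeps rows whose Agent value is duplicated. To match the question as stated (clients seen with at least 2 distinct agents), a minimal adaptation, assuming the same column names, could first drop repeated Agent/Client pairs and then test the Client column:

# Sketch: a client still duplicated after removing repeated Agent/Client
# pairs has interacted with at least 2 distinct agents.
pairs = df.drop_duplicates(subset=['Agent', 'Client'])
multi_agent_clients = pairs.loc[pairs.duplicated(subset='Client', keep=False), 'Client']
df = df[df['Client'].isin(multi_agent_clients)]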
Davis
1

Use groupby and transform with 'value_counts':

df[df.Agent.groupby(df.Agent).transform('value_counts') > 1]

Note that, as mentioned in the comments, you might have one agent interacting with the same client multiple times. Such rows would be retained as false positives. If you do not want this, you can add a drop_duplicates call before filtering:

df = df.drop_duplicates()
df = df[df.Agent.groupby(df.Agent).transform('value_counts') > 1]

print(df)
   A  B
0  1  2
1  2  5
2  3  1
3  4  1
4  5  5
5  6  1

mask = df.B.groupby(df.B).transform('value_counts') > 1
print(mask)
0    False
1     True
2     True
3     True
4     True
5     True
Name: B, dtype: bool

df = df[mask]
print(df)
   A  B
1  2  5
2  3  1
3  4  1
4  5  5
5  6  1
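For completeness, here is a rough translation of the same idea back to the question's columns, grouping on "Client" since the question asks for clients seen with at least 2 agents (column names are assumed from the question, and transform('size') is used as a simple per-group row count):

# Sketch: after dropping repeated Agent/Client pairs, the group size per
# client equals its number of distinct agents.
dedup = df.drop_duplicates()
keep = dedup.groupby('Client')['Agent'].transform('size') > 1
df = df[df['Client'].isin(dedup.loc[keep, 'Client'])]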
cs95
  • The OP doesn't mention it, but might want to cater for the case where there's repeated interactions with a single agent which would currently be considered (possibly incorrectly)... Might want to throw in `.drop_duplicates()` before grouping... – Jon Clements Sep 17 '17 at 12:22
  • @JonClements Sure, sounds reasonable. Added in. – cs95 Sep 17 '17 at 12:29