4

I have a dataframe with two columns, "Agent" and "Client". Each row corresponds to an interaction between an agent and a client.

I want to keep only the rows where a client had interactions with at least 2 agents.

How can I do that?
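For concreteness, a small made-up example of the setup described above might look like this (the column names are taken from the question, the values are invented, and the nunique-based filter shown is just one way to express the requirement):

import pandas as pd

# Hypothetical toy data (made up for illustration): one row per
# Agent-Client interaction.
df = pd.DataFrame({
    "Agent":  ["A1", "A2", "A1", "A1", "A3"],
    "Client": ["C1", "C1", "C2", "C2", "C3"],
})

# Keep rows whose client interacted with at least 2 distinct agents.
keep = df.groupby("Client")["Agent"].transform("nunique") >= 2
print(df[keep])
#   Agent Client
# 0    A1     C1
# 1    A2     C1

Only client C1 is seen with two different agents here, so only its rows survive the filter.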

cs95
Hannah Mse
  • You should use group by in pandas to do this: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html – Menglong Li Sep 17 '17 at 12:06
  • Does this answer your question? [How do I get a list of all the duplicate items using pandas in python?](https://stackoverflow.com/questions/14657241/how-do-i-get-a-list-of-all-the-duplicate-items-using-pandas-in-python) – Ryan Feb 22 '22 at 16:27

2 Answers

7

Worth adding that you can now use df.duplicated():

df = df.loc[df.duplicated(subset='Agent', keep=False)]
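As written, this keeps rows whose Agent value is duplicated. To match the question as stated (clients seen with at least 2 distinct agents), a minimal adaptation, assuming the same column names, could first drop repeated Agent/Client pairs and then test the Client column:

# Sketch: a client still duplicated after removing repeated Agent/Client
# pairs has interacted with at least 2 distinct agents.
pairs = df.drop_duplicates(subset=['Agent', 'Client'])
multi_agent_clients = pairs.loc[pairs.duplicated(subset='Client', keep=False), 'Client']
df = df[df['Client'].isin(multi_agent_clients)]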
Davis
1

Use groupby and transform with 'value_counts':

df[df.Agent.groupby(df.Agent).transform('value_counts') > 1]

Note that, as mentioned in the comments, you might have one agent interacting with the same client multiple times. Such rows would be retained as false positives. If you do not want this, you can add a drop_duplicates call before filtering:

df = df.drop_duplicates()
df = df[df.Agent.groupby(df.Agent).transform('value_counts') > 1]

print(df)
   A  B
0  1  2
1  2  5
2  3  1
3  4  1
4  5  5
5  6  1

mask = df.B.groupby(df.B).transform('value_counts') > 1
print(mask)
0    False
1     True
2     True
3     True
4     True
5     True
Name: B, dtype: bool

df = df[mask]
print(df)
   A  B
1  2  5
2  3  1
3  4  1
4  5  5
5  6  1
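For completeness, here is a rough translation of the same idea back to the question's columns, grouping on "Client" since the question asks for clients seen with at least 2 agents (column names are assumed from the question, and transform('size') is used as a simple per-group row count):

# Sketch: after dropping repeated Agent/Client pairs, the group size per
# client equals its number of distinct agents.
dedup = df.drop_duplicates()
keep = dedup.groupby('Client')['Agent'].transform('size') > 1
df = df[df['Client'].isin(dedup.loc[keep, 'Client'])]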
cs95
  • The OP doesn't mention it, but might want to cater for the case where there's repeated interactions with a single agent which would currently be considered (possibly incorrectly)... Might want to throw in `.drop_duplicates()` before grouping... – Jon Clements Sep 17 '17 at 12:22
  • @JonClements Sure, sounds reasonable. Added in. – cs95 Sep 17 '17 at 12:29