0

I want to drop rows in a pandas dataframe where value in one column A is duplicate and value in some other column B is not a duplicate given A. An illustrative example:

df = pd.DataFrame({'A': ['cat', 'dog', 'cat', 'cat', 'bat'],
                   'B': ['x', 'y', 'x', 'z', 'z'],
                   'C': [10, 20, 30, 40, 50]})

enter image description here

The desired output has Row 3 dropped as "cat" is duplicate and "z" is distinct in "cat" rows:

enter image description here

df.drop_duplicates('A') removes Rows 2 and 3, and df.drop_duplicates(subset=['A', 'B']) removes Row 2; neither is what I am looking for here [1].

Reveille
  • 4,359
  • 3
  • 23
  • 46

2 Answers2

2

Maybe this:

df[(~df.duplicated('A')) | df.duplicated(['A','B'])]

Output:

     A  B   C
0  cat  x  10
1  dog  y  20
2  bat  z  30
3  cat  x  40
Quang Hoang
  • 146,074
  • 10
  • 56
  • 74
  • It seems to work. I try it on the original dataset too to see it it works there too. Thanks! (also, I had a typo in the question code that I just corrected) – Reveille Dec 13 '19 at 19:59
2

IIUC transform

df[df.B==df.groupby('A').B.transform('first')]
Out[11]: 
     A  B   C
0  cat  x  10
1  dog  y  20
2  cat  x  30
4  bat  z  50
BENY
  • 317,841
  • 20
  • 164
  • 234