0

I have a dataframe like this:

id    company    ......
111   A
222   B
333   B
111   E
444   C
555   C
555   C
333   A
111   A
222   D
444   C

and I would like to get the rows where the id occurs in the same company at least twice. So the result would be:

id    company    .......
111   A
444   C
555   C
555   C
111   A
444   C

Although id 222 was there twice it was with a different company so it is removed. id 111 was there 3 times but only twice with the same company. So only the 2 rows from that company remain. And so on.

Rows can occur with the same company more than twice.

There are some stackoverflow questions which deal with selecting rows where a value appears more than once (such as How to select rows in Pandas dataframe where value appears more than once) but I cannot find any that deal with an index + column pair occuring more than once.

daragh
  • 487
  • 9
  • 18

1 Answers1

1

Use duplicated:

import pandas as pd

df = pd.DataFrame({'id': {0: 111, 1: 222, 2: 333, 3: 111, 4: 444, 5: 555, 6: 555, 7: 333, 8: 111, 9: 222, 10: 444},
                   'company': {0: 'A', 1: 'B', 2: 'B', 3: 'E', 4: 'C', 5: 'C', 6: 'C', 7: 'A', 8: 'A', 9: 'D', 10: 'C'}})

print (df[df.duplicated(subset=["id","company"],keep=False)])

#
     id company
0   111       A
4   444       C
5   555       C
6   555       C
8   111       A
10  444       C
Henry Yik
  • 22,275
  • 4
  • 18
  • 40
  • 1
    TS was mentioning about index not particular columns in dataframe – AnswerSeeker Oct 07 '19 at 10:09
  • I had to reset the index beforehand (`df.reset_index(inplace=True)`) but it worked then. Also used `df.set_index('index', inplace=True)` to get my original index back – daragh Oct 07 '19 at 10:11