I have a pandas df:
from collections import defaultdict
import pandas as pd
data = {'sample': ['R1', 'R1', 'R2', 'R3', 'R3'],
        'number': [1, 1, 1, 1, 2],
        'pos': [323, 323, 410, 71, 918],
        'type': ['a', 'b', 'a', 'a', 'c']}
vars = pd.DataFrame(data)
I want to remove rows where the sample, number, and pos fields already exist in another row. To do this I am incrementing a count in a defaultdict, using the sample, number, and pos fields as a key, and then dropping rows where this count is > 1:
seen = defaultdict(int)
print(vars)
for index, variant in vars.iterrows():
    # Build a composite key from the three fields that define a duplicate
    key = '_'.join([variant['sample'], str(variant['number']), str(variant['pos'])])
    seen[key] += 1
    if seen[key] > 1:
        print("Seen this before: %s" % key)
        vars.drop(index, inplace=True)
print(vars)
This works as expected, but I feel like I am somewhat missing the point of pandas by iterating over rows like this. Is there a more pandas-native way of achieving the same thing?
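
For what it's worth, this is roughly the kind of one-liner I imagine should exist; a sketch using drop_duplicates on a subset of columns (the deduped name is just for illustration, and I haven't verified this against my real data):

import pandas as pd

data = {'sample': ['R1', 'R1', 'R2', 'R3', 'R3'],
        'number': [1, 1, 1, 1, 2],
        'pos': [323, 323, 410, 71, 918],
        'type': ['a', 'b', 'a', 'a', 'c']}
vars = pd.DataFrame(data)

# Keep only the first occurrence of each (sample, number, pos) combination;
# keep='first' is the default and should match the behaviour of my loop above.
deduped = vars.drop_duplicates(subset=['sample', 'number', 'pos'], keep='first')
print(deduped)

If that is the idiomatic approach I'm happy to switch to it, but I'd like to confirm it behaves the same as the loop (keeping the first occurrence and dropping later ones).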