
I have a pandas df:

from collections import defaultdict
import pandas as pd

data = {'sample': ['R1', 'R1', 'R2', 'R3', 'R3'],
    'number': [1, 1, 1, 1, 2],
    'pos': [323, 323, 410, 71, 918],
    'type': ['a', 'b', 'a', 'a', 'c']}

vars = pd.DataFrame(data)

I want to remove rows where the sample, number and pos fields match those of another row.

To do this I am incrementing the count of a defaultdict using the sample, number and pos fields as a key, and then removing rows where this count is > 1:

seen = defaultdict(int)
print(vars)

for index, variant in vars.iterrows():
    key = '_'.join([variant['sample'], str(variant['number']), str(variant['pos'])])
    seen[key] += 1
    if seen[key] > 1:
        print("Seen this before: %s" % key)
        vars.drop(index, inplace=True)

print(vars)

This works as expected but I feel like I am somewhat missing the point of pandas by iterating over rows like this. Is there a more panda-native way of achieving the same thing?


2 Answers


You can use:

vars = vars.drop_duplicates()
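One caveat worth noting: with no arguments, drop_duplicates() compares all columns, so on the sample data it would keep both R1 rows (they differ in type). To match the loop, which keys only on sample, number and pos, passing those columns as subset should work:

```python
import pandas as pd

data = {'sample': ['R1', 'R1', 'R2', 'R3', 'R3'],
        'number': [1, 1, 1, 1, 2],
        'pos': [323, 323, 410, 71, 918],
        'type': ['a', 'b', 'a', 'a', 'c']}
vars = pd.DataFrame(data)

# Keep the first row for each (sample, number, pos) combination,
# mirroring the defaultdict approach which drops rows once the
# count for that key exceeds 1.
deduped = vars.drop_duplicates(subset=['sample', 'number', 'pos'])
print(deduped)
```

This drops the second R1 row (type 'b'), just as the iterrows() loop does. keep='first' is the default; keep='last' or keep=False (drop all duplicated rows) are also available.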

You could try to use pandas.DataFrame.drop_duplicates().
