I have a pandas df:
from collections import defaultdict
import pandas as pd
data = {'sample': ['R1', 'R1', 'R2', 'R3', 'R3'],
        'number': [1, 1, 1, 1, 2],
        'pos': [323, 323, 410, 71, 918],
        'type': ['a', 'b', 'a', 'a', 'c']}
vars = pd.DataFrame(data)
I want to remove rows where the sample, number, and pos fields already exist in another row. To do this I am incrementing a count in a defaultdict, using the sample, number, and pos fields as a key, and then dropping rows where this count is > 1:
seen = defaultdict(int)
print(vars)
for index, variant in vars.iterrows():
    # Build a composite key from the three fields that define a duplicate
    key = '_'.join([variant['sample'], str(variant['number']), str(variant['pos'])])
    seen[key] += 1
    if seen[key] > 1:
        print("Seen this before: %s" % key)
        vars.drop(index, inplace=True)
print(vars)
This works as expected, but I feel like I am somewhat missing the point of pandas by iterating over rows like this. Is there a more pandas-native way of achieving the same thing?
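
For what it's worth, this is roughly the kind of one-liner I imagine should exist; a sketch using drop_duplicates on a subset of columns (the deduped name is just for illustration, and I haven't verified this against my real data):

import pandas as pd

data = {'sample': ['R1', 'R1', 'R2', 'R3', 'R3'],
        'number': [1, 1, 1, 1, 2],
        'pos': [323, 323, 410, 71, 918],
        'type': ['a', 'b', 'a', 'a', 'c']}
vars = pd.DataFrame(data)

# Keep only the first occurrence of each (sample, number, pos) combination;
# keep='first' is the default and should match the behaviour of my loop above.
deduped = vars.drop_duplicates(subset=['sample', 'number', 'pos'], keep='first')
print(deduped)

If that is the idiomatic approach I'm happy to switch to it, but I'd like to confirm it behaves the same as the loop (keeping the first occurrence and dropping later ones).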