Hi, I want to remove duplicate entries within a single column of my data sheet. Is there a short way to do that while writing my output file? This is a Python 3 script:
with open(input) as infile, open(output, 'w') as outfile:
    reader = csv.reader(infile, delimiter='\t')
    writer = csv.writer(outfile, delimiter='\t')
    for gg, poss, codee, ref, alt, *rest in reader:
        gg = int(gg)
        poss = int(poss)
        writer.writerow([gg, poss, codee, d[group][poss-1], ref + ',' + alt] + rest)
In the last line, I want the column built from "ref + ',' + alt" to contain no duplicate values. With the code above, I get output like:
A,B,B,C
G,G,A,T
G,A,A
T,T
I want this to be:
A,B,C
G,A,T
G,A
T
Is there a short expression that I can put in the last line, or should I write a separate block of code to do this? Please help me! Thank you.
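A minimal sketch of one way this could be done, assuming "ref" and "alt" are comma-separated strings (for example "A" and "B,B,C"). The helper name dedupe_alleles is hypothetical and not part of the original script. It splits the combined string on commas, drops repeats while keeping first-seen order, and joins the result back into one string:

```python
def dedupe_alleles(ref, alt):
    """Join ref and alt, dropping repeated values but keeping first-seen order."""
    # Assumption: ref and alt are comma-separated strings such as "A" and "B,B,C".
    values = (ref + ',' + alt).split(',')
    # dict.fromkeys keeps insertion order (Python 3.7+), so duplicates vanish
    # while the first occurrence of each value stays in place.
    return ','.join(dict.fromkeys(values))


print(dedupe_alleles('A', 'B,B,C'))  # A,B,C
print(dedupe_alleles('G', 'G,A,T'))  # G,A,T
print(dedupe_alleles('T', 'T'))      # T
```

If this fits the data, the call dedupe_alleles(ref, alt) could take the place of ref + ',' + alt inside writer.writerow.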
EDIT with OrderedDict
from collections import OrderedDict
with open(notmatch) as infile, open(two, 'w') as outfile:
    reader = csv.reader(infile, delimiter='\t')
    writer = csv.writer(outfile, delimiter='\t')
    for gg, poss, codee, ref, alt, *rest in reader:
        gg = int(gg)
        poss = int(poss)
        cls = ref + alt
        clss = list(OrderedDict.fromkeys(cls))
        writer.writerow([gg, poss, codee, d[gg][poss-1], clss] + rest)
So I used OrderedDict, and it gives me output like the following for the "clss" column:
['A','B','C']
['G','A','T']
['G','A']
['T']
The 5th column, which is the concatenation of "ref" and "alt", is the only column where I want to apply this deduplication, so I wrote my script this way. Everything looks good, but there are brackets "[" "]" and apostrophes "'" in each cell. How should I modify my code so they are not there?
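The brackets and apostrophes most likely come from passing a list as a cell: csv.writer converts non-string values with str(), so the list's printed form ends up in the file. A small sketch of the difference, writing to an in-memory buffer instead of a real file; the string 'GGAT' is a made-up stand-in for ref + alt, and single-character values are an assumption:

```python
import csv
import io
from collections import OrderedDict

# 'GGAT' is a hypothetical stand-in for ref + alt.
clss = list(OrderedDict.fromkeys('GGAT'))  # ['G', 'A', 'T']

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter='\t')
writer.writerow([1, 2, clss])            # list passed as-is: written via str()
writer.writerow([1, 2, ','.join(clss)])  # joined first: plain text in the cell

rows = buffer.getvalue().splitlines()
print(rows[0])  # third cell is ['G', 'A', 'T']
print(rows[1])  # third cell is G,A,T
```

Under that assumption, replacing clss with ','.join(clss) in the writerow call should give cells without brackets or apostrophes.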