Hi, I want to remove duplicate entries within a single column of my data sheet. Is there a short way to do that while writing my output file? This is a Python 3 script.

import csv

with open(input) as infile, open(output, 'w') as outfile:
    reader = csv.reader(infile, delimiter='\t')
    writer = csv.writer(outfile, delimiter='\t')
    for gg, poss, codee, ref, alt, *rest in reader:
        gg = int(gg)
        poss = int(poss)
        # d is defined earlier in the script (not shown)
        writer.writerow([gg, poss, codee, d[gg][poss-1], ref + ',' + alt] + rest)

In the last line, I want the column built from "ref + ',' + alt" to contain no duplicate values. With the code above, I get output like:

A,B,B,C
G,G,A,T
G,A,A
T,T

I want this to be:

A,B,C
G,A,T
G,A
T

Is there a short expression that I can incorporate into the last line, or should I write a separate block of code to do this? Please help me! Thank you.

EDIT with OrderedDict

from collections import OrderedDict
with open(notmatch) as infile, open(two, 'w') as outfile:
    reader = csv.reader(infile, delimiter='\t')
    writer = csv.writer(outfile, delimiter='\t')
    for gg, poss, codee, ref, alt, *rest in reader:
        gg = int(gg)
        poss = int(poss)
        cls = ref + alt
        clss = list(OrderedDict.fromkeys(cls))
        writer.writerow([gg, poss, codee, d[gg][poss-1], clss] + rest)

I used OrderedDict, and the "clss" column now comes out like this:

['A','B','C']
['G','A','T']
['G','A']
['T']

The 5th column, which is the concatenation of "ref" and "alt", is the only column where I want to apply this deduplication, so I wrote my script this way. Everything looks good, except that each cell contains brackets ("[" and "]") and apostrophes ("'"). How should I modify my code so they do not appear?

user3546860

1 Answer

First, split the string on the comma:

>>> a = "A,B,B,C"
>>> a.split(',')
['A', 'B', 'B', 'C']

Then use a set to get the unique values. Note that a set does not preserve the original order, so the elements may come back in any order:

>>> set(a.split(','))
{'A', 'C', 'B'}

Finally, join the values again with a comma (the order may differ from the input):

>>> ','.join(set(a.split(',')))
'A,C,B'
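
If the original order matters, as in the desired output in the question (A,B,C rather than A,C,B), one possible variation is to deduplicate with dict.fromkeys instead of set, since dictionaries keep insertion order in Python 3.7 and later. Joining the result with ',' also avoids the brackets and apostrophes that appear when a list is written directly into a cell. The following is only a sketch under some assumptions: the helper name dedupe_joined and the sample rows are made up for illustration, and the d[gg][poss-1] column is left out because d is not shown in the question.

```python
import csv
import io


def dedupe_joined(*fields, sep=","):
    """Split each field on sep and drop repeated values, keeping first-seen order."""
    values = []
    for field in fields:
        # Skip empty pieces so a blank field does not add a stray separator.
        values.extend(v for v in field.split(sep) if v)
    # dict.fromkeys keeps insertion order (guaranteed since Python 3.7).
    return sep.join(dict.fromkeys(values))


# Hypothetical sample rows standing in for the tab-delimited input file.
sample = "1\t5\tx\tA\tB,B,C\textra\n2\t7\ty\tG\tG,A,T\textra\n"

out = io.StringIO()
reader = csv.reader(io.StringIO(sample), delimiter="\t")
writer = csv.writer(out, delimiter="\t", lineterminator="\n")
for gg, poss, codee, ref, alt, *rest in reader:
    # Write a plain string, not a list, into the fifth column.
    writer.writerow([int(gg), int(poss), codee, dedupe_joined(ref, alt)] + rest)

result = out.getvalue()
print(result)
```

In the question's loop, the same idea would mean writing dedupe_joined(ref, alt) in place of ref + ',' + alt (or in place of clss in the edited version).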
Nilesh