How to remove duplicate letters in a comma separated cell

Question

Hi, I want to remove duplicate entries from a single column of my datasheet. Is there a short way to do that while writing my output file? This is a Python 3 script.

import csv

with open(input) as infile, open(output, 'w') as outfile:
    reader = csv.reader(infile, delimiter='\t')
    writer = csv.writer(outfile, delimiter='\t')
    for gg, poss, codee, ref, alt, *rest in reader:
        gg = int(gg)
        poss = int(poss)
        writer.writerow([gg, poss, codee, d[gg][poss-1], ref + ',' + alt] + rest)

In the last line, I want the column "ref + ',' + alt" to contain no duplicate values. With the code above, I get output like:

A,B,B,C
G,G,A,T
G,A,A
T,T

I want this to be:

A,B,C
G,A,T
G,A
T

Is there a short expression that I can incorporate into the last line, or should I write a separate block of code to do this? Please help me! Thank you.

EDIT with OrderedDict

from collections import OrderedDict
with open(notmatch) as infile, open(two, 'w') as outfile:
    reader = csv.reader(infile, delimiter='\t')
    writer = csv.writer(outfile, delimiter='\t')
    for gg, poss, codee, ref, alt, *rest in reader:
        gg = int(gg)
        poss = int(poss)
        cls = ref + alt
        clss = list(OrderedDict.fromkeys(cls))
        writer.writerow([gg, poss, codee, d[gg][poss-1], clss] + rest)

So I used OrderedDict, and it gives me output like the following for the column "clss":

['A','B','C']
['G','A','T']
['G','A']
['T']

The 5th column, which is the concatenation of "ref" and "alt", is the only column where I want to apply this deduplication, so I wrote my script this way. Everything looks good, but there are brackets "[" "]" and apostrophes "'" in each cell. How should I modify my code so they are not there?

Yes the order is important. For A,B,B,C,B, it would be A,B,C. However, my dataset only has from 1 letter to 3 letters. — user3546860, Nov 07 '14 at 04:30
If order is important then my below solution will not work, sorry :) — Nilesh, Nov 07 '14 at 04:32
@user3546860 Then `writer.writerow(list(OrderedDict.fromkeys(yourlist)))` should do the trick. — Ashwini Chaudhary, Nov 07 '14 at 04:35
So I tried to use OrderedDict but I have a minor problem. Can you take a look at OP? Thanks — user3546860, Nov 07 '14 at 06:25

score 2 · Answer 1 · answered Nov 07 '14 at 04:31

First, split the string on commas:

>>> a = "A,B,B,C"
>>> a.split(',')
['A', 'B', 'B', 'C']

Then use a set to get the unique values:

>>> set(a.split(','))
{'A', 'C', 'B'}

Then join the values again with a comma. Because a set is unordered, the original order may be lost:

>>> ','.join(set(a.split(',')))
'A,C,B'
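
Since the comments say that order matters, a set alone will not be enough. Here is a minimal sketch of an order-preserving variant, assuming the cell values are comma-separated as in the question. The helper name `dedupe_cell` and the sample rows are hypothetical, made up for illustration, and the `d[gg][poss-1]` column is left out because `d` is not shown in the question. Joining the keys into a string before calling `writerow` is also what avoids the brackets and apostrophes, which appear when a list is written into a cell as-is.

```python
import csv
import io
from collections import OrderedDict


def dedupe_cell(*parts, sep=','):
    """Join comma-separated parts, dropping repeats but keeping first-seen order."""
    items = []
    for part in parts:
        items.extend(part.split(sep))
    # OrderedDict.fromkeys keeps only the first occurrence of each key, in order
    return sep.join(OrderedDict.fromkeys(items))


# Hypothetical rows in the question's layout: gg, poss, codee, ref, alt
rows = [
    ['1', '10', 'x', 'A', 'B,B,C'],
    ['1', '11', 'x', 'G', 'G,A,T'],
    ['2', '5', 'y', 'T', 'T'],
]

out = io.StringIO()  # stands in for the real output file
writer = csv.writer(out, delimiter='\t', lineterminator='\n')
for gg, poss, codee, ref, alt, *rest in rows:
    # Pass a plain string for the cell, not a list
    writer.writerow([int(gg), int(poss), codee, dedupe_cell(ref, alt)] + rest)

result = out.getvalue()
print(result)
```

On Python 3.7 and later, a plain `dict.fromkeys(items)` should behave the same way, since regular dicts keep insertion order there.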

Nilesh
