How to get unique values from a string without removing the delimiter

Question

I have to remove duplicate values from a string in which the values are separated by a delimiter. My sample string is "aa~*yt~*cc~*aa", where ~* is the delimiter, and I need to remove the duplicate occurrence of aa.

I tried using set and also the code below, but they give this output:

"a~*ytc"

However, I need this output:

"aa~*yt~*cc"


d = {}
s = "aa~*yt~*cc~*aa"
res = []
for c in s:
    if c not in d:
        res.append(c)
        d[c] = 1
print("".join(res))

I have gone through many of the answers provided, but was not able to solve this. Please let me know if there is a solution. Thanks, I really appreciate your time :)

but "aa" is not duplicate to "*aa" according to your logic (looking at your code) — Grzegorz Skibinski, Sep 02 '19 at 14:27
What about the asterisk? at the moment we have `aa` and `*aa`. — Dan, Sep 02 '19 at 14:27
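
To illustrate the point raised in these comments, here is a minimal sketch comparing a split on "~" with a split on the full "~*" delimiter, using the string from the question. The variable names parts, unique and result are only illustrative.

```python
s = "aa~*yt~*cc~*aa"

# Splitting on "~" alone leaves the asterisk attached to the values.
print(s.split("~"))   # ['aa', '*yt', '*cc', '*aa']

# Splitting on the full delimiter yields the bare values.
parts = s.split("~*")
print(parts)          # ['aa', 'yt', 'cc', 'aa']

# Keep the first occurrence of each value, then rejoin with the same delimiter.
unique = []
for p in parts:
    if p not in unique:
        unique.append(p)
result = "~*".join(unique)
print(result)         # aa~*yt~*cc
```

The loop in the question iterates over single characters, which is why it returns "a~*ytc"; iterating over the split values instead gives the expected result.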

yatu · Answer 1 · 2019-09-02T14:45:32.260

2

You could split the string by the separator, take the set of the resulting list (to remove duplicates), sort the elements by their order of appearance in the original string, and join them again with ~ as the delimiter:

s = "aa~*yt~*cc~aa"

'~'.join(sorted(set(s.split('~')), key=s.index))
# 'aa~*yt~*cc'

If performance is important, build a dictionary of first-occurrence positions beforehand and use it as the sort key, instead of calling s.index for every element:

l = s.split('~')
length = len(l)
d = {j:length-i for i,j in enumerate(l[::-1])}
# {'aa': 1, '*cc': 3, '*yt': 2}
'~'.join(sorted(set(l), key=lambda x: d[x]))
# 'aa~*yt~*cc'
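
One caveat with key=s.index, shown as a small sketch: str.index looks for a substring, so a value that is contained in an earlier value can get the same position as that value. The sample string below is made up for illustration; sorting by the position in the split list avoids the tie.

```python
s = "ab~b~a~ab"
parts = s.split("~")

# str.index searches for a substring, so 'a' and 'ab' both map to position 0.
print(s.index("a"), s.index("ab"))  # 0 0

# list.index compares whole elements, so every value gets a distinct position.
result = "~".join(sorted(set(parts), key=parts.index))
print(result)  # ab~b~a
```

The precomputed dictionary in the second snippet is also built from the split list, so it does not have this problem.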

edited Sep 02 '19 at 14:45

answered Sep 02 '19 at 14:30

yatu


sorted([...], key=s.index) is an interesting solution, thanks for sharing. – Andreas Sep 02 '19 at 14:56

score 1 · Answer 2 · answered Sep 02 '19 at 14:28

Is the order of the substrings relevant?

If the order is not important:

print("~".join(set("aa~*yt~*cc~aa".split("~"))))

If the order is important:

# f7 function source: https://stackoverflow.com/a/480227/11971785
def f7(seq):
    seen = set()
    seen_add = seen.add
    return [x for x in seq if not (x in seen or seen_add(x))]

print("~".join(f7("aa~*yt~*cc~aa".split("~"))))

score 1 · Answer 3 · answered Sep 02 '19 at 14:28

You can use enumerate with re.findall:

import re
d = "aa~*yt~*cc~aa" 
new_d = re.findall(r'\w+|\W', d)
r, c = [a for i, a in enumerate(new_d) if a.isalpha() and a not in new_d[:i]], iter([i for i in new_d if not i.isalpha()])
result = ''.join(f'{a}{next(c)}{next(c)}' if i < len(r) - 1 else a for i, a in enumerate(r))

Output:

'aa~*yt~*cc'

With re.findall, the delimiter characters do not need to be known in advance.
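
Note that the snippet above consumes exactly two delimiter characters per value (the two next(c) calls). As a rough sketch of a variant that copes with delimiters of any length, re.split with a capturing group keeps the delimiters in the result list. The function name dedupe_keep_delims is hypothetical, and the sketch assumes that values consist of word characters only.

```python
import re

def dedupe_keep_delims(text):
    # With a capturing group, re.split returns values at even indices
    # and the delimiters between them at odd indices.
    parts = re.split(r'(\W+)', text)
    seen = set()
    out = []
    for i in range(0, len(parts), 2):
        token = parts[i]
        if token in seen:
            continue  # repeated value: drop it together with its delimiter
        seen.add(token)
        if out:
            out.append(parts[i - 1])  # delimiter that preceded this value
        out.append(token)
    return ''.join(out)

print(dedupe_keep_delims("aa~*yt~*cc~aa"))   # aa~*yt~*cc
print(dedupe_keep_delims("aa~*yt~*cc~*aa"))  # aa~*yt~*cc
```

Each kept value is written with the delimiter that preceded it in the input, so mixed delimiters are preserved as they appeared.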

score 1 · Answer 4 · answered Sep 02 '19 at 14:31


One common way to ensure uniqueness while maintaining order (in all Python variants) uses a collections.OrderedDict:

from collections import OrderedDict as OD

s = "aa~*yt~*cc~aa"
sep = "~"

uniq = sep.join(OD.fromkeys(s.split(sep)))
# 'aa~*yt~*cc'

answered Sep 02 '19 at 14:31

user2390182


score 1 · Answer 5 · answered Sep 02 '19 at 14:34


Try this one:

>>> s="aa~*yt~*cc~aa"
>>> s_list=s.split("~")
>>> s_final = "~".join([s_list[i] for i in range(len(s_list)) if s_list[0:i].count(s_list[i])==0])
>>> s_final
'aa~*yt~*cc'

answered Sep 02 '19 at 14:34

Grzegorz Skibinski


score 1 · Answer 6 · answered Sep 02 '19 at 15:12

Since Python 3.7, dicts preserve insertion order, so you can use them:

>>> '~'.join(dict.fromkeys("aa~yt~cc~aa".split('~')).keys())
'aa~yt~cc'

For other Python versions you can use this solution: https://stackoverflow.com/a/57758708/7851254

However, I wouldn't recommend relying on such a non-obvious feature. You can stick with one of the other answers; just choose one that is understandable at first glance.
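
For reference, a minimal sketch of the same dict.fromkeys idea applied to the string and the "~*" delimiter from the question. The helper name unique_values is hypothetical, and the sketch assumes Python 3.7 or later, where insertion order of dict is guaranteed by the language.

```python
def unique_values(text, sep):
    """Drop repeated values, keeping the first occurrence and the original order."""
    # dict.fromkeys keeps one key per distinct value, in order of first appearance.
    return sep.join(dict.fromkeys(text.split(sep)))

print(unique_values("aa~*yt~*cc~*aa", "~*"))  # aa~*yt~*cc
```

Passing the separator as a parameter keeps the split and the join consistent, which is the mismatch the comments under the question point out.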
