I have to remove duplicate values from a string in which the values are separated by a delimiter. My sample string is "aa~*yt~*cc~*aa", where ~* is the delimiter, and I need to remove the duplicate occurrence of aa.

I tried using set and also the code below, but the output is:

"a~*ytc"

However, I need this output:

"aa~*yt~*cc"

d = {}
s = "aa~*yt~*cc~*aa"
res = []
for c in s:
    if c not in d:
        res.append(c)
        d[c] = 1
print("".join(res))

I have gone through many of the existing answers but was not able to solve this. Please let me know if there is a solution. Thanks, I really appreciate your time :)

ankit

6 Answers

You could split the string on the separator, take the set of the resulting list (to remove duplicates), sort the elements by their order of appearance in the original string, and join them again with ~* as the delimiter:

s = "aa~*yt~*cc~*aa"

'~*'.join(sorted(set(s.split('~*')), key=s.index))
# 'aa~*yt~*cc'

If performance is important, build the dictionary used to sort the set beforehand, so that each sort key is a dictionary lookup instead of a scan of the string:

l = s.split('~*')
length = len(l)
# iterate in reverse so the first occurrence of each value is written last and wins
d = {j: length - i for i, j in enumerate(l[::-1])}
# {'aa': 1, 'cc': 3, 'yt': 2}
'~*'.join(sorted(set(l), key=lambda x: d[x]))
# 'aa~*yt~*cc'
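
One caveat with key=s.index: it searches for a substring, not a whole value, so a value that also occurs inside an earlier, longer value (for example aa inside xaa) is ranked by that earlier position. The following is a minimal sketch that records the first position of each whole value instead. The helper names first_positions and dedupe are hypothetical, and the ~* delimiter is assumed from the question:

```python
def first_positions(tokens):
    """Map each token to the index of its first occurrence (hypothetical helper)."""
    positions = {}
    for i, tok in enumerate(tokens):
        positions.setdefault(tok, i)  # only the earliest index is stored
    return positions


def dedupe(s, sep="~*"):
    """Remove repeated values from a delimited string, keeping first occurrences."""
    pos = first_positions(s.split(sep))
    return sep.join(sorted(pos, key=pos.get))


s = "xaa~*b~*aa"
# Substring search ranks 'aa' at index 1 (inside 'xaa'), so the order changes:
print("~*".join(sorted(set(s.split("~*")), key=s.index)))  # xaa~*aa~*b
# Ranking by whole-value position keeps the original order:
print(dedupe(s))                                           # xaa~*b~*aa
print(dedupe("aa~*yt~*cc~*aa"))                            # aa~*yt~*cc
```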
yatu

Is the order of the substrings relevant?

If the order is not important:

print("~*".join(set("aa~*yt~*cc~*aa".split("~*"))))

If the order is important:

#f7 function source: https://stackoverflow.com/a/480227/11971785
def f7(seq):
    seen = set()
    seen_add = seen.add
    return [x for x in seq if not (x in seen or seen_add(x))]

print("~*".join(f7("aa~*yt~*cc~*aa".split("~*"))))
Andreas

You can use enumerate with re.findall:

import re
d = "aa~*yt~*cc~*aa"
new_d = re.findall(r'\w+|\W', d)
r, c = [a for i, a in enumerate(new_d) if a.isalpha() and a not in new_d[:i]], iter([i for i in new_d if not i.isalpha()])
result = ''.join(f'{a}{next(c)}{next(c)}' if i < len(r) - 1 else a for i, a in enumerate(r))

Output:

'aa~*yt~*cc'

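With re.findall, the delimiter characters do not need to be known in advance. Note that the code above still assumes each delimiter is exactly two characters long (it calls next(c) twice per value) and that the values are purely alphabetic.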
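
The version above consumes exactly two separator characters per value. If separators may vary in length, one possible variation (a sketch, not the method used above) is re.split with a capturing group, which returns values and separators alternately. The helper name dedupe_keep_separators is hypothetical, and treating any run of non-word characters as a separator is an assumption:

```python
import re


def dedupe_keep_separators(text):
    # With a capturing group, re.split returns
    # [value, sep, value, sep, ..., value].
    parts = re.split(r'(\W+)', text)
    tokens, seps = parts[0::2], parts[1::2]

    seen = {tokens[0]}
    out = [tokens[0]]
    # Pair each later value with the separator that precedes it.
    for sep, tok in zip(seps, tokens[1:]):
        if tok not in seen:
            seen.add(tok)
            out.append(sep)
            out.append(tok)
    return ''.join(out)


print(dedupe_keep_separators("aa~*yt~*cc~*aa"))  # aa~*yt~*cc
print(dedupe_keep_separators("aa--yt~*cc;aa"))   # aa--yt~*cc
```

Each kept value carries the separator that preceded it in the input, so mixed separators survive unchanged.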

Ajax1234

One common way to ensure uniqueness while maintaining order (in all Python variants) uses a collections.OrderedDict:

from collections import OrderedDict as OD

s = "aa~*yt~*cc~*aa"
sep = "~*"

uniq = sep.join(OD.fromkeys(s.split(sep)))
# 'aa~*yt~*cc'
user2390182

Try this one:

>>> s = "aa~*yt~*cc~*aa"
>>> s_list = s.split("~*")
>>> s_final = "~*".join([x for i, x in enumerate(s_list) if x not in s_list[:i]])
>>> s_final
'aa~*yt~*cc'
Grzegorz Skibinski

Since Python 3.7, dicts are guaranteed to preserve insertion order, so you can use them:

>>> '~*'.join(dict.fromkeys("aa~*yt~*cc~*aa".split('~*')))
'aa~*yt~*cc'

For earlier Python versions you can use this solution: https://stackoverflow.com/a/57758708/7851254

However, I wouldn't recommend relying on such a non-obvious feature. You can stick with one of the other answers; just choose one that is understandable at first glance.
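
If the concern is that relying on dict ordering is not obvious to a reader, one option is to wrap it in a small named function and state the intent in a docstring. A minimal sketch, where the function name unique_in_order and the default ~* separator are assumptions for illustration:

```python
def unique_in_order(text, sep="~*"):
    """Drop repeated values from a delimited string, keeping first occurrences.

    Relies on dicts preserving insertion order (guaranteed from Python 3.7).
    """
    return sep.join(dict.fromkeys(text.split(sep)))


print(unique_in_order("aa~*yt~*cc~*aa"))      # aa~*yt~*cc
print(unique_in_order("a,b,a,c,b", sep=","))  # a,b,c
```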

Alexandr Zayets