Remove multiples of character sequence from string

Question

If I had a string like so:

my_string = 'this is is is is a string'

How would I remove the repeated `is` words so that only one remains?

The string could contain any number of repeated `is` words, such as:

my_string = 'this is is a string'
other_string = 'this is is is is is is is is a string'

A regex solution would be possible, I suppose, but I'm not sure how to go about it. Thanks.

Count the occurrences of the "is" string and keep deleting duplicates while the counter is > 1. – Mohit Sharma, Apr 13 '16 at 17:12
@MohitSharma Surely there must be a more efficient solution? – Pav Sidhu, Apr 13 '16 at 17:15
Do you want to remove only `is`, or any duplicate occurrences? For example, `'this is is is is a a a string string'` to `'this is a string'`. – Muhammad Tahir, Apr 13 '16 at 17:15
Related: http://stackoverflow.com/questions/2823016/regular-expression-for-consecutive-duplicate-words. – alecxe, Apr 13 '16 at 17:24
Please note that this is a question and answer site, not a code writing service. If you [edit] your question to describe what you have tried so far and where you are stuck, then we can try to help with specific problems. You should also read [ask]. – Toby Speight, Apr 13 '16 at 17:36

score 1 · Answer 1 · answered Apr 13 '16 at 17:21

You can use `itertools.groupby`:

from itertools import groupby
a = 'this is is is is a a a string string a a a'
# groupby yields one key per run of adjacent equal words
print(' '.join(word for word, _ in groupby(a.split(' '))))
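
To see why this works, it may help to look at the runs that `groupby` produces: it only merges adjacent equal items, so a word that reappears later in the string is kept. A minimal sketch (the helper name `collapse_runs` is made up for illustration and is not part of the answer):

```python
from itertools import groupby

def collapse_runs(text):
    # Hypothetical helper: keep one word per run of adjacent equal words.
    words = text.split(' ')
    return ' '.join(word for word, _ in groupby(words))

a = 'this is is is is a a a string string a a a'

# Each run of adjacent equal words, paired with its length.
runs = [(word, len(list(group))) for word, group in groupby(a.split(' '))]
print(runs)
# [('this', 1), ('is', 4), ('a', 3), ('string', 2), ('a', 3)]

print(collapse_runs(a))
# this is a string a
```

Note that the trailing `a a a` collapses to a single `a` but is not merged with the earlier `a`, since the two runs are not adjacent.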

Muhammad Tahir

Quinn · Answer 2 · 2016-04-13T18:11:42.620

score 1

Here is my approach:

my_string = 'this is is a string'
other_string = 'this is is is is is is is is a string'
def getStr(s):
    # Keep only the first occurrence of each word, in original order.
    res = []
    for word in s.split():
        if word not in res:
            res.append(word)
    return ' '.join(res)

print(getStr(my_string))
print(getStr(other_string))

Output:

this is a string
this is a string

UPDATE: A regex approach:

import re
# \b keeps the backreference from matching the start of a longer word
print(' '.join(re.findall(r'\b(\w+)(?:\s+\1\b)*', other_string)))

edited Apr 13 '16 at 18:11

answered Apr 13 '16 at 17:23

Quinn

Your non-regex approach will remove *any* subsequent occurrences of any word in the string: `getStr('this string is a string')` --> `'this string is a'`. While the question is unclear, I think this is probably not what the OP has in mind. – Henry Keiter Apr 13 '16 at 18:25
You are right. It's good for the OP's question. The regex way is more reliable I think. – Quinn Apr 13 '16 at 18:28
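
The difference raised in these comments can be sketched as follows. The function names `dedupe_all` and `dedupe_adjacent` are invented for this illustration and do not appear in the answer; `dedupe_all` is meant to mirror what `getStr` does.

```python
from itertools import groupby

def dedupe_all(s):
    # Mirrors getStr: drop every later occurrence of a word.
    seen = []
    for word in s.split():
        if word not in seen:
            seen.append(word)
    return ' '.join(seen)

def dedupe_adjacent(s):
    # Only collapse words repeated directly after one another.
    return ' '.join(word for word, _ in groupby(s.split()))

sample = 'this string is is a string'
print(dedupe_all(sample))       # this string is a
print(dedupe_adjacent(sample))  # this string is a string
```

Which behavior is wanted depends on how the question is read; the second one leaves non-adjacent repeats alone.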

score 0 · Answer 3 · answered Apr 13 '16 at 17:15

If you would like to remove consecutive duplicates, you can try:

words = my_string.split()
tmp = words[:1]  # first word, or an empty list for an empty string
for word in words[1:]:
    if word != tmp[-1]:
        tmp.append(word)
my_string = ' '.join(tmp)  # join avoids a trailing space

Of course, if you want something smarter than this, it is going to be more complicated.

score 0 · Answer 4 · answered Apr 13 '16 at 17:21

As a one-liner:

>>> import itertools
>>> my_string = 'this is is a string'
>>> " ".join([k for k, g in itertools.groupby(my_string.split())])
'this is a string'

Robert

Jan · Accepted Answer · 2016-04-13T18:24:11.207

Regex to the rescue!

((\b\w+\b)\s*\2\s*)+
# capturing group
# inner capturing group
# ... consisting of a word boundary, at least ONE word character and another boundary
# followed by optional whitespace
# and the previously captured word (the inner group, \2), then optional whitespace again
# the whole pattern needs to be present at least once, but can be there
# multiple times
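
The annotated pattern can also be written with Python's `re.VERBOSE` flag, which lets the comments live inside the pattern itself. This is only a sketch of the same expression; the variable names `rx_verbose` and `sample` are illustrative.

```python
import re

# Same pattern as above, spelled out with inline comments.
rx_verbose = re.compile(r"""
    (               # outer capturing group (group 1)
      (\b\w+\b)     # inner capturing group (group 2): one whole word
      \s*           # optional whitespace
      \2            # the word captured by group 2, again
      \s*           # optional trailing whitespace
    )+              # the whole pair may occur more than once
""", re.VERBOSE)

sample = 'this is is a string'
print(rx_verbose.sub(r'\2 ', sample))
# this is a string
```

In verbose mode, unescaped whitespace and everything after `#` in the pattern are ignored, so the layout does not change what is matched.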

Python Code

import re

string = """
this is is is is is is is is a string
and here is another another another another example
"""
rx = r'((\b\w+\b)\s*\2\s*)+'

string = re.sub(rx, r'\2 ', string)
print(string)
# this is a string
# and here is another example
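
One thing worth checking before relying on this pattern: it consumes words in pairs, so a run with an odd number of repeats leaves one duplicate behind, and because `\s*` may match nothing, `\2` can match the start of a longer word. A possible variant, offered as a sketch rather than as the answer's own code, captures the word once and then matches one or more whole-word repeats (the names `pair_rx` and `run_rx` are made up here):

```python
import re

pair_rx = r'((\b\w+\b)\s*\2\s*)+'   # pattern from the answer
run_rx = r'\b(\w+)(?:\s+\1\b)+'     # variant: a word, then 1+ whole-word repeats

odd = 'this is is is a string'      # three repeats of "is"

print(re.sub(pair_rx, r'\2 ', odd))
# this is is a string
print(re.sub(run_rx, r'\1', odd))
# this is a string

# The variant leaves a longer word that merely starts with the repeat alone.
print(re.sub(run_rx, r'\1', 'this is island'))
# this is island
```

The variant also replaces with `\1` alone, so it does not append an extra space after the kept word.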

Demos

See a demo for this approach on regex101.com as well as on ideone.com
