Is there a fast algorithm to remove repeated substrings in a string?

Question

There is a string like it

"dxabcabcyyyydxycxcxz"

and I want to merge it into

"dxabcydxycxz"

Other examples: "ddxddx" -> "dxdx" , "abbab" -> "abab".

The rule is that :

if (adjacent and same): merge

# Such as 'abc', they are same, so delete one of them
# Although 'dx' is same as 'dx', they are nonadjacent, so do not delete any of them
# If one character has been deleted, don't delete any substring, include it

I've done it in Python, but it's slow when applied to a long string.

# Original string
mystr = "dxabcabcyyyydxycxcxz"
str_len = len(mystr)
vis = [1] * str_len  # Use a list to mark which char is deleted

# Enumerate the size of substring
for i in range(1,str_len):
    # Enumerate the begin of the substring
    for j in range(0, str_len):
        offset = 2 #the size of sub-str + 1
        current_sub_str = mystr[j:j+i]
        s_begin = j+i*(offset-1)
        s_end = j+(i*offset)
        # Delete all of the same char
        while((j+(i*offset) <= str_len) and current_sub_str == mystr[s_begin:s_end]
              and 0  not in vis[s_begin:s_end] and 0  not in vis[j:j+i]):
            vis[s_begin:s_end] = [0] * (s_end - s_begin)  # If it was deleted, mark it as 0
            offset += 1
            s_begin = j + i * (offset - 1)
            s_end = j + (i * offset)

res = []
for i in range(0,str_len):
    if(vis[i]!=0): res.append(mystr[i])

print "".join(res)

Is there any faster way to solve it?

Update April 29, 2017

Sorry, it seems to be like an XY problem. On the other hand, it maybe not. There is the content I was coding for a web spider and got many 'tag-path's like those:

ul/li/a
ul/li/div/div/div/a/span
ul/li/div/div/div/a/span 
ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a

As you see, some of the 'tag-path's are identical, so I wanted to collapse them to find out if there is any other 'tag-path's with the same structure.

After collapsing, I get the 'tag-path' like this.

ul/li/a
ul/li/div/div/div/a/span
ul/li/a
ul/li/ul/li/a
ul/li/a
ul/li/ul/li/a
ul/li/a
ul/li/ul/li/a

This is only my idea and I didn't know whether it is suitable to do this way. (After trying, I chose another way to do it).

However there is an interesting question like an ACM question.

So, I simplify one 'tag-path' to a character and ask for help. Because I didn't do a fast way by myself. Actually, the question has many corner cases that I don't mind and thank all for helping me to complete it.

Thanks all.

what happen if you find "ddxddx" ? Would you end up with "ddx" or "dxdx" ? — Trolldejo, Apr 28 '17 at 09:24
Sorry, the expectation of 'ddxddx' is 'dxdx' and 'abbab' is 'abab'. I find some problems in my code. — haomao, Apr 28 '17 at 09:38
If `ddxddx` becomes `dxdx` I don't see how `yyyy` becomes `y` and not `yy` — silel, Apr 28 '17 at 10:44
Why is `abcabc` contracted to `abc` but not `ddxddx` to `ddx`? Shouldn’t this algorithm look for the longest possible substring that gets repeated? — poke, Apr 28 '17 at 11:00
@ poke, Yes. "yyyy" will be collapsed in the one operation by one character. But "ddxddx" don't, because when we get "dxdx", we won't go on, "d" has been merged. — haomao, Apr 28 '17 at 12:40
@Holloway thank you. I am sorry for I didn't describe the problem clearly — haomao, Apr 28 '17 at 12:44
Just to be sure, should "abadabad" be compressed into "abad"? Could you also explain the point of your algorithm just to make sure this is not some kind of [XY problem](https://meta.stackexchange.com/q/66377)? Give us some context? — silel, Apr 28 '17 at 15:10
Another corner case: `abbcabbc`. My opinion, based on your previous examples, is that you want it to become `abcabc`. If so none of the answer (right now) seems to yield the correct output (except yours). — silel, Apr 28 '17 at 16:02
Sorry, it seems to like a XY problem.On the other hand,it maybe not. there is the content I was coding for a web spider and got many 'tag-path's like those `ul/li/a ul/li/div/div/div/a/span ul/li/div/div/div/a/span ul/li/a ul/li/ul/li/a ul/li/ul/li/a ul/li/ul/li/a ul/li/ul/li/a ul/li/ul/li/a ul/li/ul/li/a ul/li/ul/li/a ul/li/ul/li/a ul/li/a ul/li/ul/li/a ul/li/ul/li/a ul/li/ul/li/a ul/li/ul/li/a ul/li/ul/li/a ul/li/a ul/li/ul/li/a ul/li/ul/li/a ul/li/ul/li/a ul/li/ul/li/a ul/li/ul/li/a ul/li/ul/li/a` — haomao, Apr 29 '17 at 06:06
I update the question. "abbcabbc "-> "abcabc" and "abadabad"->"abad" — haomao, Apr 29 '17 at 06:10
I do not quite understand what it is you want to do... I get that you are working on an html document. What does each line/"tag-path" represent? Why do you want to collapse them? — silel, May 02 '17 at 09:35
Your input/output suggest you could have a look at what [`uniq`](https://linux.die.net/man/1/uniq) does (in bash). — silel, May 02 '17 at 09:38

score 16 · Answer 1 · answered Apr 28 '17 at 13:20

16

Behold the power of regex:

>>> import re

>>> re.sub(r"(.+?)\1+", r"\1", "dxabcabcyyyydxycxcxz")
'dxabcydxycxz'

>>> re.sub(r"(.+?)\1+", r"\1", "ddxddx")
'dxdx'

>>> re.sub(r"(.+?)\1+", r"\1", "abbab")
'abab'

This looks for a sequence of 1 or more arbitrary characters (.+?) (as a non-greedy match, so that it tries shorter sequences first), followed by 1 or more repetitions of the matched sequence \1+, and replaces it all with just the matched sequence \1.

answered Apr 28 '17 at 13:20

jasonharper

9,450
2
18
42

Wow, I got the power of regex.However , As @silel 's case "abbcabbc" -> "abcabc". But this way only get "abbc" – haomao Apr 29 '17 at 06:27

silel · Answer 2 · 2017-04-28T13:17:24.973

0

This can be a start:

for i in range(len(string)):
    for j in range(i + 1, len(string)):
        while string[i:j] == string[j:j + j - i]:
            string = string[:j] + string[j + j - i:]

The result on the examples provided:

abbab  -> abab
ddxddx -> dxdx
abcabcabc -> abc
dxabcabcyyyydxycxcxz -> dxabcydxycxz

edited Apr 28 '17 at 13:17

answered Apr 28 '17 at 10:52

silel

567
2
10

In both cases, the repetitions of 1 character get squeezed. Then you would have to consider the repetitions of blocks of 2 characters, and so on. – Thierry Lathuille Apr 28 '17 at 11:14
With that rationale you transform `ddxddx` into `ddx`, don't you? – silel Apr 28 '17 at 12:20
No. First dd -> d, and dd ->d. So you get dxdx. – Thierry Lathuille Apr 28 '17 at 12:26
Thank you. In my opinion yyyy -> y, but in the "ddxddx",the character 'x' break sequence about "dddd". – haomao Apr 28 '17 at 12:54

Thomas M · Answer 3 · 2017-04-28T15:57:36.143

This is a great question/series of responses!

Here's an implementation using a generator and string slicing:

import math
def dedupe(string, step=1):
    index = 0
    prior = ''
    while index < len(string):
        letter = string[index]
        window = index + step
        comparison = string[index:window]
        if comparison != prior:
            yield letter
            prior += letter
            index += 1
        else:
            index += step
        if len(prior) > (step):
            prior = prior[1:] # remove first character


def collapse(string):
    step = 1
    while step < math.sqrt(len(string)):
        generator = dedupe(string, step=step)
        string = ''.join(generator)
        step +=1
    return string

Edit: changed the step search to use the square root of the length to improve search times:

%timeit collapse('dxabcabcyyyydxycxcxz') 10000 loops, best of 3: 24.7 µs per loop
%timeit collapse(randomword(100) 1000 loops, best of 3: 384 µs per loop
%timeit collapse("a" * 100) 10000 loops, best of 3: 27.1 µs per loop
%timeit collapse(randomword(50) * 2) 1000 loops, best of 3: 382 µs per loop

I am still waiting for OP's answer on this but your solution on `abadabad` seems to return `abadabad`? Can you confirm? — silel, Apr 28 '17 at 15:56
@silel Confirmed, that test case breaks my implementation, thanks for the heads up! It would only work if I change the while loop condition back to step < len(string) but that slows down the algo significantly. — Thomas M, Apr 28 '17 at 15:59

score 0 · Answer 4 · answered Apr 30 '17 at 05:53

One line:

def remove_repeats(iterable):
    return [e for (i, e) in enumerate(iterable) if i == 0 or e != iterable[i - 1]]

It works with any iterable data, returns list.

>>> print remove_repeats("aaabbc")
['a', 'b', 'c']

>>> s = '''
... ul/li/a
... ul/li/div/div/div/a/span
... ul/li/div/div/div/a/span
... ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... '''

>>> print remove_repeats(s.split())
['ul/li/a', 'ul/li/div/div/div/a/span', 'ul/li/a', 'ul/li/ul/li/a', 'ul/li/a', '
ul/li/ul/li/a', 'ul/li/a', 'ul/li/ul/li/a']

Join if you need a string:

>>> print "".join(remove_repeats('111222333'))
123

>>> print "\n".join(remove_repeats(s.split()))
ul/li/a
ul/li/div/div/div/a/span
ul/li/a
ul/li/ul/li/a
ul/li/a
ul/li/ul/li/a
ul/li/a
ul/li/ul/li/a

score 0 · Answer 5 · answered May 02 '17 at 05:03

from collections import OrderedDict
mystr = "dxabcabcyyyydxycxcxz"
index=0;indexs = [];count = OrderedDict()
while count!=None:
    count = {}
    for index in range(0,len(mystr)):
        flag = True
        for index1 in range(0,index+1)[::-1]:
            if(mystr.startswith(mystr[index1:index+1], index+1)):
                if count.get(str(index1),0)<(index+1-index1):
                    count.update({str(index1) : index+1-index1})
    for key in count:
        mystr = mystr[:int(key)]+mystr[int(key)+count[key]:]
    if count=={}:
        count=None
print "Answer:", mystr

score 0 · Answer 6 · answered May 07 '17 at 07:43

0

One linear approach

import itertools
_str = 'dxabcabcyyyydxycxcxz'
print ''.join(ch for ch, _ in itertools.groupby(_str))

result:

dxabcabcyyyydxycxcxz -> dxabcabcydxycxcxz

answered May 07 '17 at 07:43

brc

249
2
6

Is there a fast algorithm to remove repeated substrings in a string?

6 Answers6

Linked