11

There is a string like it

"dxabcabcyyyydxycxcxz"

and I want to merge it into

"dxabcydxycxz"

Other examples: "ddxddx" -> "dxdx" , "abbab" -> "abab".

The rule is that :

if (adjacent and same): merge

# Such as 'abc', they are same, so delete one of them
# Although 'dx' is same as 'dx', they are nonadjacent, so do not delete any of them
# If one character has been deleted, don't delete any substring, include it

I've done it in Python, but it's slow when applied to a long string.

# Original string
mystr = "dxabcabcyyyydxycxcxz"
str_len = len(mystr)
vis = [1] * str_len  # Use a list to mark which char is deleted

# Enumerate the size of substring
for i in range(1,str_len):
    # Enumerate the begin of the substring
    for j in range(0, str_len):
        offset = 2 #the size of sub-str + 1
        current_sub_str = mystr[j:j+i]
        s_begin = j+i*(offset-1)
        s_end = j+(i*offset)
        # Delete all of the same char
        while((j+(i*offset) <= str_len) and current_sub_str == mystr[s_begin:s_end]
              and 0  not in vis[s_begin:s_end] and 0  not in vis[j:j+i]):
            vis[s_begin:s_end] = [0] * (s_end - s_begin)  # If it was deleted, mark it as 0
            offset += 1
            s_begin = j + i * (offset - 1)
            s_end = j + (i * offset)

res = []
for i in range(0,str_len):
    if(vis[i]!=0): res.append(mystr[i])

print "".join(res)
   

Is there any faster way to solve it?

Update April 29, 2017

Sorry, it seems to be like an XY problem. On the other hand, it maybe not. There is the content I was coding for a web spider and got many 'tag-path's like those:

ul/li/a
ul/li/div/div/div/a/span
ul/li/div/div/div/a/span 
ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a

As you see, some of the 'tag-path's are identical, so I wanted to collapse them to find out if there is any other 'tag-path's with the same structure.

After collapsing, I get the 'tag-path' like this.

ul/li/a
ul/li/div/div/div/a/span
ul/li/a
ul/li/ul/li/a
ul/li/a
ul/li/ul/li/a
ul/li/a
ul/li/ul/li/a

This is only my idea and I didn't know whether it is suitable to do this way. (After trying, I chose another way to do it).

However there is an interesting question like an ACM question.

So, I simplify one 'tag-path' to a character and ask for help. Because I didn't do a fast way by myself. Actually, the question has many corner cases that I don't mind and thank all for helping me to complete it.

Thanks all.


NickS1
  • 496
  • 1
  • 6
  • 19
haomao
  • 139
  • 2
  • 10
  • 5
    what happen if you find "ddxddx" ? Would you end up with "ddx" or "dxdx" ? – Trolldejo Apr 28 '17 at 09:24
  • 3
    And how about `abbab` - does it become `ab` or `abab`? – dimo414 Apr 28 '17 at 09:25
  • Sorry, the expectation of 'ddxddx' is 'dxdx' and 'abbab' is 'abab'. I find some problems in my code. – haomao Apr 28 '17 at 09:38
  • If `ddxddx` becomes `dxdx` I don't see how `yyyy` becomes `y` and not `yy` – silel Apr 28 '17 at 10:44
  • 3
    Why is `abcabc` contracted to `abc` but not `ddxddx` to `ddx`? Shouldn’t this algorithm look for the longest possible substring that gets repeated? – poke Apr 28 '17 at 11:00
  • @silel, I'm guessing `yyyy` gets replaced as one operation. – Holloway Apr 28 '17 at 11:22
  • @ poke, Yes. "yyyy" will be collapsed in the one operation by one character. But "ddxddx" don't, because when we get "dxdx", we won't go on, "d" has been merged. – haomao Apr 28 '17 at 12:40
  • @Holloway thank you. I am sorry for I didn't describe the problem clearly – haomao Apr 28 '17 at 12:44
  • Just to be sure, should "abadabad" be compressed into "abad"? Could you also explain the point of your algorithm just to make sure this is not some kind of [XY problem](https://meta.stackexchange.com/q/66377)? Give us some context? – silel Apr 28 '17 at 15:10
  • Another corner case: `abbcabbc`. My opinion, based on your previous examples, is that you want it to become `abcabc`. If so none of the answer (right now) seems to yield the correct output (except yours). – silel Apr 28 '17 at 16:02
  • Sorry, it seems to like a XY problem.On the other hand,it maybe not. there is the content I was coding for a web spider and got many 'tag-path's like those `ul/li/a ul/li/div/div/div/a/span ul/li/div/div/div/a/span ul/li/a ul/li/ul/li/a ul/li/ul/li/a ul/li/ul/li/a ul/li/ul/li/a ul/li/ul/li/a ul/li/ul/li/a ul/li/ul/li/a ul/li/ul/li/a ul/li/a ul/li/ul/li/a ul/li/ul/li/a ul/li/ul/li/a ul/li/ul/li/a ul/li/ul/li/a ul/li/a ul/li/ul/li/a ul/li/ul/li/a ul/li/ul/li/a ul/li/ul/li/a ul/li/ul/li/a ul/li/ul/li/a` – haomao Apr 29 '17 at 06:06
  • I update the question. "abbcabbc "-> "abcabc" and "abadabad"->"abad" – haomao Apr 29 '17 at 06:10
  • I do not quite understand what it is you want to do... I get that you are working on an html document. What does each line/"tag-path" represent? Why do you want to collapse them? – silel May 02 '17 at 09:35
  • Your input/output suggest you could have a look at what [`uniq`](https://linux.die.net/man/1/uniq) does (in bash). – silel May 02 '17 at 09:38

6 Answers6

16

Behold the power of regex:

>>> import re

>>> re.sub(r"(.+?)\1+", r"\1", "dxabcabcyyyydxycxcxz")
'dxabcydxycxz'

>>> re.sub(r"(.+?)\1+", r"\1", "ddxddx")
'dxdx'

>>> re.sub(r"(.+?)\1+", r"\1", "abbab")
'abab'

This looks for a sequence of 1 or more arbitrary characters (.+?) (as a non-greedy match, so that it tries shorter sequences first), followed by 1 or more repetitions of the matched sequence \1+, and replaces it all with just the matched sequence \1.

jasonharper
  • 9,450
  • 2
  • 18
  • 42
  • Wow, I got the power of regex.However , As @silel 's case "abbcabbc" -> "abcabc". But this way only get "abbc" – haomao Apr 29 '17 at 06:27
0

This can be a start:

for i in range(len(string)):
    for j in range(i + 1, len(string)):
        while string[i:j] == string[j:j + j - i]:
            string = string[:j] + string[j + j - i:]

The result on the examples provided:

abbab  -> abab
ddxddx -> dxdx
abcabcabc -> abc
dxabcabcyyyydxycxcxz -> dxabcydxycxz
silel
  • 567
  • 2
  • 10
0

This is a great question/series of responses!

Here's an implementation using a generator and string slicing:

import math
def dedupe(string, step=1):
    index = 0
    prior = ''
    while index < len(string):
        letter = string[index]
        window = index + step
        comparison = string[index:window]
        if comparison != prior:
            yield letter
            prior += letter
            index += 1
        else:
            index += step
        if len(prior) > (step):
            prior = prior[1:] # remove first character


def collapse(string):
    step = 1
    while step < math.sqrt(len(string)):
        generator = dedupe(string, step=step)
        string = ''.join(generator)
        step +=1
    return string

Edit: changed the step search to use the square root of the length to improve search times:

  • %timeit collapse('dxabcabcyyyydxycxcxz') 10000 loops, best of 3: 24.7 µs per loop
  • %timeit collapse(randomword(100) 1000 loops, best of 3: 384 µs per loop
  • %timeit collapse("a" * 100) 10000 loops, best of 3: 27.1 µs per loop
  • %timeit collapse(randomword(50) * 2) 1000 loops, best of 3: 382 µs per loop
Thomas M
  • 11
  • 2
  • I am still waiting for OP's answer on this but your solution on `abadabad` seems to return `abadabad`? Can you confirm? – silel Apr 28 '17 at 15:56
  • @silel Confirmed, that test case breaks my implementation, thanks for the heads up! It would only work if I change the while loop condition back to step < len(string) but that slows down the algo significantly. – Thomas M Apr 28 '17 at 15:59
0

One line:

def remove_repeats(iterable):
    return [e for (i, e) in enumerate(iterable) if i == 0 or e != iterable[i - 1]]

It works with any iterable data, returns list.

>>> print remove_repeats("aaabbc")
['a', 'b', 'c']

>>> s = '''
... ul/li/a
... ul/li/div/div/div/a/span
... ul/li/div/div/div/a/span
... ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... '''

>>> print remove_repeats(s.split())
['ul/li/a', 'ul/li/div/div/div/a/span', 'ul/li/a', 'ul/li/ul/li/a', 'ul/li/a', '
ul/li/ul/li/a', 'ul/li/a', 'ul/li/ul/li/a']

Join if you need a string:

>>> print "".join(remove_repeats('111222333'))
123

>>> print "\n".join(remove_repeats(s.split()))
ul/li/a
ul/li/div/div/div/a/span
ul/li/a
ul/li/ul/li/a
ul/li/a
ul/li/ul/li/a
ul/li/a
ul/li/ul/li/a
0
from collections import OrderedDict
mystr = "dxabcabcyyyydxycxcxz"
index=0;indexs = [];count = OrderedDict()
while count!=None:
    count = {}
    for index in range(0,len(mystr)):
        flag = True
        for index1 in range(0,index+1)[::-1]:
            if(mystr.startswith(mystr[index1:index+1], index+1)):
                if count.get(str(index1),0)<(index+1-index1):
                    count.update({str(index1) : index+1-index1})
    for key in count:
        mystr = mystr[:int(key)]+mystr[int(key)+count[key]:]
    if count=={}:
        count=None
print "Answer:", mystr
Suraj Dubal
  • 27
  • 1
  • 2
0

One linear approach

import itertools
_str = 'dxabcabcyyyydxycxcxz'
print ''.join(ch for ch, _ in itertools.groupby(_str))

result:

dxabcabcyyyydxycxcxz -> dxabcabcydxycxcxz

brc
  • 249
  • 2
  • 6