Replace two adjacent duplicate characters in string with the next character in alphabet

Question

I want to write a function, that will find the first occurrence of two adjacent characters that are the same, replace them with a single character that is next in the alphabet and go over the string until there are no duplicates left. In case of "zz" it should go in a circular fashion back to "a". The string can only include characters a-z, that is, no capital letters or non-alphabetical characters. I have written a function that does it, but it is not effective enough for a very long string.

def solve(s):
    i = 1
    while i < len(s):
        if s[i] == s[i-1]:
            r = s[i+1:]
            l = s[:i-1]
            if s[i] == "z":
                x = "a"
            else:
                x = chr(ord(s[i])+1)
            i = 1
            s = l+x+r
        else:
            i += 1
    return s

So for example if s = 'aabbbc' the function should work like aabbbc --> bbbbc --> cbbc --> ccc and finally return dc. How can I make it more efficient?

Edit: for example if s = 'ab'*10**4 + 'cc'*10**4 + 'dd'*10**4 this function is taking a lot of time.

@mkrieger1 For example, if s is something like 'ab'*10^4 + 'cc'*10^44 + 'dd'*10^44, this function takes a lot of time — user932895, Mar 09 '23 at 21:53
On a small note, the choice of which pair of duplicates to replace first is relevant. For instance, string `'aaa'` could become either `'ab'` or `'ba'` depending on whether the first `'aa'` or the second `'aa'` is replaced first. As another example, `'aabbc'` can go either `aabbc -> bbbc -> cbc` or `aabbc -> aacc -> bd` — Stef, Mar 09 '23 at 22:08
@Stef Yes, with "first occurrence of two adjacent characters that are the same" I meant that the one that should be looked for at each point should be the one that occurs first in the string — user932895, Mar 09 '23 at 22:16

Stef · Accepted Answer · 2023-03-09T22:16:38.313

2

As a trivial optimisation: instead of the hard reset i = 1, you can use a softer reset i = i-1. Indeed, if there was no duplicate between 1 and i before the transformation, then there still won't be any duplicate between 1 and i-1 after the transformation.

def solve(s):
    i = 1
    while i < len(s):
        if s[i] == s[i-1]:
            l = s[:i-1]
            r = s[i+1:]  # I swapped these two lines because I read left-to-right
            if s[i] == "z":
                x = "a"
            else:
                x = chr(ord(s[i])+1)
            s = l+x+r
            i = max(1, i-1)  # important change is here
        else:
            i += 1
    return s

s = 'ab'*10**4 + 'cc'*10**4 + 'dd'*10**4
t = solve(s)

print(t[2*10**4:])
# rqpnlih

print(t == 'ab'*10**4 + 'rqpnlih')
# True

edited Mar 09 '23 at 22:16

answered Mar 09 '23 at 22:04

Stef

13,242
2
17
28

1

Thank you this helped! With a string like `s = 'zzzz'` this first resulted in a index error, but I just wrote `i -= 1` inside a `if i > 1` and that solved the error too :) – user932895 Mar 09 '23 at 22:14
1

@user932895 Oops, yes, you're right. I edited accordingly. – Stef Mar 09 '23 at 22:15

Alain T. · Answer 2 · 2023-03-10T01:59:00.133

2

You can build the result in a list that you use like a stack, adding letters that are different from the last one and popping back with the next letter when they match:

def compact(s):
    result = []
    nextChar = str.maketrans({i+96:i%26+97 for i in range(1,27)})
    for c in s:
        while result and result[-1] == c:
            c = result.pop(-1).translate(nextChar)
        result.append(c)
    return "".join(result)

output:

print(compact("aabbbc"))
# dc

s = 'ab'*10**4 + 'cc'*10**4 + 'dd'*10**4
print(compact(s)[-11:])
# ababrqpnlih

edited Mar 10 '23 at 01:59

answered Mar 10 '23 at 01:28

Alain T.

40,517
4
31
51

I got the same final 11 characters for your second example, with the entire string returned containing 20,007 characters, suggesting both of our methods are working correctly. – Cary Swoveland Mar 10 '23 at 18:44

score 2 · Answer 3 · answered Mar 10 '23 at 02:55

The main causes of slowdown here are the O(N^2) behaviour resulting from:

Repeated slicing and re-creation of s (a new string has to be allocated and copied every time, because Python's strings are immutable).
Resetting iteration to the beginning of the newly-created s whenever a duplicate letter is found, such that it must read past all the non-duplicates again. For example, when reducing 'abcdeffhh', after combining the 'ff' into 'g', the next iteration will scan past all of that before considering the 'hh'. (This is already pointed out in Stef's answer.)

I thought of the same basic approach as Alain T., although I can offer some micro-optimizations. I first want to explain why this stack-based approach is so much faster:

.pop from the end of a list is O(1); the list is mutable, so it just needs to clear out a reference to the last object and update its element count. However, removing elements from anywhere else involves shifting all the following elements down to fill the gap.) Similarly, because the new elements are going on to a separate data structure, there's never a need to re-create the s string.
Iterating this way allows us to use the natural, Pythonic iteration without having to index back in through an auxiliary range object. Since only one element from the input is needed at a time (it will be compared to the top of the stack). (Normally, iterating over overlapping pairs of a list can also be done much more elegantly, but standard approaches assume that the input won't be modified. In general, algorithms that try to modify an input sequence while iterating over it are a bad idea and hard to get right.)
Finally, the actual replacement with "the next letter, looping around at z" can be simplified by preparing a lookup. The string type provides a static method maketrans which can create such a lookup easily; it's suitable for use by the translate method of strings, which applies the lookup to each character of the string. However, since we are only working with single-character strings for this part, it would equally work to build a dictionary that directly maps letters to letters, and use it normally. (str.maketrans produces a dictionary, but it maps integers - the results of ord, basically - to other integers or None.)

Of course, "practicality beats purity", so none of this matters without testing. I made a file q75690334.py to test the performance of all these approaches:

def solve_user932895(s):
    i = 1
    while i < len(s):
        if s[i] == s[i-1]:
            r = s[i+1:]
            l = s[:i-1]
            if s[i] == "z":
                x = "a"
            else:
                x = chr(ord(s[i])+1)
            i = 1
            s = l+x+r
        else:
            i += 1
    return s

def solve_stef(s):
    i = 1
    while i < len(s):
        if s[i] == s[i-1]:
            l = s[:i-1]
            r = s[i+1:]
            if s[i] == "z":
                x = "a"
            else:
                x = chr(ord(s[i])+1)
            s = l+x+r
            i = max(1, i-1)
        else:
            i += 1
    return s

def solve_alain(s):
    result = []
    nextChar = str.maketrans({i+96:i%26+97 for i in range(1,27)})
    for c in s:
        while result and result[-1] == c:
            c = result.pop(-1).translate(nextChar)
        result.append(c)
    return "".join(result)

# with my suggestion about the lookup, more readable initialization,
# and some other micro-optimizations
abc = 'abcdefghijklmnopqrstuvwxyz'
g_lookup = dict(zip(abc, abc[1:] + abc[:1]))
def solve_karl(s):
    # put a dummy element in the list to simplify the while loop logic
    result = [None]
    # faster attribute access with a local
    lookup = g_lookup 
    for c in s:
        while result[-1] == c:
            c = lookup[result.pop(-1)]
        result.append(c)
    return "".join(result[1:])

def make_test_string(n):
    return 'ab'*n + 'cc'*n + 'dd'*n

if __name__ == '__main__':
    s = make_test_string(10**3)
    assert solve_user932895(s) == solve_alain(s) == solve_karl(s) == solve_stef(s)

Verifying correctness:

$ python q75690334.py

(no assertion was raised)

Timing with n == 10**3 at the command line:

$ python -m timeit -s 'import q75690334' -s 's = q75690334.make_test_string(10**3)' -- 'q75690334.solve_user932895(s)'
1 loop, best of 5: 1.34 sec per loop
$ python -m timeit -s 'import q75690334' -s 's = q75690334.make_test_string(10**3)' -- 'q75690334.solve_stef(s)'
50 loops, best of 5: 5.56 msec per loop
$ python -m timeit -s 'import q75690334' -s 's = q75690334.make_test_string(10**3)' -- 'q75690334.solve_alain(s)'
200 loops, best of 5: 1.42 msec per loop
$ python -m timeit -s 'import q75690334' -s 's = q75690334.make_test_string(10**3)' -- 'q75690334.solve_karl(s)'
500 loops, best of 5: 901 usec per loop

However, an important note I want to make here is that Stef's approach still involves O(N^2) behaviour (because of the slicing) - it just takes longer to show up:

$ python -m timeit -s 'import q75690334' -s 's = q75690334.make_test_string(10**4)' -- 'q75690334.solve_stef(s)'
1 loop, best of 5: 215 msec per loop
$ python -m timeit -s 'import q75690334' -s 's = q75690334.make_test_string(10**4)' -- 'q75690334.solve_alain(s)'
20 loops, best of 5: 14.3 msec per loop
$ python -m timeit -s 'import q75690334' -s 's = q75690334.make_test_string(10**4)' -- 'q75690334.solve_karl(s)'
50 loops, best of 5: 9.14 msec per loop

Here we see the stack-based approaches taking roughly 10 times as long for 10 times as much input, but the improved slicing approach taking nearly 40 times as long. (Asymptotically, it should end up taking 100 times as long; but inputs long enough to make that clear would take far too long to test.)

I also tried using `itertools.islice` rather than directly slicing `result[1:]` at the end. Although this avoids taking the time to copy the list, the `join` method iterating over the `islice` object is less efficient than a list - apparently by enough that the code becomes slower overall. — Karl Knechtel, Mar 10 '23 at 03:04
islice makes no sense when you know that [join builds a list anyway](https://stackoverflow.com/a/9061024/12671057). If you do want something faster than the list slice, simply use `''` instead of `None` and join the whole list. But it's probably gonna be insignificant anyway. I might do `result += c`, not sure about speed but I like it. — Kelly Bundy, Mar 12 '23 at 05:54
Ah, I assume the reason `.join` makes two passes is so it can compute the length and allocate memory first? As for `result += c`, it would need to be `result += [c]`, and then I suspect the temporary list overhead > the method lookup overhead. — Karl Knechtel, Mar 12 '23 at 06:07
Yes, that's why two passes. No, `result += c` works fine, and you should know :-). Using `''` and not slicing seems to make it ~2% faster (and of course it's more memory-efficient, not that that matters). — Kelly Bundy, Mar 12 '23 at 06:27
Ah, right. `str` is explicitly excluded as a compatible type for `+`, but not for `+=`. Good thinking about using a dummy element that doesn't interfere with the `.join`, though. — Karl Knechtel, Mar 12 '23 at 20:44
I tried that change and didn't reliably get an improvement. It's probably insignificant either way, although it does clearly reduce temporary memory usage. I would have to dig deep into a profiler to get more insight, probably. — Karl Knechtel, Mar 12 '23 at 20:54
Yes, it's insignificant and hard to show in a benchmark reliably. Anyway, [more microoptimizations](https://trinket.io/python3/86e3533dd2) make it ~1.6 times faster. — Kelly Bundy, Mar 12 '23 at 22:18
At this point I think you can write your own answer building off of mine. I certainly wouldn't have expected there was that much more available to do. — Karl Knechtel, Mar 13 '23 at 17:34
Meh... It became too much too ugly code. And I think comparing the characters with `is` is not guaranteed to work, only an implementation detail. And I'm sure yours is plenty fast enough for them. If I post an answer, it'll be something without a Python loop. Btw, how did your `pop(-1)` happen, i.e., why not `pop()`? Mistake or optimization attempt? — Kelly Bundy, Mar 13 '23 at 17:52

score 0 · Answer 4 · answered Mar 10 '23 at 00:20

I think that in this case it is easier to think of appending new letters to the right of the existing string one by one, and if at any point we encounter two identical characters at the end, then we remove both of them and add the next character in alphabetical order.

In this way You will avoid indexing errors and copying the entire string after each modification.

Also, notice that since each operation reduces the length of the result, the complexity will be O(len(s)).

def solve(s):
    def next_char(char):
        if char == "z":
            return "a"
        else:
            return chr(ord(char)+1)
        
    stack = []
    for char in s:
        while len(stack) > 0 and stack[-1] == char:
            stack.pop()
            char = next_char(char)
        stack.append(char)        

    return ''.join(stack)

Cary Swoveland · Answer 5 · 2023-03-10T16:56:18.477

I wish to offer a solution in Ruby. Considering that Ruby and Python have many similarities most readers should be able to at least get a gist of what I am doing. I am posting this answer in the chance that a reader might look upon it favourably and post an equivalent, if not improved-upon, solution in Python.

First create a constant holding a regular expression.

RGX = /(.)\1/

This expression matches a pair of the same character (e.g., "aa"). . matches any character (other than a line terminator), which is saved to capture group 1. \1 matches the content of capture group 1.

Next I will define a constant H that holds a hash that maps pairs of letters into the "successor" of both, such as "aa"->"b" and "zz"->"a".

a = ('a'..'z').to_a
  #=> ["a", "b",..., "y", "z"]

H = a.map { |c| c*2 }.zip(a.rotate).to_h
  #=> {"aa"=>"b", "bb"=>"c",..., "yy"=>"z", "zz"=>"a"}

We may now define a method that takes as its argument the string and returns the desired string.

def doit(str)
  loop do
    break str if str.sub!(RGX, H).nil?
  end
end

Let's try it.

doit("barrsxffzzdggh")
  #=> "batxgadi"

It can be seen that value of str at each iteration is as follows.

bassxffzzdggh
batxffzzdggh
batxgzzdggh
batxgadggh
batxgadhh
batxgadi

I will now break down each step of the method.

First create an array containing the letters of the alphabet.

a = ('a'..'z').to_a
   #=> ["a", "b",..., "y", "z"]

'a'..'z' denotes a range and the method Range#to_a maps it into an array.

The hash H is constructed in four steps.

x = a.map { |c| c*2 }
  #=> ["aa", "bb", "cc",..., "yy", "zz"]
b = a.rotate
  #=> ["b", "c",..., "z", "a"]
c = x.zip(b)
  #=> [["aa", "b"], ["bb", "c"],..., ["yy", "z"], ["zz", "a"]]
H = c.to_h
  #=> {"aa"=>"b", "bb"=>"c",..., "yy"=>"z", "zz"=>"a"}

These steps use the methods Array#map, String#*, Array#rotate, Array#zip and Array#to_h.

Now assign the given string to the variable str.

str = "barrsxffzzdggh"

Ruby's method Kernel#loop more-or-less loops to its end statement until the break keyword is encountered. The single statement within the loop is the following.

    str.sub!(RGX, H)
      #=> "bassxffzzdggh"

This uses the form of Ruby's method String#sub! which matches substrings with its first argument (a string or a regular expression), and uses its second argument, a hash, for making substitutions. This method modifies the string in place. If a replacement is made the string is returned; else nil is returned. (Ruby has a non-destructive counterpart to this methods, sub, and methods gsub! and gsub for making multiple replacements.)

Initially an attempt is made to match str with the regular expression RGX. It matches "rr". As H["rr"] #=> "s", "s" is substituted for "rr" and the modified string is returned (as well as changed in place), so we repeat the loop. This continues until sub! returns nil, at which time we are finished, so we break out of the loop and return the current contents of str.

Unfortunately, the order of replacements is important; and using a regular expression to make multiple substitutions at once, will make them in the wrong order. I tried re-implementing this in Python and found that using it with the input `'aabb'` will result in `'bc'` (each pair is substituted independently), but the established correct result is `'cb'` (the `'aa'` is collapsed first, producing a `'b'`, then the first `'bb'` of `'bbb'` is collapsed). Unless Ruby's `.sub` only makes one substitution? — Karl Knechtel, Mar 13 '23 at 17:48
@Karl, yes, `sub!`, which modifies the string in place, returns after making at most one substitution. `sub` does the same but returns a new string and leaves the original string unchanged. `gsub!` and `gsub` are used multiple substitutions are desired... — Cary Swoveland, Mar 13 '23 at 19:23
...As I understand, in Python, `sub` takes an optional fourth argument that specifies the maximum number of substitutions that can be made. If so, I assume that could be set equal to one. Unless Python has a comparable method that modifies the string in place you would have to save the string length before making the replacement (`str = re.sub(...)`) and return the string if `re.sub` did not reduce the length of the string. — Cary Swoveland, Mar 13 '23 at 19:23
Ah. Python strings are immutable, and one-at-a-time replacements will have to retry the search from the beginning every time, so an approach like this is bound to be slow if it leverages a regex implementation written in C. — Karl Knechtel, Mar 13 '23 at 20:56
@Karl, there certainly is a trade-off between efficiency and transparency. My algorithm, whether implemented in Ruby or in Python, destructive or not, is easy understand and relatively easy to test. I would think a more efficient algorithm is likely to be fairly obtuse, in part because the current index into the string can decrease as well as increase. So it comes down to the importance of time-efficiency, which depends in part on string length, though this is likely just an academic exercise. — Cary Swoveland, Mar 14 '23 at 00:32

Replace two adjacent duplicate characters in string with the next character in alphabet

5 Answers5