2

I am trying to find the time complexity of str.replace() built into python, and this is the data I've managed to gather (here and on other sites):

I know replace() is based on the Boyer–Moore algorithm, which takes worst-case time of O(n*m) to find a substring, but is this for a single substring?

Does replace() return a copy of the "fixed" string when it finds the first substring and then start to search again?

And what about when we have multiple occurrences of a substring, like the following example:

old_string = '192.168.1.1'
new_string = old_string.replace('.', '|')

If it can only replace one substring at a time, then for a single substring we get O(n*m), times the number of substrings which is a maximum of n/m. This comes to O(n^2)!

Given that a simple loop would take O(n), like:

old_string = '192.168.1.1'
new_string = []
for ch in old_string:
    new_string.append('|' if ch == '.' else ch)

Does that make sense? Am I missing something?

Could the built in replace() be flawed for multiple replacements, or is it implemented in a way to continue where it left off?

martineau
  • 119,623
  • 25
  • 170
  • 301
Oron Werner
  • 1,211
  • 1
  • 7
  • 14
  • 1
    Does this answer your question? [What is the Big O notation for the str.replace function in Python?](https://stackoverflow.com/questions/35583983/what-is-the-big-o-notation-for-the-str-replace-function-in-python) – jidicula Feb 11 '21 at 21:48
  • 1
    This was one of the threads I read during the search for an answer. However, unfortunately, the listed url no longer work and I could not verify anywhere that indeed replace() copy the string after each replacement. Also, a comment to this answer from 2020 suggests that due to the length of the string getting shorter as you move forward - the worst case is not n*n. If you have any other reference that shows this is how replace() works, it will be amazing! – Oron Werner Feb 11 '21 at 22:09
  • 1
    [Here is the source code for the `replace` function](https://github.com/python/cpython/blob/ea46579067fd2d4e164d6605719ffec690c4d621/Objects/unicodeobject.c#L10835), you can use this to figure it out. – SethMMorton Feb 11 '21 at 22:21
  • 2
    @jidicula It may be an answer the question, but that answer is wrong. – btilly Feb 11 '21 at 22:36

1 Answers1

5

The worst case is O(n*(m1 + m2/m1)) where n is the length of the string, m1 is the length of the searched for string and m2 is the length of the replacement.

The average case is O(n * (1 + m2/m1)).

In principle the algorithm looks like this:

initialize data structures.     # max time O(n)
while find next match:          # max time O(n*m1)
    copy unchanged string.      # max time O(n)
    copy replacement            # max time O((n/m1) * m2) + O(n)
copy rest of the string         # max time O(n)

There are a lot of details. (For example they have to manage memory, and take fast paths for cases like the replacement is the size of the original.) But here is an explanation of each step and why it takes that time.

  1. You are initializing a data structure to take the result. This initialization is fast, but it is O(n) data to initialize so time O(n).
  2. Finding all matches is worst case that for each character you compare m1-1 characters forward, fail to match the last, back up and try again. Hence that can be O(n*m1).
  3. Copying O(n) data takes O(n) time.
  4. There can be at most O(n/m1) matches, each of which we copy m2 data for. However we can also exceed the size we allocated to put data in. In that case we have to create a new place to put data, copy over what we did, then proceed. The thresholds for resizing are chosen so that the total cost has a maximum O(n) time cost.
  5. There can be at most O(n) data after the last match.

Add those together and absorb O(n) terms into O(n*m1) and you get the original estimate.

Going back to the average case, the string search usually won't go to near the end of the substring before falling back. Most letters don't match. Most of the time if the first letter matches, the second one does not. And so on. So the search is usually going to be O(n). Drop that out and you get to the other estimate.

btilly
  • 43,296
  • 3
  • 59
  • 88
  • Thank you so much for the detailed explanation! I've accepted your answer. Just out of curiosity, what would be the worst case for this algorithm? Is it when ```m1 = n/2```? Thanks! – Oron Werner Feb 13 '21 at 10:48
  • @OronW It would be looking for something of the form AA...AB in a string of the form AA...A in a string whose representation was some kind of Unicode that blocks the use of Boyer-Moore. – btilly Feb 14 '21 at 05:23