105

I am working on a problem out of CTCI.

The third problem of chapter 1 has you take a string such as

'Mr John Smith '

and asks you to replace the intermediary spaces with %20:

'Mr%20John%20Smith'

The author offers this solution in Python, calling it O(n):

def urlify(string, length):
    '''function replaces single spaces with %20 and removes trailing spaces'''
    counter = 0
    output = ''
    for char in string:
        counter += 1
        if counter > length:
            return output
        elif char == ' ':
            output = output + '%20'
        elif char != ' ':
            output = output + char
    return output

My question:

I understand that this is O(n) in terms of scanning through the actual string from left to right. But aren't strings in Python immutable? If I have a string and I add another string to it with the + operator, doesn't it allocate the necessary space, copy over the original, and then copy over the appending string?

If I have a collection of n strings each of length 1, then that takes:

1 + 2 + 3 + 4 + 5 + ... + n = n(n+1)/2

or O(n^2) time, yes? Or am I mistaken in how Python handles appending?

Alternatively, if you'd be willing to teach me how to fish: How would I go about finding this out for myself? I've been unsuccessful in my attempts to Google an official source. I found https://wiki.python.org/moin/TimeComplexity but this doesn't have anything on strings.

smci
  • 32,567
  • 20
  • 113
  • 146
user5622964
  • 1,043
  • 2
  • 8
  • 6
  • 18
    Someone should tell the author about `urllib.urlencode` – wim Nov 30 '15 at 21:08
  • 11
    @wim It's meant to be a practice problem about arrays and strings – user5622964 Nov 30 '15 at 21:09
  • 3
    The purpose of the book is to teach interview questions, which commonly ask you to re-invent the wheel to see the interviewee's thought process. – James Wierzba Nov 30 '15 at 21:09
  • @RNar I guess that's what I'm asking: When appending to a string, what's actually happening under the hood? How can a copy take constant time? – user5622964 Nov 30 '15 at 21:13
  • 1
    Since it is Python, I think doing an `rtrim` and `replace` would be more preferred and in the ballpark of `O(n)`. Copying over strings does seem the least efficient way. – OneCricketeer Nov 30 '15 at 21:13
  • 2
    @RNar Can you explain how a copy can take constant time? – James Wierzba Nov 30 '15 at 21:13
  • @cricket_007 If this were IRL and I didn't have urlencode for whatever reason, I'd apply a trim and then string replace -- but this is an interview question where they are likely to say something like "Do it without using string replace or any external libraries like urlencode" – user5622964 Nov 30 '15 at 21:14
  • @user5622964: it's not because you do `a+b`, it means Python will *immediately* make a copy and do the appending: one can do lazy programming: simply remember you have to do this with the operands, and when it you finally need the value, take a look how it is organized and select the most effecient way to evaluate `a+b+c+d+...+z`. – Willem Van Onsem Nov 30 '15 at 21:20
  • 1
    @JamesWierzba, copy part of immutable string can take constant time, if it not performs actual copy, while not modify operations requested. Just save 2 pointers (head, length). – vp_arth Dec 01 '15 at 15:26
  • You can use the ctypes standard library's mutable string buffers.http://stackoverflow.com/a/26172377/2099608 – FazeL Dec 02 '15 at 20:27
  • Instead of incrementing `counter` manually you should use `for counter, char in enumerate(string)` ([documentation for `enumerate`](https://docs.python.org/3/library/functions.html#enumerate)). – Cristian Ciupitu Mar 18 '19 at 20:56
  • FYI, I think CTCI is the book "Cracking the Code Interview" by Gayle Laakmann McDowell. – mherzog Oct 16 '22 at 15:01

4 Answers4

98

In CPython, the standard implementation of Python, there's an implementation detail that makes this usually O(n), implemented in the code the bytecode evaluation loop calls for + or += with two string operands. If Python detects that the left argument has no other references, it calls realloc to attempt to avoid a copy by resizing the string in place. This is not something you should ever rely on, because it's an implementation detail and because if realloc ends up needing to move the string frequently, performance degrades to O(n^2) anyway.

Without the weird implementation detail, the algorithm is O(n^2) due to the quadratic amount of copying involved. Code like this would only make sense in a language with mutable strings, like C++, and even in C++ you'd want to use +=.

Marco Bonelli
  • 63,369
  • 21
  • 118
  • 128
user2357112
  • 260,549
  • 28
  • 431
  • 505
  • 2
    I am looking at the code you linked... it looks like a big part of that code is cleaning up / removing pointers/references to the string being appended, correct? And then towards the end it performs `_PyString_Resize(&v, new_len)` to allocate the memory for the concatenated string, and then `memcpy(PyString_AS_STRING(v) + v_len, PyString_AS_STRING(w), w_len);` which does the copy. If in-place resizing fails, it does `PyString_Concat(&v, w);` (I assume this means when the contiguous memory at the end of the original string address isn't free). How does this show the speedup? – user5622964 Nov 30 '15 at 21:25
  • I ran out of space in my previous comment, but my question there is whether or not I am understanding that code correctly and how to interpret the memory usage/runtimes of those pieces. – user5622964 Nov 30 '15 at 21:27
  • 1
    @user5622964: Whoops, misremembered the weird implementation detail. There's no efficient resize policy; it just calls `realloc` and hopes for the best. – user2357112 Nov 30 '15 at 21:35
  • How does `memcpy(PyString_AS_STRING(v) + v_len, PyString_AS_STRING(w), w_len);` work? According to http://www.cplusplus.com/reference/cstring/memcpy/ it has definition `void * memcpy ( void * destination, const void * source, size_t num );` and description: `"Copies the values of num bytes from the location pointed to by source directly to the memory block pointed to by destination."` The num in this case is the size of the appending string, and source is the address of the second string, I assume? But then why is the destination (first string) + len(first string)? Double memory? – user5622964 Nov 30 '15 at 21:44
  • Does passing an object into memcpy give the address of that object or something like that? Edit: Actually that might make sense, since if it returns the address, then adding the length is telling it to start copying blocks at the end of v. Are strings stored contiguously in memory? Is `PyString_Concat(&v, w);` a full re-copy of strings `v` and `w`? – user5622964 Nov 30 '15 at 21:46
  • 8
    @user5622964: That's pointer arithmetic. If you want to understand the CPython source code down to the weird implementation details, you'll need to know C. The super-condensed version is that `PyString_AS_STRING(v)` is the address of the first string's data, and adding `v_len` gets you the address right after the string's data ends. – user2357112 Nov 30 '15 at 21:49
  • Repeated `realloc` to append is O(n) as long as each allocation is twice (or really any multiple) of the previous length. – Ben Jackson Dec 01 '15 at 08:53
  • @BenJackson: Python doesn't do that, though, since it'd require extra space for all strings in the more important, general case where you aren't concatenating strings with `+` or `+=` in a loop. This is just an opportunistic optimization, not an explicit attempt to guarantee O(n) performance for this string concatenation pattern. – user2357112 Dec 01 '15 at 18:12
  • Don't try this in Java! – wrschneider Nov 17 '21 at 15:21
53

The author relies on an optimization that happens to be here, but is not explicitly dependable. strA = strB + strC is typically O(n), making the function O(n^2). However, it is pretty easy to make sure it the whole process is O(n), use an array:

output = []
    # ... loop thing
    output.append('%20')
    # ...
    output.append(char)
# ...
return ''.join(output)

In a nutshell, the append operation is amortized O(1), (although you can make it strong O(1) by pre-allocating the array to the right size), making the loop O(n).

And then the join is also O(n), but that's okay because it is outside the loop.

njzk2
  • 38,969
  • 7
  • 69
  • 107
27

I found this snippet of text on Python Speed > Use the best algorithms and fastest tools:

String concatenation is best done with ''.join(seq) which is an O(n) process. In contrast, using the '+' or '+=' operators can result in an O(n^2) process because new strings may be built for each intermediate step. The CPython 2.4 interpreter mitigates this issue somewhat; however, ''.join(seq) remains the best practice

gsamaras
  • 71,951
  • 46
  • 188
  • 305
OneCricketeer
  • 179,855
  • 19
  • 132
  • 245
4

For future visitors: Since it is a CTCI question, any reference to learning urllib package is not required here, specifically as per OP and the book, this question is about Arrays and Strings.

Here's a more complete solution, inspired from @njzk2's pseudo:

text = 'Mr John Smith'#13 
special_str = '%20'
def URLify(text, text_len, special_str):
    url = [] 
    for i in range(text_len): # O(n)
        if text[i] == ' ': # n-s
            url.append(special_str) # append() is O(1)
        else:
            url.append(text[i]) # O(1)

    print(url)
    return ''.join(url) #O(n)


print(URLify(text, 13, '%20'))
geekidharsh
  • 3,549
  • 1
  • 20
  • 24