
What's the time complexity of slicing a Python string? Given that Python strings are immutable, I can imagine slicing them being either O(1) or O(n) depending upon how slicing is implemented.

I need to write a function that iterates over all suffixes of a (potentially big) string. I could avoid slicing the string by representing a suffix as a tuple of the entire string plus an index to start reading characters from, but that's ugly. If instead I naively write my function like this:

def do_something_on_all_suffixes(big_string):
    for i in range(len(big_string)):
        suffix = big_string[i:]
        some_constant_time_operation(suffix)

... will its time complexity be O(n) or O(n²), where n is len(big_string)?

Mark Amery
  • Do you actually care about the complexity, or whether it will be "fast enough" for you? If the latter, the best thing to do would be to try the simpler code to write; if it is fast enough, the complexity doesn't matter. – Scott Hunter Feb 03 '16 at 15:04
  • @MarkAmery would you consider building a _Suffix Array_ or is this irrelevant? This has guaranteed complexity of `O(n)` for strings with length `n` – Yannis P. Feb 03 '16 at 15:18
  • related: [`sorted(xrange(n), key=lambda i: buffer(data, i))` is used to create a suffix array](http://stackoverflow.com/a/13574862/4279) – jfs Feb 04 '16 at 11:19
  • Related: [Does Python do slice-by-reference on strings?](https://stackoverflow.com/q/5722006/364696) – ShadowRanger Mar 06 '18 at 02:31

3 Answers


Short answer: str slices, in general, copy. That means that your function that does a slice for each of your string's n suffixes is doing O(n²) work. That said, you can avoid copies if you can work with bytes-like objects using memoryviews to get zero-copy views of the original bytes data. See How you can do zero copy slicing below for how to make it work.

Long answer: (C)Python str do not slice by referencing a view of a subset of the data. There are exactly three modes of operation for str slicing:

  1. Complete slice, e.g. mystr[:]: Returns a reference to the exact same str (not just shared data, the same actual object, mystr is mystr[:] since str is immutable so there is no risk to doing so)
  2. The zero length slice and (implementation dependent) cached length 1 slices; the empty string is a singleton (mystr[1:1] is mystr[2:2] is ''), and low ordinal strings of length one are cached singletons as well (on CPython 3.5.0, it looks like all characters representable in latin-1, that is Unicode ordinals in range(256), are cached)
  3. All other slices: The sliced str is copied at creation time, and thereafter unrelated to the original str
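The three modes above can be checked directly. Note that the identity results below are CPython implementation details (the answer's caveats apply), not language guarantees:

```python
# CPython-specific sketch of the three str slicing modes.
s = "hello world"

# 1. Complete slice: the exact same object comes back, no copy.
print(s[:] is s)                 # True on CPython

# 2. Empty slices all return the interned empty-string singleton.
print(s[3:3] is s[7:7] is "")    # True on CPython

# 3. Any other slice is a copy: equal contents, distinct object.
t = s[1:]
print(t == "ello world", t is s) # True False
```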

The reason why #3 is the general rule is to avoid issues with large str being kept in memory by a view of a small portion of it. If you had a 1GB file, read it in and sliced it like so (yes, it's wasteful when you can seek, this is for illustration):

with open(myfile) as f:
    data = f.read()[-1024:]

then you'd have 1 GB of data being held in memory to support a view that shows the final 1 KB, a serious waste. Since slices are usually smallish, it's almost always faster to copy on slice than to create views. It also means str can be simpler; it needs to know its size, but it need not track an offset into the data as well.
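One way to see that a general slice owns its own data is to compare memory footprints with sys.getsizeof (exact sizes are CPython implementation details, so only the inequality is meaningful):

```python
import sys

big = "x" * (10 ** 6)   # ~1 MB of character data
small = big[-1024:]     # general slice: copies just the tail

# The copy's footprint tracks its own 1024 characters, not the megabyte.
print(sys.getsizeof(small) < sys.getsizeof(big))  # True

# Dropping the big string frees the megabyte; `small` keeps only its copy.
del big
print(len(small))  # 1024
```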

How you can do zero copy slicing

There are ways to perform view based slicing in Python, and in Python 2, it will work on str (because str is bytes-like in Python 2, supporting the buffer protocol). With Py2 str and Py3 bytes (as well as many other data types like bytearray, array.array, numpy arrays, mmap.mmaps, etc.), you can create a memoryview that is a zero copy view of the original object, and can be sliced without copying data. So if you can use (or encode) to Py2 str/Py3 bytes, and your function can work with arbitrary bytes-like objects, then you could do:

def do_something_on_all_suffixes(big_string):
    # In Py3, may need to encode as latin-1 or the like
    remaining_suffix = memoryview(big_string)
    # Rather than explicit loop, just replace view with one shorter view
    # on each loop
    while remaining_suffix:  # Stop when we've sliced to empty view
        some_constant_time_operation(remaining_suffix)
        remaining_suffix = remaining_suffix[1:]

The slices of memoryviews do make new view objects (they're just ultra-lightweight with fixed size unrelated to the amount of data they view), just not any data, so some_constant_time_operation can store a copy if needed and it won't be changed when we slice it down later. Should you need a proper copy as Py2 str/Py3 bytes, you can call .tobytes() to get the raw bytes obj, or (in Py3 only it appears), decode it directly to a str that copies from the buffer, e.g. str(remaining_suffix[10:20], 'latin-1').
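As a quick sanity check that memoryview slices really are views rather than copies: on Python 3.3+, memoryview.obj reports the underlying buffer, so you can verify a sliced view still points at the original bytes.

```python
data = b"hello world"
view = memoryview(data)
tail = view[6:]

# The sliced view still references the original bytes object; no copy made.
print(tail.obj is data)      # True

# Materialize a copy only when you actually need one.
print(tail.tobytes())        # b'world'
print(str(tail, 'latin-1'))  # 'world'
```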

ShadowRanger
  • Interesting. I wonder if there's a cunning way that Python could work around the reason you give for point #3 and get `O(1)` slices without the horrible memory implications. A question for me to pose on Programmers or CS Stack Exchange, maybe... – Mark Amery Feb 03 '16 at 15:47
  • @msw: They've [looked at `strview` as a possibility](https://mail.python.org/pipermail/python-ideas/2011-December/012993.html); nothing has come of it so far. Early proposals (without a dedicated view type) were [rejected because of the "keep-alive effect"](https://bugs.python.org/issue26077#msg258374) where huge `str` stick around due to a small slice; only explicit use of views would be considered (so the keep alive effect is explicit). – ShadowRanger Feb 03 '16 at 16:15
  • @MarkAmery: I just added a lengthy explanation on how you can (in some cases) use `memoryview` to do view slicing instead of copy slicing. If you're on Py2 with `str` or Py3 and can encode to a `bytes` object once up front, and the function can use `bytes`-like objects, that will be enough. – ShadowRanger Feb 03 '16 at 16:17
  • @msw You seem to think I *have* a clever scheme up my sleeve. I don't. It's a non-trivial problem and for all I know it may be (provably) impossible to allow `O(1)` string slicing without either [bad effects on garbage collection performance] or [inability to garbage collect giant strings that have been sliced from]. I'm not even sure what I formally mean by [bad effects on garbage collection performance] yet, let alone whether the problem I haven't formalised is solvable. – Mark Amery Feb 03 '16 at 16:47
  • [`buffer()` can be more convenient and faster on Python 2.](http://stackoverflow.com/a/13574862/4279) – jfs Feb 04 '16 at 11:26
  • @jfs: True, though the cost is that: 1) It's not portable to Py3 at all. 2) It's slower for repeatedly reducing the view (`memoryview` can repeatedly do `view = view[1:]`, `buffer` has to repeatedly call `buffer` each time, `buf = buffer(buf, 1)`, because slicing it extracts a `str` copy of the contents). 3) It's less flexible; it's a view of bytes only, and only understands offset and length, so it can't provide, say, `int`-width representations of data, or include a step in the view to skip alternating elements or the like. – ShadowRanger Jan 18 '19 at 20:40
  • If they ever address [issue 20399](https://bugs.python.org/issue20399), it should regain parity with `buffer` on the most glaring flaw (the lack of rich comparisons `<`/`>`/`<=`/`>=`). – ShadowRanger Jan 18 '19 at 20:41

It all depends on how big your slices are. I threw together the following two benchmarks. The first slices the entire string and the second only a little bit. Curve fitting with this tool gives

# s[1:-1]
y = 0.09 x^2 + 10.66 x - 3.25

# s[1:1000]
y = -0.15 x + 17.13706461

The first looks quite linear for slices of strings up to 4MB. I guess this really measures the time taken to construct a second string. The second is pretty constant, although it's so fast it's probably not that stable.

[two plots: benchmark one (s[1:-1]) growing roughly linearly with string size; benchmark two (s[1:1000]) roughly constant]

import time

def go(n):
    # Python 2 code (xrange and the print statement below);
    # on Python 3, use range and print().
    # Run with one of the two benchmark lines commented out.
    start = time.time()
    s = "abcd" * n
    for j in xrange(50000):

        # benchmark one: slice off nearly the whole string
        a = s[1:-1]

        # benchmark two: slice a fixed-size 999-character prefix
        a = s[1:1000]

    end = time.time()
    return (end - start) * 1000

for n in range(1000, 100000, 5000):
    print n/1000.0, go(n)
jozxyqk
  • Well the term `0.086 x^2 + 10.66 x - 3.25` is not very linear to me :) but yes for relatively small `n` looks linear... – Yannis P. Feb 03 '16 at 15:21
  • @YannisP. I presume (and hope!) that the `0.086 x^2` is just the curve fitter being oversensitive to random variation, rather than slicing really being an `O(n^2)` operation. Slicing being `O(n^2)` would be appalling! – Mark Amery Feb 03 '16 at 15:25
  • Fair enough. Also it would make good sense to me if Python and other programming languages, copy the memory locations of the start and the end of a string slice, rather than the entire string, thus yielding `O(1)` for the slicing – Yannis P. Feb 03 '16 at 15:27
  • @YannisP. yes... but I guess that would make the code for garbage collecting strings more complicated than simply implementing slices with copying, and possibly introduce performance problems at collection time. It's a complicated problem and I can't quite figure out at first glance whether it's possible to get `O(1)` slicing without either hurting the efficiency of garbage collection or losing the ability to garbage collect giant strings when there's still a living reference to a tiny slice of them. – Mark Amery Feb 03 '16 at 15:31

Passing around the indices is not so bad if you just pack them together in an object:

from dataclasses import dataclass

@dataclass
class StrSlice:
    str: str
    start: int
    end: int

def middle(slice):
    return slice.str[(slice.start + slice.end) // 2]

def reverse(slice):
    return slice.str[slice.start : slice.end][::-1]

def length(slice):
    return slice.end - slice.start

def chars(slice):
    yield from (slice.str[i] for i in range(slice.start, slice.end))

def equals(slice1, slice2):
    if length(slice1) != length(slice2):
        return False
    return all(c1 == c2 for c1, c2 in zip(chars(slice1), chars(slice2)))
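For instance, two overlapping suffix-style slices of the same string can be compared without ever copying a substring. The helpers are restated in full here so the snippet runs standalone:

```python
from dataclasses import dataclass

@dataclass
class StrSlice:
    str: str
    start: int
    end: int

def length(sl):
    return sl.end - sl.start

def chars(sl):
    # Yield characters one at a time; no substring copy is made.
    yield from (sl.str[i] for i in range(sl.start, sl.end))

def equals(sl1, sl2):
    if length(sl1) != length(sl2):
        return False
    return all(c1 == c2 for c1, c2 in zip(chars(sl1), chars(sl2)))

# Two overlapping "ana" slices of "banana" compare equal.
text = "banana"
print(equals(StrSlice(text, 1, 4), StrSlice(text, 3, 6)))  # True
print(length(StrSlice(text, 2, 6)))                        # 4
```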
Luke Miles