
Strings in Python are immutable and support the buffer interface. It could be efficient to return, instead of new strings, buffers pointing into parts of the old string when slicing or calling the .split() method. However, a new string object is constructed each time. Why? The only reason I see is that it would make garbage collection a bit more difficult.
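To make the premise concrete: in CPython today, slicing a string really does allocate a full, independent copy of the data. A quick sketch (sizes are CPython-specific):

```python
import sys

s = "x" * 1_000_000
t = s[:500_000]   # a brand-new string object owning its own ~500 KB of data
# t is not a view into s; deleting s would not affect t at all
```

Here `sys.getsizeof(t)` is roughly half of `sys.getsizeof(s)`, confirming the slice carries its own storage.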

True: in typical situations the memory overhead is linear and not noticeable. Copying is fast, and so is allocation. But Python already does so much for us that maybe such buffers would be worth the effort?

EDIT:

It seems that forming substrings this way would make memory management much more complicated. A simple example is the case where only 20% of an arbitrary string is referenced, yet we cannot deallocate the rest of it. We could improve the memory allocator so that it could deallocate strings partially, but that would probably be a net loss. In any case, all the standard functions can be emulated with buffer or memoryview if memory becomes critical. The code wouldn't be as concise, but one has to give up something in order to get something.
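For reference, the memoryview emulation mentioned above looks roughly like this in Python 3 (it works on bytes, not str):

```python
data = b"hello world"
view = memoryview(data)[6:11]   # zero-copy: references data's underlying buffer
# the view holds a reference to the bytes object, so it stays alive
# even after the original name goes away
del data
```

Note the trade-off the EDIT describes: the view pins the entire original bytes object in memory for as long as it exists.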

gukoff
  • How would you return parts of a string? You mean pointers to it? What happens to the child string if the original string is deleted? – Ashwini Chaudhary Aug 04 '13 at 10:45
  • What happens to the buffer if the original object is deleted? I think the garbage collector is smart enough and wouldn't delete the original object. – gukoff Aug 04 '13 at 10:48
  • I believe your question is a duplicate of [If strings are immutable in .NET, then why does Substring take O(n) time?](http://stackoverflow.com/questions/6742923/if-strings-are-immutable-in-net-then-why-does-substring-take-on-time). The same arguments are valid for Python. – Bakuriu Aug 04 '13 at 11:57

3 Answers


That's how slices work. Slices always perform a shallow copy, allowing you to do things like

>>> x = [1,2,3]
>>> y = x[:]

Now it would be possible to make an exception for strings, but is it really worth it? Eric Lippert blogged about his decision not to do that for .NET; I guess his argument is valid for Python as well.
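A quick illustration of the shallow-copy behavior: the slice builds a new outer list, but the inner objects are shared between the copies.

```python
x = [[1], [2, 3]]
y = x[:]          # new outer list, same inner list objects
y[0].append(99)   # mutating a shared inner list is visible through x too
```

After this, `x[0]` is `[1, 99]` even though only `y` was touched, while `x` and `y` themselves remain distinct lists.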

See also this question.

Tim Pietzcker
  • Yes, this is very useful when iterating over a mutable object, for example. But here, for strings? It wouldn't violate consistency. – gukoff Aug 04 '13 at 10:49
  • @Harold: True, but perhaps it's not worth the effort. I've edited my answer. – Tim Pietzcker Aug 04 '13 at 10:53
  • It seems that Eric tried to make a serious optimization, even for concatenations. So the StringBuilder class really does eliminate the need for such complex optimizations in C#. – gukoff Aug 04 '13 at 11:48

The underlying string representation is null-terminated, even though it also stores the length; hence you cannot have a string object that references a sub-string unless that sub-string is a suffix. This already limits the usefulness of your proposal, since it would add a lot of complexity to handle suffixes and non-suffixes differently (and giving up null-terminated strings has consequences of its own).

Allowing references to sub-strings of a string would complicate garbage collection and string handling considerably. For every string you'd have to keep track of how many objects refer to each character, or to each range of indices. This means complicating the struct of string objects and every operation that deals with them, which would probably mean a big slowdown.

Add the fact that, starting with Python 3, strings have 3 different internal representations, and things become too messy to be maintainable; your proposal probably doesn't offer enough benefit to be accepted.
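The "3 different internal representations" are PEP 393's flexible string representation: CPython 3.3+ stores each string with 1, 2, or 4 bytes per code point depending on its contents. The difference can be observed indirectly (exact byte counts are CPython-specific):

```python
import sys

one_byte  = sys.getsizeof("a" * 100)           # ASCII/Latin-1 kind: 1 byte/char
two_byte  = sys.getsizeof("\u0394" * 100)      # BMP (UCS-2) kind: 2 bytes/char
four_byte = sys.getsizeof("\U0001F600" * 100)  # astral (UCS-4) kind: 4 bytes/char
```

The sizes grow in that order, so a sub-string view scheme would have to cope with a parent and child possibly wanting different kinds.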


Another problem with this kind of "optimization" arises when you want to deallocate "big strings":

a = "Some string" * 10 ** 7
b = a[10000]
del a

After these operations you have the substring b, which prevents a, a huge string, from being deallocated. Surely you could copy small strings, but what if b = a[:10000] (or some other big number)? 10000 characters looks like a big string which ought to use the optimization to avoid copying, yet it would prevent the release of megabytes of data. The garbage collector would have to keep checking whether it is worth deallocating a big string object and making copies, and all these checks must be as fast as possible, otherwise you end up degrading time performance.
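Today's copying semantics avoid this retention problem entirely: the slice is an independent object, so the big string can be freed at once. A sketch (scaled down to roughly 11 MB to keep it light):

```python
import sys

a = "Some string" * 10**6   # ~11 MB of character data
b = a[:10000]               # in CPython this is a genuine, independent copy
del a                       # ...so the ~11 MB can be reclaimed immediately
```

`b` keeps only its own ~10 KB; no bookkeeping is needed to decide when the parent may be released.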

99% of the time the strings used in programs are "small" (at most ~10k characters), so copying is really fast, while the optimization you propose only starts to pay off with really big strings (e.g. taking substrings of size 100k from huge texts) and is much slower with really small strings, which is the common case, i.e. the case that should be optimized.


If you think this is important, you are free to propose a PEP, show an implementation, and demonstrate the resulting changes in speed/memory usage. If it is really worth the effort, it may be included in a future version of Python.

Bakuriu
  • I don't think OP's proposal is a good idea, but for different reasons ("don't shoot the messenger"). The first paragraph is interesting, and certainly a problem with tacking this onto CPython as-is, but it's not a fundamental problem, and it wouldn't be the first time the string representation changed greatly. The second paragraph assumes a naive implementation. You could just have the sub-string refer to the string object proper and store a start and length/end (and this representation is perfectly ignorant of the string object's guts, so the third paragraph disappears in a puff of logic). – Aug 04 '13 at 11:39
  • The point about the null-terminated representation is pretty cogent. But it is interesting: is this style useful at all? For functions like strcpy, maybe, but those could be replaced with functions like strncpy... – gukoff Aug 04 '13 at 11:46
  • @Harold I don't think it's for `strcpy`, as the length is available it's simpler and faster to just `memcpy` - likewise for other standard functions. It's probably to avoid a copy when turning it into a C string for third party/client code (see `PyBytes_AsString` and the like). –  Aug 04 '13 at 11:53
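The representation the first comment suggests can be sketched as a view object that stores an ordinary reference to the parent plus a start and length, so no per-character or per-range refcounting is needed (a hypothetical illustration; `SubstringView` is not a real CPython type):

```python
class SubstringView:
    """Hypothetical substring: a parent reference plus (start, length).

    The parent stays alive through its ordinary reference count held by
    this object; no per-character bookkeeping is required.
    """
    def __init__(self, parent, start, length):
        self.parent = parent
        self.start = start
        self.length = length

    def __str__(self):
        # materialize a real string only when one is actually needed
        return self.parent[self.start:self.start + self.length]

v = SubstringView("hello world", 6, 5)
```

The downside is exactly the one discussed above: `v` pins the whole parent string in memory.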

If you are worried about memory (in the case of really large strings), use a buffer() (Python 2):

>>> a = "12345"
>>> b = buffer(a, 2, 2)
>>> b
<read-only buffer for 0xb734d120, size 2, offset 2 at 0xb734d4a0>
>>> print b
34
>>> print b[:]
34

Knowing about this gives you alternatives to string methods such as split().
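In Python 3, buffer() is gone; for bytes objects, memoryview plays the same role (note it does not work on str):

```python
a = b"12345"
b = memoryview(a)[2:4]   # read-only, zero-copy view covering two bytes
```

`bytes(b)` materializes the viewed region (here `b"34"`) only when a real bytes object is needed.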

If you want to split() a string, but keep the original string object (as you maybe need it), you could do:

def split_buf(s, needle):
    start = 0
    add = len(needle)
    res = []
    while True:
        index = s.find(needle, start)
        if index < 0:
            break
        res.append(buffer(s, start, index - start))
        start = index + add
    res.append(buffer(s, start, len(s) - start))  # trailing segment, like split()
    return res

or, using .index():

def split_buf(s, needle):
    start = 0
    add = len(needle)
    res = []
    try:
        while True:
            index = s.index(needle, start)
            res.append(buffer(s, start, index - start))
            start = index + add
    except ValueError:
        res.append(buffer(s, start, len(s) - start))  # trailing segment
    return res
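A Python 3 port of the same idea using memoryview (for bytes objects; like the real split(), this sketch keeps the trailing segment):

```python
def split_view(s, needle):
    """Zero-copy split of a bytes object into memoryview slices."""
    view = memoryview(s)
    res = []
    start = 0
    while True:
        index = s.find(needle, start)
        if index < 0:
            break
        res.append(view[start:index])   # no copy: a slice of the view
        start = index + len(needle)
    res.append(view[start:])            # trailing segment
    return res

parts = split_view(b"12,34,56", b",")
```

Each element of `parts` references the original buffer; call `.tobytes()` on a slice when a real bytes object is required.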
glglgl
  • Yes, I know about buffers. But what if I want to use the split() method on an arbitrary string? – gukoff Aug 04 '13 at 10:50
  • @Harold You can "emulate" it to fulfill your needs; see my edit. OTOH, if you split a string and don't need it any longer, you can drop the original, freeing the memory and ending up with about the same memory footprint as before. – glglgl Aug 04 '13 at 10:58