42

I'm wondering if :

a = "abcdef"
b = "def"
if a[3:] == b:
    print("something")

does actually perform a copy of the "def" part of a somewhere in memory, or if the letters checking is done in-place ?

Note : I'm speaking about a string, not a list (for which I know the answer)

Fred
  • 901
  • 1
  • 8
  • 16
  • 4
    `a[3:]` creates a new object using new memory space. – Klaus D. Nov 17 '20 at 08:11
  • @KlausD. are you sure about that? I've always known that slicing has the advantage of saving only references to the original element and even though saving only references is an action that occupies memory too, it is not comparable to talking about a copy of the elements. Can you explain a little bit more? – lorenzozane Nov 17 '20 at 08:14
  • 1
    @KlausD. - Its an interesting question, though. Its a new object, but does it reference the same physical memory as the first string or is the data itself copied? I'm 99% sure python copies the data even though one would think that slices are just views into the immutable string memory, but I can't find a reference. – tdelaney Nov 17 '20 at 08:14
  • It's easy to test: make a slice, change the original, see that the slice has not changed. (Doesn't work on stings since they are not mutable.) – Klaus D. Nov 17 '20 at 08:20
  • 2
    @LorenzoZane "slicing" is whatever the type defines it to be. In the case of strings, it creates a new string object. The runtime is free to optimize strings in various ways since they are immutable, but slicing just makes a copy. – juanpa.arrivillaga Nov 17 '20 at 08:44
  • @juanpa.arrivillaga Got it, thanks! – lorenzozane Nov 17 '20 at 08:47
  • 4
    Making a reference to a substring of the existing string (without copying) would save on memory in some circumstances but not others; for example, if you have a huge string and keep only a small substring, then the huge string can be garbage-collected only if the substring is a copy. – kaya3 Nov 17 '20 at 08:47
  • @kaya3 yes, for the specialized use case, `memoryview` objects exist. – juanpa.arrivillaga Nov 17 '20 at 08:51
  • @user2357112 I think this should be duped in the other direction. – wim Nov 18 '20 at 15:36
  • @wim I think it would be better to ask a mod to [merge](https://meta.stackexchange.com/q/158066/378331) them. – Georgy Nov 18 '20 at 17:43

2 Answers2

40

String slicing makes a copy in CPython.

Looking in the source, this operation is handled in unicodeobject.c:unicode_subscript. There is evidently a special-case to re-use memory when the step is 1, start is 0, and the entire content of the string is sliced - this goes into unicode_result_unchanged and there will not be a copy. However, the general case calls PyUnicode_Substring where all roads lead to a memcpy.

To empirically verify these claims, you can use a stdlib memory profiling tool tracemalloc:

# s.py
import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()
a = "." * 7 * 1024**2  # 7 MB of .....   # line 6, first alloc
b = a[1:]                                # line 7, second alloc
after = tracemalloc.take_snapshot()

for stat in after.compare_to(before, 'lineno')[:2]:
    print(stat)

You should see the top two statistics output like this:

/tmp/s.py:6: size=7168 KiB (+7168 KiB), count=1 (+1), average=7168 KiB
/tmp/s.py:7: size=7168 KiB (+7168 KiB), count=1 (+1), average=7168 KiB

This result shows two allocations of 7 meg, strong evidence of the memory copying, and the exact line numbers of those allocations will be indicated.

Try changing the slice from b = a[1:] into b = a[0:] to see that entire-string-special-case in effect: there should be only one large allocation now, and sys.getrefcount(a) will increase by one.

In theory, since strings are immutable, an implementation could re-use memory for substring slices. This would likely complicate any reference-counting based garbage collection process, so it may not be a useful idea in practice. Consider the case where a small slice from a much larger string was taken - unless you implemented some kind of sub-reference counting on the slice, the memory from the much larger string could not be freed until the end of the substring's lifetime.

For users that specifically need a standard type which can be sliced without copying the underlying data, there is memoryview. See What exactly is the point of memoryview in Python for more information about that.

wim
  • 338,267
  • 99
  • 616
  • 750
2

Possible talking point (feel free to edit adding information).

I have just written this test to verify empirically what the answer to the question might be (this cannot and does not want to be a certain answer).

import sys

a = "abcdefg"

print("a id:", id(a))
print("a[2:] id:", id(a[2:]))
print("a[2:] is a:", a[2:] is a)

print("Empty string memory size:", sys.getsizeof(""))
print("a memory size:", sys.getsizeof(a))
print("a[2:] memory size:", sys.getsizeof(a[2:]))

Output:

a id: 139796109961712
a[2:] id: 139796109962160
a[2:] is a: False
Empty string memory size: 49
a memory size: 56
a[2:] memory size: 54

As we can see here:

  • the size of an empty string object is 49 bytes
  • a single character occupies 1 byte (Latin-1 encoding)
  • a and a[2:] ids are different
  • the occupied memory of each a and a[2:] is consistent with the memory occupied by a string with that number of char assigned
lorenzozane
  • 1,214
  • 1
  • 5
  • 15
  • 1
    Thanks, tried it with longer strings to ensure the consistency. You seems totally right ! – Fred Nov 17 '20 at 08:52
  • This does not prove that Python isn't using the same memory for the substring. The underlying PyObjects could, in theory, reuse memory without contradicting anything you've shown here (note that id being related to memory address is an implementation detail of CPython, and they refer to the address of the PyObject header anyway, not necessarily the memory locations of the underlying data storage for a unicode object) – wim Nov 17 '20 at 09:00
  • Indeed, I'm changing the answer flag (but thanks anyway to Lorenzo Zane, my mind changed with yours) – Fred Nov 17 '20 at 09:04
  • @wim As I have written clearly, my answer is meant to be a cue for discussion. Not something certain, because I'm not able to give a certain answer. I would like you to discuss it for possible improvement. – lorenzozane Nov 17 '20 at 09:05
  • @LorenzoZane Yes that's what I'm doing. – wim Nov 17 '20 at 09:22