
Say I have a very large bytes object (obtained by loading a binary file) and I want to read it piece by piece, advancing the starting position until I reach the end. I use slicing to accomplish this. I'm worried that Python will create a completely new copy each time I ask for a slice, instead of simply giving me the memory address of the position I want.

Simple example:

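from pathlib import Path

# decode_bytes comes from a third-party library; it returns the number of bytes it consumed.
data = Path("binary-file.dat").read_bytes()
total_length = len(data)
start_pos = 0

while start_pos < total_length:
    bytes_processed = decode_bytes(data[start_pos:])  # <---- ***
    start_pos += bytes_processed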

In the above example, does Python create a completely new copy of the bytes object, starting from start_pos, because of the slicing? If so, what is the best way to avoid copying the data and just pass a pointer to the relevant position in the bytes array?

user3840170
Tekz
  • slices create new `bytes` objects. Now, it would be *possible* for python not to copy the underlying buffer, and share a buffer among slices. However, in general that is not the case, and slicing will effectively copy the underlying buffer. Try this out yourself: `b = b'a'*1_000_000_000` should take about a gig of memory. Now interestingly, if you take a full slice, it doesn't seem to copy the underlying buffer, as in `b2 = b[:]`; *however*, anything else does copy, for example `b3 = b[1:]`. – juanpa.arrivillaga Jun 23 '20 at 11:49
  • If you are working with `bytes` and you want a memory-efficient way to slice them, use `memoryview`. Note, *python doesn't have pointers* – juanpa.arrivillaga Jun 23 '20 at 11:50
  • @juanpa.arrivillaga, thanks. I'm coming from other programming languages; I thought Python treated bytes as a primitive type and hence possibly made a copy (for a list holding objects of different types it would still copy the references into a new list as well). So this is not ideal in my case, as I have to traverse 12-20 MB of data in steps of 4000+ bytes, and I have a stream of such files. As you said, memoryview could be the solution. Thanks – Tekz Jun 24 '20 at 06:59
  • @jpnadas yes that gives some insight but I'm talking about bytes object, since it is immutable (where you cannot modify any element of the array) I thought slicing would not create new or copy references. – Tekz Jun 24 '20 at 07:02
  • @Tekz it is important to understand, python *doesn't have primitive types*. Everything is an object. `bytes` are immutable, so a slice could in principle share the underlying buffer instead of copying it. But python hasn't optimized for this (`numpy` arrays and `memoryview` objects do share the buffer), except, apparently, for the simple case of a full slice, where it simply *returns the same bytes object*, i.e. `x = b"abcde"; print(x[:] is x)` will print `True`. – juanpa.arrivillaga Jun 24 '20 at 08:29
  • @Tekz that other question about lists is actually totally irrelevant here. `list` objects contain other python objects. A `bytes` object is essentially an object-oriented wrapper over a primitive buffer of bytes. Although it acts as a container, it doesn't actually *contain* other python objects. Indexing returns python objects (ints, actually) and you can do membership testing with other `bytes` objects, but internally there are no references to other python objects, just a primitive buffer of bytes, a char array basically. – juanpa.arrivillaga Jun 24 '20 at 08:32
  • @juanpa.arrivillaga I could have, but I'm actually using a 3rd-party library for the `decode_bytes` function, so it accepts bytes only; I'm not sure a `numpy.ndarray` can be passed directly – Tekz Jun 24 '20 at 11:17

1 Answer


Yes, slicing a bytes object does create a copy, at least as of CPython 3.9.12. The closest the documentation comes to admitting this is in the description of the bytes constructor:

In addition to the literal forms, bytes objects can be created in a number of other ways:

  • A zero-filled bytes object of a specified length: bytes(10)
  • From an iterable of integers: bytes(range(20))
  • Copying existing binary data via the buffer protocol: bytes(obj)

which suggests any creation of a bytes object creates a separate copy of the data. But since I had a hard time finding an explicit confirmation that slicing does the same, I resorted to an empirical test.

>>> b = b'\1' * 100_000_000
>>> qq = [b[1:] for _ in range(20)]

After executing the first line, memory usage of the python3 process in top was about 100 MB. The second line executed after a considerable delay and raised memory usage to about 2 GB (20 copies of roughly 100 MB each). This seems pretty conclusive. PyPy 7.3.9 targeting Python 3.8 behaves largely the same; though of course, PyPy’s garbage collection is not as eager as CPython’s, so the memory is not freed as soon as the bytes objects become unreachable.
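If watching top is inconvenient, the same experiment can be approximated from inside the interpreter with the standard-library tracemalloc module. This is only a sketch, assuming CPython; the helper name `allocated_by` and the smaller 10 MB buffer are my own choices, not part of the original test. It also checks the full-slice case `b[:]`, which CPython appears to return without copying.

```python
import tracemalloc

def allocated_by(make):
    """Return how many bytes of Python-level allocations `make()` keeps alive.

    Hypothetical helper for this sketch; relies on CPython's tracemalloc.
    """
    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    result = make()  # keep a reference so nothing is freed before measuring
    after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return after - before

b = b"\1" * 10_000_000  # 10 MB buffer (smaller than the 100 MB used above)

# Five partial slices: each one is expected to copy almost the whole buffer.
partial_cost = allocated_by(lambda: [b[1:] for _ in range(5)])

# Five full slices: on CPython these are expected to be the same object as b.
full_cost = allocated_by(lambda: [b[:] for _ in range(5)])

print(f"b[1:] x5 allocated about {partial_cost / 1e6:.0f} MB")
print(f"b[:]  x5 allocated about {full_cost} bytes")
```

The exact numbers depend on the interpreter and version, so treat them as indicative rather than as a guarantee of the language.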

To avoid copying the underlying buffer, wrap your bytes in a memoryview and slice that:

>>> bm = memoryview(b)
>>> qq = [bm[1:] for _ in range(50)]
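
Applied to the loop from the question, this could look roughly like the sketch below. The `decode_bytes` defined here is a hypothetical stand-in (a length-prefixed record reader), since the real function comes from a third-party library that I have not seen; `process` is likewise just a name chosen for illustration.

```python
def decode_bytes(buf):
    """Hypothetical stand-in for the third-party decoder.

    Treats the first byte as a payload length n and reports 1 + n bytes consumed.
    """
    n = buf[0]
    return 1 + n

def process(data: bytes) -> int:
    """Walk through `data` record by record without copying the buffer."""
    view = memoryview(data)  # wraps the existing buffer, no copy
    start_pos = 0
    records = 0
    while start_pos < len(view):
        # Slicing a memoryview yields another view onto the same memory.
        start_pos += decode_bytes(view[start_pos:])
        records += 1
    return records

# Three records: [2, 10, 11], [0], [3, 1, 2, 3]
data = bytes([2, 10, 11, 0, 3, 1, 2, 3])
record_count = process(data)
print(record_count)
```

Whether this works with the real decoder depends on whether it accepts any object supporting the buffer protocol. If it insists on an actual `bytes` instance, a bounded window such as `bytes(view[start_pos:start_pos + 4096])` still copies, but only that window rather than the whole remainder of the file.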
user3840170
  • That's CPython, right? If it's not specified, other implementations could do it differently. – Kelly Bundy Apr 05 '22 at 09:43
  • From the [PyPy documentation](https://doc.pypy.org/en/latest/cpython_differences.html) about differences to CPython: *"equal strings may share their internal string data even if they are different objects—even a unicode string and its utf8-encoded bytes version are shared"* (talks about strings and equal, but I could imagine it shares data for bytes slices as well). – Kelly Bundy Apr 05 '22 at 09:46
  • I did the same test under PyPy, it behaved largely identically. – user3840170 Apr 05 '22 at 10:15