
I am aware that io.BytesIO() returns a binary stream object that uses an in-memory buffer. It also provides getbuffer(), which returns a readable and writable view (a memoryview object) over the contents of the buffer without copying them.

import io

obj = io.BytesIO(b'abcdefgh')
buf = obj.getbuffer()

Now, we know buf points to the underlying data, and slicing it (buf[:3]) returns another memoryview object, again without making a copy. So I want to know: if we call obj.read(3), does it also use the in-memory buffer, or does it make a copy? If it does use the in-memory buffer, what is the difference between obj.read and buf, and which one should be preferred for efficiently reading the data in chunks from very long bytes objects?

GIZ
Scarface

1 Answer


Simply put, BytesIO.read reads data from the in-memory buffer. It returns the data as a bytes object, which gives you a copy of the data that was read. buf, however, is a memoryview object that views the underlying buffer and does not make a copy of the data.
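As a rough illustration of the two return types (a minimal sketch using the question's own names obj and buf; the names chunk and view are just placeholders I picked), you can compare what each call hands back:

```python
import io

obj = io.BytesIO(b'abcdefgh')

# read() returns a bytes object and advances the stream position.
chunk = obj.read(3)
print(type(chunk), chunk)      # <class 'bytes'> b'abc'
print(obj.tell())              # 3

# getbuffer() returns a memoryview; slicing it gives another memoryview.
buf = obj.getbuffer()
view = buf[:3]
print(type(view), bytes(view)) # <class 'memoryview'> b'abc'

# Taking a view does not move the stream position.
print(obj.tell())              # 3
```

Note that buf[:3] always starts at the beginning of the buffer, while read(3) starts wherever the stream position currently is.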

The difference between BytesIO.read and buf is that the data returned by BytesIO.read is a copy, so nothing you do with it affects subsequent reads from the buffer, whereas if you change data through buf, you also change the data in the buffer.
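A small experiment along these lines (a sketch only; first and second are placeholder names, and the try/except is there just to show that bytes objects are immutable) makes the difference visible:

```python
import io

obj = io.BytesIO(b'abcdefgh')

first = obj.read(3)            # bytes snapshot of the first three bytes

# Writing through the memoryview changes the stream's own data.
buf = obj.getbuffer()
buf[0:3] = b'XYZ'              # replacement must have the same length

obj.seek(0)
second = obj.read(3)

print(first)                   # b'abc'  (earlier result is unaffected)
print(second)                  # b'XYZ'  (the buffer itself was modified)

# A bytes object cannot be modified in place at all.
try:
    first[0] = 0
    bytes_are_mutable = True
except TypeError:
    bytes_are_mutable = False
print(bytes_are_mutable)       # False
```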

In terms of performance, using obj.read would be a better choice if you want to read the data in chunks, because it provides a clear separation between the data and the buffer, and makes it easier to manage the buffer. On the other hand, if you want to modify the data in the buffer, using buf would be a better choice because it provides direct access to the underlying data.
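For reading in chunks, the two approaches could be sketched roughly as below. The helper names chunks_by_read and chunks_by_view are hypothetical (they are not part of the io module), and this is only one way to write it, not a benchmark:

```python
import io

def chunks_by_read(stream, size):
    """Yield successive bytes chunks, starting at the current position."""
    while True:
        chunk = stream.read(size)
        if not chunk:          # empty bytes means end of stream
            break
        yield chunk

def chunks_by_view(stream, size):
    """Yield successive memoryview slices over the whole buffer.

    This ignores and does not change the stream position.
    """
    buf = stream.getbuffer()
    for start in range(0, len(buf), size):
        yield buf[start:start + size]

data = b'abcdefgh'
read_chunks = list(chunks_by_read(io.BytesIO(data), 3))
view_chunks = [bytes(v) for v in chunks_by_view(io.BytesIO(data), 3)]

print(read_chunks)             # [b'abc', b'def', b'gh']
print(view_chunks)             # [b'abc', b'def', b'gh']
```

Keep in mind that while a view obtained from getbuffer() is alive, the BytesIO object cannot be resized or closed, so with the view-based approach you have to release the views before writing more data to the stream.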

GIZ
  • How do you know that `read` returns a copy? Can you tell where the documentation says so? – Kelly Bundy Feb 04 '23 at 14:20
  • I don't understand your performance argument. How does that separation and that easier managing make chunk-reading more *performant*? – Kelly Bundy Feb 04 '23 at 14:22
  • @KellyBundy The `read` method returns a specified number of bytes from memory and updates the position of the stream to reflect the bytes that have been read. If you run `seek(0)` and read the bytes again, the same data that was read previously is still there. The new bytes objects read from the buffer do not map directly to the ones you read previously. That is, if you change the bytes you read from the `read` method, you will not change the ones that reside in the buffer; effectively, you're getting a copy of the data. – GIZ Feb 04 '23 at 15:10
  • @KellyBundy The `read` method returns a copy of the data, and copying a large amount of data adds overhead to your reading process. memoryview objects, by contrast, do not copy the data and are mutable, so you can update them in place _without_ copying your data. This is why memoryviews are more efficient than `read` when reading large bytes objects. Hopefully you got the point now. – GIZ Feb 04 '23 at 15:15
  • About `read` copying: But how do you know that? Where is that specified? Also, how do you *"change the bytes you read from the `read` method"*? – Kelly Bundy Feb 04 '23 at 15:20
  • About performance: I asked why you're saying that `read` is better for performance and now you're saying memoryview is better for performance. Which one is it? – Kelly Bundy Feb 04 '23 at 15:22
  • @KellyBundy The Python documentation doesn't specify that explicitly, but you can verify it by looking at the standard Python implementation of the `read` method or by experimenting with those objects. As for which of `read` or `buf` is better, I mentioned `read` for reading _chunks_ of data and `buf` for accessing the memoryview of the buffer directly, which is fast and efficient for large binary processing with linear time complexity. I'm going to look for a question here that may be useful to you about memoryview objects. – GIZ Feb 04 '23 at 15:31
  • Look at this question: [What exactly is the point of memoryview in Python?](https://stackoverflow.com/questions/18655648/what-exactly-is-the-point-of-memoryview-in-python) – GIZ Feb 04 '23 at 15:35
  • Copy: So it's an implementation detail and for example PyPy might do it differently (not copy)? – Kelly Bundy Feb 04 '23 at 15:37
  • Performance: yes, *chunks* is what I asked about. I don't see why you're saying `read` is more performant for that. – Kelly Bundy Feb 04 '23 at 15:39
  • @KellyBundy Implementations could possibly vary, I have no idea how PyPy implements its `read` methods, but I would assume similarly it copies data. It's also worth noting that performance will depend on the size of the `BytesIO` object and the size of the chunks you are reading. For very large `BytesIO` objects, reading smaller chunks might be more memory efficient, while for smaller objects, reading larger chunks might be faster. – GIZ Feb 04 '23 at 15:43
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/251621/discussion-between-giz-and-kelly-bundy). – GIZ Feb 04 '23 at 15:48