0

I wrote the following code:

import time

import numpy as np

batch = np.ones([4, 3, 224, 224], dtype="float32")
s = time.time()
batch_bytes = batch.tobytes()
e = time.time()
print(f"{(e-s)*1e3} ms")

This prints 2.2954940795898438 ms.
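A single `time.time()` measurement can be noisy, so the figure above may vary a lot between runs. As a rough sketch (the repeat counts are arbitrary choices, not anything from the original post), `timeit` can average over several calls:

```python
import timeit

import numpy as np

batch = np.ones([4, 3, 224, 224], dtype="float32")

# Call tobytes() 20 times per run, do 5 runs, and keep the fastest run
# to reduce the influence of timer noise and background load.
calls_per_run = 20
runs = timeit.repeat(batch.tobytes, number=calls_per_run, repeat=5)
best_ms = min(runs) / calls_per_run * 1e3

print(f"tobytes: {best_ms:.3f} ms per call for {batch.nbytes} bytes")
```

The absolute number depends on the machine; the point is only that the cost scales with `batch.nbytes`, which is what one would expect from a copy.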

That seems like a non-trivial cost. I guess this method makes a copy of the data?

I assumed the data is already stored as bytes in memory, so shouldn't there be a way to access it directly?

So is it possible to get the bytes more efficiently?

Litchy
  • What are you hoping to do with these bytes? – hpaulj Jan 15 '22 at 17:29
  • @hpaulj One scenario is sending these bytes (for serialization) through a socket to other machines. We only need to read the bytes, not operate on them, so we hope to get them efficiently without a copy. – Litchy Jan 16 '22 at 02:50
  • You might look up `memoryview`. – hpaulj Jan 16 '22 at 03:13
  • `memoryview` provides access to an object's internals, like `bytes_array[x:y]`, which means we would still need to get the `bytes` first? Unless we can get the bytes from `memoryview(ndarray_object)`: does `ndarray` support the buffer protocol that `memoryview` relies on? – Litchy Jan 16 '22 at 03:23
  • Try the `data` attribute. It's a `memoryview`. https://stackoverflow.com/questions/69544408/numpy-array-get-the-raw-bytes-without-copying. I haven't used memoryview, but have seen some SO posts about it. – hpaulj Jan 16 '22 at 04:41
  • @hpaulj This is great. I have tried it, and we can use, for example, `base64.b64encode(array.data)` for base64 encoding (the same goes for other operations that accept bytes-like objects). This way the input is not copied. – Litchy Jan 16 '22 at 06:32

3 Answers

1

Yes, ndarray.tobytes() creates a copy of the data and stores it elsewhere in your computer's memory. This is also described in the NumPy documentation: https://numpy.org/doc/stable/reference/generated/numpy.ndarray.tobytes.html

You can easily test this out by printing the memory address of your objects.

import numpy as np
import time

batch = np.ones([4, 3, 224, 224], dtype="float32")
s = time.time()
batch_bytes = batch.tobytes()
e = time.time()
print(f"{(e-s)*1e3} ms")

print(f"Batch object address:       {hex(id(batch))}")
print(f"batch_bytes object address: {hex(id(batch_bytes))}")

Gives output of:

Batch object address:       0x7f16beab0990
batch_bytes object address: 0x7f16be491010
  • Having different object IDs is normal because the objects are of different types. This does not prove whether the objects internally share the same buffer (although in practice a copy is made). Indeed, two NumPy arrays can reference the same memory buffer while having different object IDs. This is quite frequent when using the `reshape` function or NumPy views. – Jérôme Richard Jan 15 '22 at 15:13
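To illustrate the point made in this comment, here is a minimal sketch of a check that looks at the underlying buffer rather than at `id()`. It relies on `np.shares_memory` and `np.frombuffer`, and the variable names are only for illustration:

```python
import numpy as np

batch = np.ones([4, 3, 224, 224], dtype="float32")
batch_bytes = batch.tobytes()

# reshape() on a contiguous array returns a view: a different Python
# object (different id) that still points at the same buffer.
flat = batch.reshape(-1)
print(id(flat) != id(batch), np.shares_memory(batch, flat))

# Mutate the array after tobytes() was called.
batch[0, 0, 0, 0] = 2.0

# The view sees the change; the bytes object, being a copy, does not.
first_from_bytes = np.frombuffer(batch_bytes, dtype=np.float32)[0]
print(flat[0], first_from_bytes)
```

If `tobytes()` returned a reference instead of a copy, the value read back from `batch_bytes` would have changed as well.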
1

Yes, it makes a copy, because a bytes object must own its raw data (i.e. a copy is mandatory). However, you can create a view of the NumPy array without any copy using:

batch_bytes = batch.reshape(-1).view(np.uint8)

Note that the resulting type is different (a 1D NumPy array of uint8 rather than a bytes object).
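A short sketch of what this view looks like in practice, assuming the array is C-contiguous as in the question (for a non-contiguous array, `reshape(-1)` may itself have to copy):

```python
import numpy as np

batch = np.ones([4, 3, 224, 224], dtype="float32")

# Reinterpret the same buffer as unsigned bytes; no data is copied.
byte_view = batch.reshape(-1).view(np.uint8)

print(byte_view.shape, byte_view.dtype)
print(np.shares_memory(batch, byte_view))

# Because the buffer is shared, writing through the byte view changes
# the original array: zeroing 4 bytes zeroes the first float32.
byte_view[:4] = 0
print(batch[0, 0, 0, 0])
```

The view has four times as many elements as `batch`, since each float32 occupies four bytes.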

Jérôme Richard
  • A great explanation of `view` and `id()`. BTW, what about the second question? Is the data stored as bytes in memory (for example, when I want to serialize the array with `tobytes`)? Maybe I mixed up the `bytes` type with bytes in memory? – Litchy Jan 16 '22 at 02:58
  • If you specifically want a `bytes` object, then a copy is mandatory, as stated in the answer. On my machine, the copy reaches 66 GiB/s in the L3 cache, which is very good for a *sequential* Python function. So, no, I do not think it is possible to get an object of type `bytes` more efficiently than with `batch.tobytes()`. That being said, you may not actually need this `bytes` object, depending on what you do next with it. – Jérôme Richard Jan 16 '22 at 14:49
0

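The same question was asked here: Numpy array: get the raw bytes without copying (https://stackoverflow.com/questions/69544408/numpy-array-get-the-raw-bytes-without-copying)

To get the bytes of an ndarray without copying, use array.data, which returns a memoryview (a reference to the underlying bytes).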
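As a minimal sketch of how this can be used (the small example array and the variable names are assumptions made for illustration), the memoryview can be cast to a flat byte view and read back on the receiving side with `np.frombuffer`:

```python
import numpy as np

array = np.arange(6, dtype=np.float32).reshape(2, 3)

view = array.data       # memoryview over the array's own buffer
flat = view.cast("B")   # 1-D view of raw bytes; needs a C-contiguous array

print(type(view).__name__, flat.nbytes)

# The memoryview is a reference, so later writes to the array show up in it.
array[0, 0] = 42.0
restored = np.frombuffer(flat, dtype=np.float32).reshape(array.shape)
print(restored[0, 0])
```

Functions that accept bytes-like objects, such as `socket.sendall` or `base64.b64encode`, should be able to take `flat` directly, so no intermediate `bytes` object is needed. The dtype and shape are not part of the raw bytes and would have to be sent separately.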

Litchy