
I have the following dilemma. I am trying to pickle and then unpickle a numpy array that represents an image.

Executing this code:

import pickle
import sys

import numpy as np

a1 = np.zeros((1080, 1920, 3), dtype=np.uint8)
print(sys.getsizeof(a1), a1.shape)

a2 = pickle.dumps(a1)
print(sys.getsizeof(a2), type(a2))

a3 = pickle.loads(a2)
print(sys.getsizeof(a3), a3.shape)

Produces this output:

6220928 (1080, 1920, 3)
6220995 <class 'bytes'>
128 (1080, 1920, 3)

So a1 is around 6 MB, and a2, the pickled representation of a1, is slightly larger but roughly the same size. But when I unpickle a2, I get something that is obviously not right: a3 reports only 128 bytes.

a3 looks fine: I can call methods on it, I can assign values to its cells, etc.

The result is the same if I replace pickle calls with a1.dumps and np.loads since these just call pickle.

So what exactly is the deal with the weird size?

Rares Dima
  • `sys.getsizeof` is not the right function to use. Try the `nbytes` property of the array. In your case `a1.nbytes`. See https://docs.python.org/3/library/sys.html#:~:text=All%20built%2Din%20objects%20will%20return%20correct%20results%2C%20but%20this%20does%20not%20have%20to%20hold%20true%20for%20third%2Dparty%20extensions%20as%20it%20is%20implementation%20specific – Michael Sohnen Oct 27 '22 at 22:45
  • `getsizeof` is a tricky tool to use correctly. It is better for numpy arrays than lists, but still you can get unexpected values. Here I suspect `a3` is a `view` of something else. For example `loads` might have created a 1d array, and then reshaped it. `sys.getsizeof(a3.base)` might give an expected size. – hpaulj Oct 27 '22 at 22:52

3 Answers


That's because a3 does not own the array's memory; its data lives in the object referenced by a3.base. So sys.getsizeof(a3) does not include the memory held by a3.base.

In contrast, a1 does own its memory (a1.base is None; see the description of .base in help(a1)). So sys.getsizeof(a1) reports a size that includes the whole array.

import numpy as np
import sys
import pickle

a1 = np.zeros((1080, 1920, 3), dtype=np.uint8)
print(sys.getsizeof(a1), a1.shape, type(a1))
if a1.base is None: 
    print("a1.base==None, The object a1 owns its memory. Thus the size of a1 is ",sys.getsizeof(a1))

a2 = pickle.dumps(a1)
print(sys.getsizeof(a2), type(a2))

a3 = pickle.loads(a2)
print(sys.getsizeof(a3), a3.shape, type(a3))
if a3.base is not None:
    print("a3.base is not None, The object a3 does not own its memory. Thus the size of a3 is ",sys.getsizeof(a3))

help(a1)

See more discussion of numpy memory usage here.

So depending on what you want to achieve, sys.getsizeof() may not give you an intuitive result. It primarily depends on what you mean by "object storage".

Paul Wang

From the sys.getsizeof docs:

Return the size of an object in bytes. The object can be any type of object. All built-in objects will return correct results, but this does not have to hold true for third-party extensions as it is implementation specific.

Emphasis mine. Basically there's no guarantee that sys.getsizeof will give you consistent or correct values for numpy objects.
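As a rough illustration (my own sketch, not something from the docs), the array's nbytes attribute is computed from shape and item size, so it should come out the same before and after a pickle round trip regardless of which object holds the buffer:

```python
import pickle

import numpy as np

a1 = np.zeros((1080, 1920, 3), dtype=np.uint8)
a3 = pickle.loads(pickle.dumps(a1))

# nbytes = number of elements * itemsize; it ignores buffer ownership.
print(a1.nbytes, a3.nbytes)

# The round trip preserves the contents as well as the shape.
print(np.array_equal(a1, a3))
```

If the goal is "how many bytes of image data is this", nbytes is probably the number you want.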

Woodford

First, make the array, pickle it, and load it back. The array doesn't have to be big.

In [15]: a1 = np.zeros((10,20,30)); a2 = pickle.dumps(a1); a3 = pickle.loads(a2)

nbytes matches, as does shape:

In [16]: a1.nbytes, a3.nbytes
Out[16]: (48000, 48000)    
In [17]: a1.shape, a3.shape
Out[17]: ((10, 20, 30), (10, 20, 30))

In [18]: type(a2)
Out[18]: bytes
In [19]: len(a2)
Out[19]: 48154

Since the getsizeof for a3 is so small, I suspect it's a view of something. That is, getsizeof does not 'see' its data buffer.

If an array has its own data, the base will be None. Or it may be a view of another array. But apparently loads has constructed this array by referencing a bytes object:

In [20]: type(a3.base)
Out[20]: bytes    
In [21]: len(a3.base)
Out[21]: 48000

That looks like a2 minus some sort of information header (48154 - 48000 = 154 bytes).

Anyway, getsizeof is not that useful when examining arrays, or lists.


Here's a simpler case, with a common test array:

In [22]: x = np.arange(12).reshape(3,4)
In [23]: sys.getsizeof(x)
Out[23]: 120
In [24]: sys.getsizeof(x.base)
Out[24]: 152
In [25]: x.base
Out[25]: array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])

x is actually a view of the array created by arange.
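To check this sort of thing without eyeballing getsizeof, something like the following might help. It's only a sketch: root_owner is a name I made up, not a numpy function. It follows .base until it reaches an object that is not an array, or an array whose base is None.

```python
import numpy as np


def root_owner(arr):
    """Follow .base until reaching the object that holds the data.

    Hypothetical helper for illustration; not part of numpy.
    """
    obj = arr
    while isinstance(obj, np.ndarray) and obj.base is not None:
        obj = obj.base
    return obj


x = np.arange(12).reshape(3, 4)
owner = root_owner(x)

# x is a view, so it does not own its data; the arange result does.
print(x.flags.owndata, type(owner), owner.nbytes)

# An array that owns its data is its own root.
y = np.zeros(3)
print(root_owner(y) is y)
```

The flags.owndata attribute gives the same yes/no answer directly; the helper just tells you who the owner is.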

edit

I suspect a3 is created with something like:

a4 = np.ndarray(a1.shape, buffer=a2[154:])
a5=np.frombuffer(a2, dtype='float', offset=154).reshape(a1.shape)
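I can't confirm that's what loads really does, but the effect on getsizeof can be reproduced without pickle. In this sketch (my own, using a plain bytes copy from tobytes instead of slicing a2, since the header length is an assumption), an array wrapped around a bytes object reports a small size while the array that owns its data reports the full size:

```python
import sys

import numpy as np

shape = (10, 20, 30)
owner = np.zeros(shape)            # allocates and owns its data buffer
raw = owner.tobytes()              # plain bytes copy of the data, no header
view = np.frombuffer(raw, dtype=owner.dtype).reshape(shape)

# The owning array's size includes the data; the wrapper's does not.
print(owner.flags.owndata, sys.getsizeof(owner), owner.nbytes)
print(view.flags.owndata, sys.getsizeof(view), view.nbytes)
```

One difference from a3: an array made with frombuffer over bytes is read-only, whereas the unpickled array in the question accepts assignments.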

Another note: pickle lets the object specify how it will be serialized. For ndarray the result looks much like what np.save writes: a header followed by a copy of the array's data buffer.

In [35]: np.save('test.npy',a1)
In [38]: !dir test.npy
 Volume in drive C is Windows
 Volume Serial Number is 4EEB-1BF0

 Directory of C:\Users\paul

10/27/2022  05:43 PM            48,128 test.npy
               1 File(s)         48,128 bytes
hpaulj