
I have a twofold problem concerning memory computation and sparse matrices for 3D arrays in Python:

I have various sparse 3D numpy arrays (of 1s and 0s). That is, an array looks like this:

A =

[[[0. 1. 1.]
  [1. 0. 1.]
  [1. 0. 0.]
  ...
  [1. 0. 0.]
  [1. 0. 0.]
  [1. 0. 0.]]

 ...

 [[1. 0. 1.]
  [0. 1. 0.]
  [1. 0. 1.]
  ...
  [0. 0. 0.]
  [0. 0. 0.]
  [0. 0. 0.]]]

and array.shape is equal to (x, y, 3).

I would like to find a way to (1) measure the array's memory, then (2) store it as a sparse matrix/array (using something similar to scipy's csr_matrix), then (3) measure the memory of the sparse matrix/array to (hopefully) see an improvement in memory.
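A sketch of those three steps, assuming scipy's `csr_matrix` and made-up dimensions and sparsity (csr_matrix only accepts 2-D input, so the (x, y, 3) array is reshaped to (x, y*3) first):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical dimensions and sparsity for illustration
x, y = 400, 300
rng = np.random.default_rng(0)
A = (rng.random((x, y, 3)) < 0.2).astype(np.float64)  # ~80% zeros

# (1) dense memory: nbytes is the size of the data buffer
dense_bytes = A.nbytes                      # x * y * 3 * 8 for float64

# (2) csr_matrix only handles 2-D, so flatten the last axis first
S = csr_matrix(A.reshape(x, y * 3))

# (3) sparse memory: sum the three buffers CSR actually stores
sparse_bytes = S.data.nbytes + S.indices.nbytes + S.indptr.nbytes

print(dense_bytes, sparse_bytes)
```

The original (x, y, 3) array is recoverable with `S.toarray().reshape(x, y, 3)`.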

My first problem is that I generally have trouble with the Python memory-measurement solutions I have found so far. For example, I expected to see a difference in memory size between an array of floats with many decimal places (e.g. B = [[[0.38431373 0.4745098 0.6784314 ] [0.41963135 0.49019608 0.69411767] [0.40392157 0.49019608 0.6862745 ] ...]]]) and an array of 1.s & 0.s of the same shape (like array A), which should have shown a big improvement (I need to measure this difference as well). Yet Python reports the same memory size for arrays of the same shape. Here are the methods I used and their outputs:

print(sizeof(A))   #prints 3783008

asizeof.asizeof(A)   #prints 3783024

print(actualsize(A))  #prints 3783008

print(A.nbytes)  #prints 3782880

print(total_size(A))  #prints 3783008

getsize(A)  #prints 3783008

print(len(pickle.dumps(A)))  #prints 3783042

********************

print(asizeof.asizeof(B)) #prints 5044112

sys.getsizeof(B) #prints 128 !!!

print(sizeof(B))  #prints 128 !!!

print(actualsize(B))  #prints 128 !!!

print(total_size(B))   #prints 128 !!!

print(B.nbytes)  #prints 3782880 

getsize(B)  #prints 128 !!!

print(len(pickle.dumps(B)))  #prints 3783042

(methods collected from here, here, and here).
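For what it's worth, a small experiment that reproduces this 128-byte behaviour with a view, in case it helps diagnose it: `sys.getsizeof` only counts the ndarray header when the array does not own its data buffer, while `nbytes` always reports shape × itemsize regardless of the values stored.

```python
import sys
import numpy as np

base = np.zeros((100, 100, 3))
view = base[:, :, :]          # a view: shares base's data buffer

# getsizeof includes the buffer only for the owning array;
# for the view it reports just the small ndarray header
print(sys.getsizeof(base))    # header + 240000-byte buffer
print(sys.getsizeof(view))    # header only, ~100-200 bytes

# nbytes is shape * itemsize and ignores the actual float values,
# so "many decimal places" vs 1.s and 0.s makes no difference
ones = np.ones((100, 100, 3))
fracs = np.full((100, 100, 3), 0.38431373)
print(ones.nbytes == fracs.nbytes)  # True
```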

My second problem is that I cannot find an economical way to store a matrix (of a certain sparsity) as a sparse matrix for 3D arrays: scipy's csr_matrix and pandas' SparseArray work for 2D arrays only, and sparse.COO() is very costly for 3D arrays: it only starts to help with memory at sparsities of ~80% and higher. For example, a 70% sparse array stored with sparse.COO() is about 8 MB (measured e.g. with pickle), which is much bigger than the dense array. Or maybe the problem is still the way I compute memory (see the methods listed above).
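A back-of-the-envelope check of the COO overhead, using a hand-rolled coordinate format (the shape and density here are made up): each nonzero costs 8 bytes for the value plus 3 × 8 bytes for the coordinates, i.e. 32 bytes, versus 8 bytes per element dense, so a 3-D COO representation can only win below ~25% density.

```python
import numpy as np

rng = np.random.default_rng(0)
A = (rng.random((200, 200, 3)) < 0.1).astype(np.float64)  # ~10% nonzero

# Hand-rolled COO: one integer coordinate per axis per nonzero
# (int64 on 64-bit platforms), plus the float64 value
coords = np.array(np.nonzero(A))   # shape (3, nnz)
values = A[tuple(coords)]          # shape (nnz,)

coo_bytes = coords.nbytes + values.nbytes   # 32 bytes per nonzero
print(coo_bytes, A.nbytes)

# reconstruct to verify nothing was lost
B = np.zeros(A.shape)
B[tuple(coords)] = values
```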

Any ideas of what I should do? I am really sorry this post is too long! Thank you in advance!

  • Why would you expect an array of floats to be different sizes based on what the values of the floats are? That's not how floats work - if you really want, you can [pack arrays of 1s and 0s to be single bit values](https://stackoverflow.com/questions/5602155/numpy-boolean-array-with-1-bit-entries) though. – CJR May 24 '22 at 14:07
  • Is your `B` a `view` of another array? `nbytes` is normally enough, and is just the product of dimensions and dtype size. – hpaulj May 24 '22 at 14:36
  • For COO sparse formats, each nonzero element requires its data (determined by `dtype`), plus coordinate values (usually stored as `int64`). `scipy.sparse` uses 3 arrays to store these values: data, rows, and columns. For a 3d version you won't get any space savings until `nnz` is less than 25%. – hpaulj May 24 '22 at 15:30
  • Those variants on `getsizeof` help when dealing with objects that contain references to other objects, such as lists and dicts. For numpy arrays they aren't needed. The main memory use is the data buffer, which stores all elements as bytes (normally 8 bytes per element). So `nbytes` is just the total number of elements times 8. Obviously you need to be aware of whether the array is a `view` or not. Similarly for sparse matrices, you need to understand how the data is stored. – hpaulj May 24 '22 at 15:36
  • @CJR Thank you for your comment! I naively expected that an array of the same shape with floats of many decimal places would use more memory than an array (of floats) with no decimals at all. This is why I was looking for a way (memory wise) that could capture this reduction in digits. I also failed to mention that these arrays represent RGB images. I cannot convert 1.s and 0.s to 1s and 0s because ```plt.imshow()``` only works with [0. 1.] or [0 255] values. (an array of 1s and 0s would print a black image). – Amadeo Amadei May 24 '22 at 16:21
  • @hpaulj Thank you for your comment! ```B``` is not a ```view``` of another array (if I understand what you mean). It is a completely different array with different values, but with the same dimensions (x, y, 3) as ```A```. – Amadeo Amadei May 24 '22 at 18:21
  • @hpaulj Thank you for your insightful comments on COO sparse formats and ```getsizeof```! – Amadeo Amadei May 24 '22 at 18:22
  • `getsizeof(B)` of 128 means it sees the 'object' cover, the part with shape, strides, etc, but doesn't take into account the data-buffer, which is what `nbytes` counts. That's why I think it's a `view` of something else, `B.base`. – hpaulj May 24 '22 at 18:43
