
I am trying to understand the size difference between a NumPy masked array and a normal array with NaNs.

import numpy as np
g = np.random.random((5000,5000))
indx = np.random.randint(0,4999,(500,2))
mask =  np.full((5000,5000),False,dtype=bool)
mask[indx] = True
g_mask = np.ma.array(g,mask=mask)

I used the following answer to compute the size of the object:

import sys
from types import ModuleType, FunctionType
from gc import get_referents

# Custom objects know their class.
# Function objects seem to know way too much, including modules.
# Exclude modules as well.
BLACKLIST = type, ModuleType, FunctionType


def getsize(obj):
    """sum size of object & members."""
    if isinstance(obj, BLACKLIST):
        raise TypeError('getsize() does not take argument of type: '+ str(type(obj)))
    seen_ids = set()
    size = 0
    objects = [obj]
    while objects:
        need_referents = []
        for obj in objects:
            if not isinstance(obj, BLACKLIST) and id(obj) not in seen_ids:
                seen_ids.add(id(obj))
                size += sys.getsizeof(obj)
                need_referents.append(obj)
        objects = get_referents(*need_referents)
    return size

That gives me the following result:

>>> getsize(g)
200000112
>>> getsize(g_mask)
25000924

Why is the unmasked array reported as bigger than the masked array? How can I estimate the real size of the masked array versus the unmasked array?

G M

2 Answers


numpy.ndarray has no tp_traverse, so it's incompatible with the getsize function you're trying to use. The GC system can't see the references owned by the ndarray part of your masked array. In particular, the base of g_mask is not getting included in your output, which is why the result is roughly the size of the mask alone.
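
To see the effect on a smaller array, here is a minimal sketch (the shape and variable names are only illustrative). It relies on sys.getsizeof counting an array's data buffer only when the array owns it, so a view such as the masked array reports just its small header:

```python
import gc
import sys

import numpy as np

g = np.random.random((1000, 1000))      # owns an 8 MB float64 buffer
mask = np.zeros(g.shape, dtype=bool)
g_mask = np.ma.array(g, mask=mask)      # default copy=False: a view on g

# The array that owns the buffer reports it; the view does not.
size_owner = sys.getsizeof(g)
size_view = sys.getsizeof(g_mask)
print(size_owner, size_view)

# The view keeps g alive through .base ...
print(g_mask.base is g)

# ... but, per the explanation above, that reference is not expected to
# appear among the referents the GC can see, so getsize() never reaches g.
print(any(ref is g for ref in gc.get_referents(g_mask)))
```

The mask, by contrast, is stored as an ordinary Python attribute of the masked array, so a GC-based walk can find it.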

user2357112
  • Thanks, any idea how I can get the size of the two objects for a comparison? – G M Nov 02 '19 at 22:23
In [23]: g = np.random.random((5000,5000)) 
    ...: indx = np.random.randint(0,4999,(500,2)) 
    ...: mask =  np.full((5000,5000),False,dtype=bool) 
    ...: mask[indx] = True 
    ...: g_mask = np.ma.array(g,mask=mask)    

Comparing the g array with the _data attribute of g_mask, we see that the latter is just a view of the former:

In [24]: g.__array_interface__                                                  
Out[24]: 
{'data': (139821997776912, False),
 'strides': None,
 'descr': [('', '<f8')],
 'typestr': '<f8',
 'shape': (5000, 5000),
 'version': 3}
In [25]: g_mask._data.__array_interface__                                       
Out[25]: 
{'data': (139821997776912, False),
 'strides': None,
 'descr': [('', '<f8')],
 'typestr': '<f8',
 'shape': (5000, 5000),
 'version': 3}

They share the same data buffer, but their ids are different:

In [26]: id(g)                                                                  
Out[26]: 139822758212672
In [27]: id(g_mask._data)                                                       
Out[27]: 139822386925440
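
One way to check this without comparing addresses by eye is np.shares_memory. A small sketch follows; the copy=True variant is added only for contrast and is not part of the session above:

```python
import numpy as np

g = np.random.random((100, 100))
mask = np.zeros(g.shape, dtype=bool)

# Default construction (copy=False): the masked array wraps g's buffer.
g_mask = np.ma.array(g, mask=mask)
same_pointer = (g_mask.data.__array_interface__['data'][0]
                == g.__array_interface__['data'][0])
shares_view = np.shares_memory(g, g_mask.data)

# With copy=True the data is duplicated, so nothing is shared with g.
g_copy = np.ma.array(g, mask=mask, copy=True)
shares_copy = np.shares_memory(g, g_copy.data)

print(same_pointer, shares_view, shares_copy)
```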

Same for the mask:

In [28]: mask.__array_interface__                                               
Out[28]: 
{'data': (139822298669072, False),
 'strides': None,
 'descr': [('', '|b1')],
 'typestr': '|b1',
 'shape': (5000, 5000),
 'version': 3}
In [29]: g_mask._mask.__array_interface__                                       
Out[29]: 
{'data': (139822298669072, False),
 'strides': None,
 'descr': [('', '|b1')],
 'typestr': '|b1',
 'shape': (5000, 5000),
 'version': 3}

Actually, with this construction the _mask is the very same array object, not just a view:

In [30]: id(mask)                                                               
Out[30]: 139822385963056
In [31]: id(g_mask._mask)                                                       
Out[31]: 139822385963056

The __array_interface__ of the masked array is that of its ._data attribute:

In [32]: g_mask.__array_interface__                                             
Out[32]: 
{'data': (139821997776912, False),

nbytes is the size of the data buffer for an array:

In [34]: g_mask.data.nbytes                                                     
Out[34]: 200000000
In [35]: g_mask.mask.nbytes                                                     
Out[35]: 25000000

A boolean array uses 1 byte per element and a float64 array uses 8 bytes per element, so a full mask adds 12.5% on top of the data buffer.
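
Putting the two numbers together, a rough estimate of a masked array's memory is the sum of both buffers. The sketch below uses a hypothetical helper, masked_nbytes (not part of NumPy), ignores the small per-object overhead, and treats the nomask case as zero mask bytes. An array that stores NaNs instead of a mask only has the data buffer:

```python
import numpy as np


def masked_nbytes(arr):
    """Hypothetical helper: data buffer plus mask buffer, in bytes."""
    data_bytes = np.ma.getdata(arr).nbytes
    mask = np.ma.getmask(arr)
    # nomask means no mask array was allocated at all.
    mask_bytes = 0 if mask is np.ma.nomask else mask.nbytes
    return data_bytes + mask_bytes


g = np.random.random((1000, 1000))
mask = np.zeros(g.shape, dtype=bool)
mask[::10, ::10] = True
g_mask = np.ma.array(g, mask=mask)

# NaN-based alternative: same float64 buffer size, no separate mask.
g_nan = g.copy()
g_nan[mask] = np.nan

print(g_nan.nbytes)           # 8 bytes per element
print(masked_nbytes(g_mask))  # 8 + 1 bytes per element
```

Note that when the masked array is a view, as here, its data bytes are shared with g rather than allocated a second time.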

hpaulj
  • Thanks, any idea how I can estimate the size correctly? – G M Nov 02 '19 at 23:15
  • What's missing in my description? `nbytes`? Masking adds the size of the `mask`. – hpaulj Nov 02 '19 at 23:17
  • Yes, I don't understand how I can get the size in bytes of the two different objects. What I am trying to understand is whether the masked array consumes a lot more memory than the unmasked array. – G M Nov 02 '19 at 23:28
  • I tried to explain. The masked array is actually 2 arrays - sum the size of the two. Do you understand how numpy arrays are stored? And the difference between a `copy` and `view`? – hpaulj Nov 02 '19 at 23:43
  • Yes, I know what a view is; I simply did not know that data.nbytes gives you the number of bytes. g_mask.mask.nbytes is exactly the number of bytes in the mask; I was wondering whether the masked array class also uses some other bytes. – G M Nov 03 '19 at 15:23