
My question is simple, and I could not find a resource that answers it. Somewhat similar questions exist on using asarray and on numbers in general; the most succinct one is here.

How can I "calculate" the overhead of loading a numpy array into RAM (if there is any overhead)? Or, how can I determine the least amount of RAM needed to hold all arrays in memory (without time-consuming trial and error)?

In short, I have several numpy arrays of shape (x, 1323000, 1), with x being as high as 6000. This leads to a disk usage of 30GB for the largest file.

All files together need 50GB. Is it therefore enough if I use slightly more than 50GB of RAM (using Kubernetes)? I want to use the RAM as efficiently as possible, so just using 100GB is not an option.
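To illustrate the arithmetic involved, here is a minimal sketch; the shape and `float32` dtype are assumptions chosen to roughly reproduce the 30GB figure above, and the filename is hypothetical:

```python
import numpy as np

# Assumed shape and dtype -- adjust to the actual files.
shape = (6000, 1323000, 1)
dtype = np.dtype(np.float32)

# In-memory size of the data buffer = number of elements * bytes per element.
n_bytes = int(np.prod(shape, dtype=np.int64)) * dtype.itemsize
print(f"{n_bytes / 1e9:.1f} GB")  # ~31.8 GB, close to the reported 30GB on disk if stored as float32

# For an existing .npy file, shape and dtype can be read cheaply via a memmap;
# the data itself stays on disk until accessed:
# arr = np.load("big_array.npy", mmap_mode="r")   # hypothetical filename
# print(arr.shape, arr.dtype, arr.nbytes)
```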

emil
  • Loading the data from hard disk is much slower than putting the data into RAM. Therefore, the time spent on putting the data into RAM is not important. – Crawl Cycle Nov 11 '20 at 15:54
  • numpy arrays usually have a header and a block of data memory. The header is small and of constant size, and, to answer your question, for simple numpy dtypes the size of the data block can be calculated directly as (the number of bytes per element) * (the number of elements). That is, to say how large the array `(x, 1323000, 1)` is, you need to know the type (`np.float`, `np.int`, `np.float32`, etc.) and the number of bytes used for that type (e.g., use `finfo` or `iinfo`). – tom10 Nov 11 '20 at 16:01
  • The memory use of a numpy array is easy to estimate: `x*1323000*8`, the total number of elements times the size of each (typically 8 bytes). Overhead is tiny. However, use of the array(s) may produce copies, permanent or temporary. So in practice you probably need 2-3x as much memory. – hpaulj Nov 11 '20 at 16:35
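
Building on tom10's and hpaulj's comments, a short sketch of how to check the per-element size and the (tiny) per-array overhead; the dtypes and the small stand-in array below are just examples:

```python
import sys
import numpy as np

# Bytes per element for a few dtypes (finfo/iinfo report the size in bits).
for dt in (np.float64, np.float32, np.int64, np.int16):
    info = np.finfo(dt) if np.issubdtype(dt, np.floating) else np.iinfo(dt)
    print(dt.__name__, info.bits // 8, "bytes per element")

# nbytes is exactly (number of elements) * (itemsize); the per-array object
# overhead is a small fixed amount (on the order of 100 bytes) and is
# negligible at this scale.
a = np.zeros((10, 1323000, 1), dtype=np.float64)   # small stand-in array
assert a.nbytes == a.size * a.dtype.itemsize
print(a.nbytes, "bytes of data,", sys.getsizeof(a) - a.nbytes, "bytes of overhead")
```

As hpaulj notes, this arithmetic only bounds the size of the arrays themselves; operations on them can create temporary or permanent copies, so the 50GB sum is a lower bound rather than a safe container limit.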

0 Answers