15

I want to create an empty Numpy array in Python, to later fill it with values. The code below generates a 1024x1024x1024 array with 2-byte integers, which means it should take at least 2GB in RAM.

>>> import numpy as np; from sys import getsizeof
>>> A = np.zeros((1024,1024,1024), dtype=np.int16)
>>> getsizeof(A)
2147483776

From getsizeof(A), we see that the array takes 2^31 + 128 bytes (the extra 128 bytes presumably being header information). However, in my task manager I can see that Python is only using 18.7 MiB of memory.
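
A quick sanity check of that split:

>>> A.nbytes                  # raw data buffer: 1024**3 * 2 bytes
2147483648
>>> getsizeof(A) - A.nbytes   # per-object ndarray overhead
128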

Suspecting that the array was being compressed, I assigned random values to each memory slot so that it could not be compressed:

>>> for i in range(1024):
...   for j in range(1024):
...     for k in range(1024):
...       A[i,j,k] = np.random.randint(32767, dtype=np.int16)

The loop is still running, and my RAM usage is slowly increasing (presumably as the arrays composing A inflate with the incompressible noise). I'm assuming it would make my code faster to force numpy to expand this array from the beginning. Curiously, I haven't seen this behavior documented anywhere!

So, 1. Why does numpy do this? and 2. How can I force numpy to allocate memory?

lynn
  • 1. I would suspect it's for both memory efficiency and speed. 2. You can initialize a numpy array with random numbers if you want, simply `A=np.random.randn(1024,1024,1024)`. Not sure why you would want to force numpy to do this though. – enumaris Jul 12 '18 at 20:54
  • [That's normal `calloc` behavior.](https://stackoverflow.com/questions/44487786/performance-of-zeros-function-in-numpy) I don't see *why* you're assuming this would be any faster if you forced the system to "really" allocate all that memory up front. – user2357112 Jul 12 '18 at 20:54
  • Oh, that makes a lot of sense after googling it. I wasn't familiar with calloc before now. I had assumed Numpy had stored each array in an intelligent way, replacing it with an actual array when requested. – lynn Jul 12 '18 at 21:01
  • Forcing it to allocate memory in advance will probably be slower, not faster. There are occasionally reasons to do so, but they're mostly related to cases where you're pushing the limits of your RAM and fighting the OS's overcommit, or cases where you're trying to get more detailed platform-specific benchmarks or profiling, etc., and usually you end up having to do something low-level and platform-specific anyway. – abarnert Jul 12 '18 at 21:01
  • If you _do_ need to do this, what you generally want is to manually create an (anonymous or disk-backed) `np.memmap` or a `mmap.mmap`, `MADV_WILLNEED` and `MADV_SEQUENTIAL` it (or the relevant equivalent for your platform), and then make an array using the map for storage. This still doesn't _force_ the kernel to allocate the memory the way you want, but it strongly encourages it to do so. – abarnert Jul 12 '18 at 21:07 (a sketch of this approach follows the comments)
  • This makes sense! I appreciate the pointers here, this discussion answers my questions – lynn Jul 12 '18 at 21:10
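
A rough sketch of the mmap/madvise route from the comments, assuming Linux and Python 3.8 or newer (where `mmap.madvise` and the `MADV_*` constants are available); the kernel is still only encouraged, not forced, to commit the pages:

import mmap
import numpy as np

shape = (1024, 1024, 1024)
nbytes = int(np.prod(shape)) * np.dtype(np.int16).itemsize  # ~2 GiB

# Anonymous (not disk-backed) memory map to use as the array's storage.
mm = mmap.mmap(-1, nbytes)

# Hint that we want the pages soon and will walk them sequentially.
mm.madvise(mmap.MADV_WILLNEED)
mm.madvise(mmap.MADV_SEQUENTIAL)

# Wrap the mapping in an ndarray without copying.
A = np.frombuffer(mm, dtype=np.int16).reshape(shape)
A[0, 0, 0] = 1  # writes go straight into the mapped memory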

2 Answers

3

A neat answer to your first question can also be found in this StackOverflow answer.

To answer your second question, you can force the memory to be allocated in a reasonably efficient manner as follows:

A = np.empty((1024,1024,1024), dtype=np.int16)
A.fill(0)

because then the memory is actually touched. On my machine, with my setup,

A = np.empty(0)
A.resize((1024, 1024, 1024))

also does the trick, but I cannot find this behavior documented, so it might be an implementation detail; realloc is used under the hood in numpy.
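
To see that the pages really get committed (and not just reserved), you can watch the process's resident set size before and after touching the array. A minimal sketch, assuming the third-party psutil package is installed:

import numpy as np
import psutil  # third-party: pip install psutil

proc = psutil.Process()

def rss_mib():
    return proc.memory_info().rss / 2**20

print(f"start:          {rss_mib():8.1f} MiB")
A = np.empty((1024, 1024, 1024), dtype=np.int16)  # address space reserved, pages not yet touched
print(f"after np.empty: {rss_mib():8.1f} MiB")    # typically still small
A.fill(0)                                         # writing touches every page
print(f"after fill(0):  {rss_mib():8.1f} MiB")    # should grow by roughly 2 GiB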

mutableVoid
  • This should be the accepted answer. Allocating memory does not create the memory pages, it only reserves an address range. The pages containing that memory only get instantiated when that address range is actually touched. This has nothing to do with Numpy and everything to do with the operating system. – Victor Eijkhout Nov 21 '21 at 15:27
1

Let's look at some timings for a smaller case:

In [107]: A = np.zeros(10000,int)
In [108]: for i in range(A.shape[0]): A[i]=np.random.randint(327676)

We don't need to make A 3d to get the same effect; 1d of the same total size would be just as good.

In [109]: timeit for i in range(A.shape[0]): A[i]=np.random.randint(327676)
37 ms ± 133 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Now compare that time to the alternative of generating the random numbers with one call:

In [110]: timeit np.random.randint(327676, size=A.shape)
185 µs ± 905 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Much much faster.

If we do the same loop, but simply assign the random number to a variable (and throw it away):

In [111]: timeit for i in range(A.shape[0]): x=np.random.randint(327676)
32.3 ms ± 171 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

The times are nearly the same as the original case. Assigning the values to the zeros array is not the big time consumer.

I'm not testing a very large case as you are, and my A has already been initialized in full, so you are welcome to repeat the comparisons with your size. But I think the pattern will still hold: iterating 1024x1024x1024 times (about 100,000 times more than my example) is the big time consumer, not the memory allocation.
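
A sketch at your full size, for reference (not timed here, since it allocates the whole ~2 GB array in one call):

import numpy as np

# One vectorized call: allocate the full 1024**3 int16 array and fill it with
# values in [0, 32767) in a single step, instead of looping a billion times.
A = np.random.randint(32767, size=(1024, 1024, 1024), dtype=np.int16)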

Something else you might experiment with: just iterate on the first dimension of A, and assign a block of random integers shaped like the remaining dimensions. For example, expanding my A with a size-10 dimension:

In [112]: A = np.zeros((10,10000),int)
In [113]: timeit for i in range(A.shape[0]): A[i]=np.random.randint(327676,size=A.shape[1])
1.95 ms ± 31.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

A is 10x larger than in [107], but takes far less time to fill (1.95 ms vs 37 ms), because it only has to iterate 10 times. In numpy, if you must iterate, try to do it only a few times, over a larger chunk of work each time.

(timeit repeats the test many times (e.g. 7*10), so it isn't going to capture any one-time memory allocation step, even if I used an array large enough for that to matter.)

hpaulj