To understand whether there is a more efficient way, one first needs to understand why the current code is slow.
Numpy functions creating arrays, like `np.zeros` or `np.empty` (or any operation creating a temporary array, such as a multiplication or an addition), request a memory buffer from the CPython allocator, which forwards the request to the default libc allocator (which is distinct from the OS allocator) or to a custom allocator if one is set. `np.zeros` requests a buffer pre-filled with zeros while `np.empty` just requests a raw buffer.
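As a quick sanity check of the difference, here is a minimal sketch (the array size and dtype are arbitrary choices for illustration):

```python
import numpy as np

# np.zeros guarantees the returned buffer contains only zeros.
a = np.zeros(8, dtype=np.int64)

# np.empty returns a raw buffer: its content is undefined (it may happen
# to contain zeros when the pages come fresh from the OS, but nothing is
# guaranteed once the allocator starts recycling buffers).
b = np.empty(8, dtype=np.int64)

print(a)        # [0 0 0 0 0 0 0 0]
print(b.shape)  # (8,)
```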
The default allocator behaves differently depending on the platform (mainly the operating system). On Windows, it requests memory from the OS and systematically frees big buffers back to it, while the default memory allocators on Mac and Linux tend to be more conservative: they keep fairly big local chunks of memory and try to reuse them as much as possible rather than releasing the space back to the OS.
This default policy has a drastic impact on performance and memory usage. Indeed, when Numpy requests a zero-filled buffer and the allocator recycles it from previously allocated space (not yet released to the OS), the allocator itself must fill all the values with zeros. However, when a zero-filled buffer is requested directly from the OS, the OS can return a virtual memory buffer that is filled lazily, only when a first touch is performed on each memory page. This means the allocation can be much faster for huge arrays, but the overhead of filling the array with zeros is merely delayed. In the end, that overhead is paid once all pages have been read/written (i.e. the array has been completely read or written). In fact, this lazy filling is more expensive than recycling the buffer in the allocator would be, because of the page faults. Some OSs pre-fill memory chunks (possibly in separate threads) to speed up such zero-filled buffer requests. As a result, you should be very careful about the way you benchmark your application.
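The delayed cost can be observed with a small timing sketch (the exact numbers depend heavily on the OS, the allocator and the machine, so treat this purely as an illustration; the size `n` is an arbitrary choice, just large enough that the buffer typically comes straight from the OS):

```python
import time
import numpy as np

n = 16 * 1024 * 1024  # 16M float64 values, ~128 MiB

t0 = time.perf_counter()
a = np.zeros(n)        # typically served with lazily zero-filled virtual pages
t1 = time.perf_counter()
a[:] = 1.0             # first touch: the page faults (and actual filling) happen here
t2 = time.perf_counter()

print(f"allocation:  {t1 - t0:.4f} s")
print(f"first touch: {t2 - t1:.4f} s")  # often dominates the allocation time on Linux
```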
In practice, memory requested from the OS is always filled with zeros on mainstream platforms (by default on Windows, Linux and Mac) for security reasons: memory previously allocated, filled and released by one process must not be readable from another process, since the chunks can contain sensitive information (for example, your browser can store passwords in memory, and you do not expect a Numpy Python script to be able to read them without any privileges). This zero-filling is generally done at page-fault time. Thus, calling `np.empty` or `np.zeros` gives the same result when the array is requested from the OS. However, when the array is recycled by the allocator, `np.empty` can be much faster and there is (generally) no page-fault overhead to pay (page faults happen once per page, as long as the memory pages are not stored somewhere else, e.g. in swap when you run out of memory).
Put shortly, there is no way (from Python alone) to speed up the creation of an array as long as you request a new array and you read/write the whole target array. Using a custom system allocator does not help much since the array has to be filled anyway. If it is OK for you to pay the overhead progressively, you can use a manual memmap. Otherwise, you can preallocate some buffers and recycle them yourself. This can be faster because you may not need to fill them with zeros at all, and you will not pay the cost of the page faults. There is no free lunch.
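A minimal sketch of the preallocate-and-recycle approach (the buffer size and the computation are arbitrary examples; the point is that `out` is fully overwritten, so it never needs to be zeroed):

```python
import numpy as np

# Preallocate one scratch buffer once, up front.
scratch = np.empty(1_000_000)

def compute_into(out):
    # Every element of `out` is overwritten, so a raw (np.empty-style)
    # buffer is fine: no zero-filling needed, and after the first call
    # the pages are already faulted in, so reuse is cheap.
    np.multiply(np.arange(out.size, dtype=np.float64), 2.0, out=out)
    return out

result = compute_into(scratch)  # reuse `scratch` across iterations instead of allocating
```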