To understand whether there is a more efficient way, one first needs to understand why the current code is slow.
Numpy functions creating arrays, like `np.zeros` or `np.empty` (or any operation creating a temporary array, such as a multiplication or an addition), request a memory buffer from the CPython allocator, which forwards the request to the default libc allocator (which is distinct from the OS allocator) or to a custom allocator if one is set. `np.zeros` requests a buffer pre-filled with zeros while `np.empty` just requests a raw buffer.
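As a quick sanity check of the difference, here is a minimal sketch (the array size and dtype are arbitrary choices for illustration):

```python
import numpy as np

# np.zeros guarantees the returned buffer contains only zeros.
a = np.zeros(8, dtype=np.int64)

# np.empty returns a raw buffer: its content is undefined (it may happen
# to contain zeros when the pages come fresh from the OS, but nothing is
# guaranteed once the allocator starts recycling buffers).
b = np.empty(8, dtype=np.int64)

print(a)        # [0 0 0 0 0 0 0 0]
print(b.shape)  # (8,)
```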
The default allocator behaves differently depending on the platform (mainly the operating system). On Windows, it requests memory from the OS and systematically frees big buffers back to it, while the default memory allocators on Mac and Linux tend to be more conservative: they keep fairly big local chunks of memory and try to reuse them as much as possible rather than releasing the space back to the OS.
This default policy has a drastic impact on performance and memory usage. Indeed, when Numpy requests a zero-filled buffer and the allocator recycles it from previously allocated space (not yet released to the OS), the allocator itself must fill all the values with zeros. However, when a zero-filled buffer is requested directly from the OS, the OS can return a virtual memory buffer that is filled lazily, only when a first touch is performed on each memory page. This means the allocation can be much faster for huge arrays, but the overhead of filling the array with zeros is merely delayed. In the end, that overhead is paid once all pages have been read/written (i.e. the array has been completely read or written). In fact, this lazy filling is more expensive than recycling the buffer in the allocator would be, because of the page faults. Some OSs pre-fill memory chunks (possibly in separate threads) to speed up such zero-filled buffer requests. As a result, you should be very careful about the way you benchmark your application.
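The delayed cost can be observed with a small timing sketch (the exact numbers depend heavily on the OS, the allocator and the machine, so treat this purely as an illustration; the size `n` is an arbitrary choice, just large enough that the buffer typically comes straight from the OS):

```python
import time
import numpy as np

n = 16 * 1024 * 1024  # 16M float64 values, ~128 MiB

t0 = time.perf_counter()
a = np.zeros(n)        # typically served with lazily zero-filled virtual pages
t1 = time.perf_counter()
a[:] = 1.0             # first touch: the page faults (and actual filling) happen here
t2 = time.perf_counter()

print(f"allocation:  {t1 - t0:.4f} s")
print(f"first touch: {t2 - t1:.4f} s")  # often dominates the allocation time on Linux
```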
In practice, memory requested from the OS is always filled with zeros on mainstream platforms (by default on Windows, Linux and Mac) for security reasons: memory previously allocated, filled and released by one process must not be readable from another process, since the chunks can contain sensitive information (for example, your browser can store passwords in memory, and you do not expect a Numpy Python script to be able to read them without any privileges). This zero-filling is generally done at page-fault time. Thus, calling `np.empty` or `np.zeros` gives the same result when the array is requested from the OS. However, when the array is recycled by the allocator, `np.empty` can be much faster and there is (generally) no page-fault overhead to pay (page faults happen once per page, as long as the memory pages are not stored somewhere else, e.g. in swap when you run out of memory).
Put shortly, there is no way (from Python alone) to speed up the creation of an array as long as you request a new array and you read/write the whole target array. Using a custom system allocator does not help much since the array has to be filled anyway. If it is OK for you to pay the overhead progressively, you can use a manual memmap. Otherwise, you can preallocate some buffers and recycle them yourself. This can be faster because you may not need to fill them with zeros at all, and you will not pay the cost of the page faults. There is no free lunch.
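A minimal sketch of the preallocate-and-recycle approach (the buffer size and the computation are arbitrary examples; the point is that `out` is fully overwritten, so it never needs to be zeroed):

```python
import numpy as np

# Preallocate one scratch buffer once, up front.
scratch = np.empty(1_000_000)

def compute_into(out):
    # Every element of `out` is overwritten, so a raw (np.empty-style)
    # buffer is fine: no zero-filling needed, and after the first call
    # the pages are already faulted in, so reuse is cheap.
    np.multiply(np.arange(out.size, dtype=np.float64), 2.0, out=out)
    return out

result = compute_into(scratch)  # reuse `scratch` across iterations instead of allocating
```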