
When measuring memory consumption of np.zeros:

import psutil
import numpy as np

process = psutil.Process()
N = 10**8
start_rss = process.memory_info().rss  # resident set size before the allocation
a = np.zeros(N, dtype=np.float64)
print("memory for a", process.memory_info().rss - start_rss)

the result is an unexpected 8192 bytes, i.e. almost 0, while 1e8 doubles would need 8e8 bytes.

When replacing np.zeros(N, dtype=np.float64) with np.full(N, 0.0, dtype=np.float64), the memory needed for a is 800002048 bytes.
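
For reference, the equivalent measurement with np.full, using the same psutil approach as above (exact byte counts will vary with the OS and allocator):

import psutil
import numpy as np

process = psutil.Process()
N = 10**8
start_rss = process.memory_info().rss
a = np.full(N, 0.0, dtype=np.float64)  # writes every element, so all pages get committed
print("memory for a", process.memory_info().rss - start_rss)  # roughly 8e8 bytes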

There are similar discrepancies in running times:

import numpy as np
N=10**8
%timeit np.zeros(N, dtype=np.float64)
# 11.8 ms ± 389 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np.full(N, 0.0, dtype=np.float64)
# 419 ms ± 7.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

That is, np.zeros is up to 40 times faster at large sizes.

I'm not sure these differences appear on all architectures/operating systems, but I've observed them at least on x86-64 under both Windows and Linux.

What differences between np.zeros and np.full explain the different memory consumption and the different running times?

ead
  • might also be interesting to add [`np.ones`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ones.html#numpy.ones) to the mix. – Ma0 Mar 11 '20 at 16:34
  • `np.zeros` in conjunction with the operating system is doing a 'lazy' initialization: https://stackoverflow.com/questions/27574881/why-does-numpy-zeros-takes-up-little-space (see the sketch after these comments) – hpaulj Mar 11 '20 at 18:27
  • Short version: The answer is the same as the answer to [Why malloc+memset is slower than calloc?](https://stackoverflow.com/q/2688466/364696). `calloc`/`numpy.zeros` doesn't actually *write* the memory at all at allocation time (for large enough allocations), `malloc`+`memset`/`numpy.full` does (in theory, it could special case `full` for zero values; in practice it appears not to). – ShadowRanger Mar 11 '20 at 20:31
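
A minimal sketch of the lazy behavior described in the comments above, using the question's psutil approach (it assumes an OS with lazy commit / copy-on-write zero pages, as on Linux): RSS barely moves after np.zeros and only grows once the pages are actually written.

import psutil
import numpy as np

process = psutil.Process()
N = 10**8

rss0 = process.memory_info().rss
a = np.zeros(N, dtype=np.float64)  # large calloc: pages mapped, but not yet committed
rss1 = process.memory_info().rss
a[:] = 0.0                         # first write faults in and commits every page
rss2 = process.memory_info().rss

print("after np.zeros:", rss1 - rss0)  # close to 0
print("after writing :", rss2 - rss1)  # roughly 8e8 bytes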

2 Answers


I don't trust psutil for these memory benchmarks, and rss (Resident Set Size) may not be the right metric in the first place.

Using the stdlib tracemalloc module, you can get correct-looking numbers for the memory allocation - the delta should be approximately 800000000 bytes for this N and float64 dtype:

>>> import numpy as np
>>> import tracemalloc
>>> N = 10**8
>>> tracemalloc.start()
>>> tracemalloc.get_traced_memory()  # current, peak
(159008, 1874350)
>>> a = np.zeros(N, dtype=np.float64)
>>> tracemalloc.get_traced_memory()
(800336637, 802014880)

For the timing differences between np.full and np.zeros, compare the man pages for malloc and calloc: np.zeros can go through an allocation routine that returns already-zeroed pages. See PyArray_Zeros, which calls PyArray_NewFromDescr_int passing 1 for the zeroed argument; that function has a special case for allocating zeros faster:

if (zeroed || PyDataType_FLAGCHK(descr, NPY_NEEDS_INIT)) {
    data = npy_alloc_cache_zero(nbytes);
}
else {
    data = npy_alloc_cache(nbytes);
}

It looks like np.full does not have this fast path. There the performance is similar to doing an ordinary allocation first and then an O(n) copy:

a = np.empty(N, dtype=np.float64)
a[:] = np.float64(0.0)
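
One way to sanity-check that claim in IPython (a sketch; absolute timings depend on the machine, but the two should be comparable, and both far slower than np.zeros):

import numpy as np
N = 10**8

%timeit np.full(N, 0.0, dtype=np.float64)

def init_then_fill(n):
    a = np.empty(n, dtype=np.float64)  # uninitialized allocation
    a[:] = 0.0                         # explicit O(n) fill
    return a

%timeit init_then_fill(N)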

The numpy devs could presumably have added a fast path to np.full for the case where the fill value is zero, but why bother adding another way to do the same thing - users can just use np.zeros in the first place.

wim
  • I trust psutil more than tracemalloc (https://stackoverflow.com/q/50148554/5769463), even if for numpy the memory usage should be recorded correctly. However, tracemalloc doesn't show the really committed memory - I think the OS knows best how much memory is really in use. – ead Mar 11 '20 at 18:09
  • I also doubt there is a much faster way to zero memory than the one used by np.full (Python/numpy overhead doesn't play a huge role here) - otherwise np.full would use it. In your measurements, np.zeros would be zeroing memory at about 53333GB/s, which is not possible. – ead Mar 11 '20 at 18:09
  • Hmm, you are right about `np.zeros`, there must be something else going on here - I'll delete that part of my answer and keep investigating. I should have checked the absolute numbers more carefully. Maybe some lazy init hidden somewhere. – wim Mar 11 '20 at 19:28
  • @wim: It's almost certain that the large pre-zeroed allocation is faster because it's relying on the OS providing pre-zeroed pages. The same way `calloc` can be `malloc`+`memset` for small allocations, but plain `malloc` for large allocations (large enough to make a dedicated request for memory from the OS, e.g. via Windows' `VirtualAlloc` or UNIX-like's anonymous `mmap`) with no zeroing step. On Linux, those pages are copy-on-write mappings of the zero page, which means the first write to each page is more expensive; `np.full` would force the early copy, and have consistent write performance. – ShadowRanger Mar 11 '20 at 20:21
  • @wim: Yeah, just checked. If there's no cached zeroed data in the bucketed free list, `npy_alloc_cache_zero` defers to `PyDataMem_NEW_ZEROED`. `PyDataMem_NEW_ZEROED`'s primary work is done via a `calloc` call, which has the optimizations I mentioned on most allocators. – ShadowRanger Mar 11 '20 at 20:24
  • @ShadowRanger That looks correct on Linux. Not sure if such a magic trick ("here's one I've prepared earlier") is available on other platforms, though. – wim Mar 11 '20 at 20:24
  • @wim: Aside from embedded systems, all OSes I know of provide zeroed memory from the OS-specific allocation methods (which `malloc`/`calloc` use directly when large requests are made). It's a security measure; they don't want data left in the memory from the last process that had it. Whether it's lazy zeroing (page is COW of zero page, copied immediately before write) or eager (kernel thread zeroes pages, with allocations blocking if not enough available) differs from OS to OS, but it's always zeroed. See [Why malloc+memset is slower than calloc?](https://stackoverflow.com/q/2688466/364696) (a sketch of the first-write cost follows these comments) – ShadowRanger Mar 11 '20 at 20:27
  • @ShadowRanger Interesting. Thank you! – wim Mar 11 '20 at 20:33
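
A minimal sketch of the first-write cost described above (assuming Linux-style copy-on-write zero pages; timings vary by machine, but the first write should be noticeably slower than the second):

import time
import numpy as np

N = 10**8
a = np.zeros(N, dtype=np.float64)  # on Linux: COW mappings of the zero page

t0 = time.perf_counter()
a[:] = 1.0  # first write: the OS must actually provide/copy each page
t1 = time.perf_counter()
a[:] = 1.0  # second write: pages already committed
t2 = time.perf_counter()

print("first write :", t1 - t0)
print("second write:", t2 - t1)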

The numpy.zeros function goes straight to the C layer of the NumPy library, whereas the ones and full functions both work by first initializing an array and then copying the desired value into it.

So the zeros function needs no Python-level work, while for the others, ones and full, Python code has to run before the work is handed off to C.

Have a look at the source code to figure it out for yourself: https://github.com/numpy/numpy/blob/master/numpy/core/numeric.py
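
For reference, np.full in that file is essentially the following (a simplified paraphrase, not the verbatim source; see the link above for the exact code):

import numpy as np

def full_sketch(shape, fill_value, dtype=None):
    # simplified paraphrase of np.full: allocate uninitialized memory,
    # then copy the fill value into every element; the real implementation
    # likewise infers the dtype from fill_value when dtype is None
    if dtype is None:
        dtype = np.asarray(fill_value).dtype
    a = np.empty(shape, dtype=dtype)
    np.copyto(a, fill_value, casting='unsafe')
    return a

a = full_sketch(10**8, 0.0)  # behaves like np.full(10**8, 0.0)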

Laurent GRENIER