
I was curious about how much difference it really makes to use np.empty instead of np.zeros, and also how both compare to np.ones. I ran this small script to benchmark the time each of them takes to create a large array:

import numpy as np
from timeit import timeit

N = 10_000_000
dtypes = [np.int8, np.int16, np.int32, np.int64,
          np.uint8, np.uint16, np.uint32, np.uint64,
          np.float16, np.float32, np.float64]
rep = 100
print(f'{"DType":8s} {"Empty":>10s} {"Zeros":>10s} {"Ones":>10s}')
for dtype in dtypes:
    name = dtype.__name__
    time_empty = timeit(lambda: np.empty(N, dtype=dtype), number=rep) / rep
    time_zeros = timeit(lambda: np.zeros(N, dtype=dtype), number=rep) / rep
    time_ones = timeit(lambda: np.ones(N, dtype=dtype), number=rep) / rep
    print(f'{name:8s} {time_empty:10.2e} {time_zeros:10.2e} {time_ones:10.2e}')

And obtained the following table as a result:

DType         Empty      Zeros       Ones
int8       1.39e-04   1.76e-04   5.27e-03
int16      3.72e-04   3.59e-04   1.09e-02
int32      5.85e-04   5.81e-04   2.16e-02
int64      1.28e-03   1.13e-03   3.98e-02
uint8      1.66e-04   1.62e-04   5.22e-03
uint16     2.79e-04   2.82e-04   9.49e-03
uint32     5.65e-04   5.20e-04   1.99e-02
uint64     1.16e-03   1.24e-03   4.18e-02
float16    3.21e-04   2.95e-04   1.06e-02
float32    6.31e-04   6.06e-04   2.32e-02
float64    1.18e-03   1.16e-03   4.85e-02

From this I extract two somewhat surprising conclusions:

  • There is virtually no difference between the performance of np.empty and np.zeros, except perhaps a small difference for int8. I don't understand why this is the case: creating an empty array is supposed to be faster, and I have actually seen reports of that (e.g. Speed of np.empty vs np.zeros).
  • There is a large difference between np.zeros and np.ones. I suspect this has to do with high-performance mechanisms for zeroing memory that do not apply to filling a memory area with an arbitrary constant, but I don't really know how or at what level that works (a rough check of this suspicion is sketched below).
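
A rough way to check that suspicion (assuming a 4 KiB page size, and that any deferred zeroing cost would have to surface on the first write into each page) would be to time allocation together with a write that touches every page:

import numpy as np
from timeit import timeit

N = 10_000_000
rep = 100
PAGE = 4096  # assumed page size in bytes

def alloc_and_touch(factory, dtype=np.float64):
    # Allocate, then write one element per page so that any deferred
    # zeroing / page faulting has to happen inside the timed region.
    a = factory(N, dtype=dtype)
    step = PAGE // a.itemsize
    a[::step] = 2
    return a

for name, factory in [('empty', np.empty), ('zeros', np.zeros), ('ones', np.ones)]:
    t = timeit(lambda: alloc_and_touch(factory), number=rep) / rep
    print(f'{name:6s} {t:10.2e}')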

What is the explanation for these results?

I am using NumPy 1.15.4 and Python 3.6 (Anaconda, with MKL) on Windows 10, and I have an Intel Core i7-7700K CPU.

EDIT: As per a suggestion in the comments, I tried running the benchmark interleaving the individual trials and averaging at the end, but I couldn't see a significant difference in the results. On a related note, though, I don't know whether NumPy has any mechanism to reuse the memory of a just-deleted array, which would make the measurements unrealistic (although the times do seem to go up with the data type size even for empty arrays).
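
A simple (possibly naive) way to check for that kind of buffer reuse, assuming the pointer reported by `__array_interface__` is a faithful proxy for the underlying allocation, would be something like:

import numpy as np

N = 10_000_000

def data_addr(a):
    # Base address of the array's underlying data buffer
    return a.__array_interface__['data'][0]

addresses = []
for _ in range(5):
    a = np.empty(N, dtype=np.float64)
    addresses.append(hex(data_addr(a)))
    del a  # drop the array before the next allocation

print(addresses)  # repeated addresses would suggest the buffer is being reused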

jdehesa
  • There is a comment by @PavenMinaev in [this post](https://stackoverflow.com/questions/1538420/difference-between-malloc-and-calloc) that an OS such as FreeBSD can zero out unused memory blocks when the CPU is idle, so that when `calloc` is called it just returns one of these blocks without having to zero out a block on the spot. Not sure if something similar is happening on Windows 10. Also, to minimize the effect of the cache, maybe instead of running all the reps for the same dtype in one go before moving on to the next, run them interleaved, accumulate the time, and divide at the end. – lightalchemist Mar 13 '19 at 15:47
  • _"reuse the memory of a just deleted array"_ I'm pretty sure it does. In a shell you can try something like `np.empty(4)`, `np.ones(4)`, `np.empty(4)`. From the second call to `empty` I get four ones, probably no coincidence. – Paul Panzer Mar 13 '19 at 17:04

1 Answer


This should really be a comment but it won't fit. Here is a small extension of your script, with some "hand-made" versions of zeros and ones.

import numpy as np
from timeit import timeit

N = 10_000_000
dtypes = [np.int8, np.int16, np.int32, np.int64,
          np.uint8, np.uint16, np.uint32, np.uint64,
          np.float16, np.float32, np.float64]
rep = 100
print(f'{"DType":8s} {"Empty":>10s} {"Zeros":>10s} {"Ones":>10s} {"Full(0)":>10s} {"Full(1)":>10s}  {"Empty+0":>10s} {"Empty+1":>10s}')
for dtype in dtypes:
    name = dtype.__name__
    time_empty = timeit(lambda: np.empty(N, dtype=dtype), number=rep) / rep
    time_zeros = timeit(lambda: np.zeros(N, dtype=dtype), number=rep) / rep
    time_ones = timeit(lambda: np.ones(N, dtype=dtype), number=rep) / rep
    time_full_zeros = timeit(lambda: np.full(N, 0, dtype=dtype), number=rep) / rep
    time_full_ones = timeit(lambda: np.full(N, 1, dtype=dtype), number=rep) / rep
    time_empty_zeros = timeit(lambda: np.copyto(np.empty(N, dtype=dtype), 0), number=rep) / rep
    time_empty_ones = timeit(lambda: np.copyto(np.empty(N, dtype=dtype), 1), number=rep) / rep
    print(f'{name:8s} {time_empty:10.2e} {time_zeros:10.2e} {time_ones:10.2e} {time_full_zeros:10.2e} {time_full_ones:10.2e}  {time_empty_zeros:10.2e} {time_empty_ones:10.2e} ')

The timings are suggestive.

DType         Empty      Zeros       Ones    Full(0)    Full(1)     Empty+0    Empty+1
int8       1.37e-06   6.33e-04   5.73e-04   5.76e-04   5.73e-04    6.05e-04   5.82e-04 
int16      1.61e-06   1.55e-03   3.54e-03   3.54e-03   3.56e-03    3.54e-03   3.54e-03 
int32      7.22e-06   6.99e-06   1.24e-02   1.20e-02   1.25e-02    1.19e-02   1.21e-02 
int64      8.26e-06   8.06e-06   2.62e-02   2.64e-02   2.61e-02    2.62e-02   2.62e-02 
uint8      1.32e-06   6.30e-04   5.85e-04   5.86e-04   5.77e-04    5.70e-04   5.83e-04 
uint16     1.32e-06   1.63e-03   3.61e-03   3.65e-03   4.08e-03    4.08e-03   3.58e-03 
uint32     7.08e-06   7.20e-06   1.48e-02   1.41e-02   1.63e-02    1.44e-02   1.32e-02 
uint64     7.14e-06   7.13e-06   2.69e-02   2.67e-02   2.82e-02    2.68e-02   2.72e-02 
float16    1.31e-06   1.55e-03   3.56e-03   3.79e-03   3.54e-03    3.53e-03   3.55e-03 
float32    7.11e-06   6.95e-06   1.36e-02   1.35e-02   1.37e-02    1.35e-02   1.37e-02 
float64    7.27e-06   7.33e-06   3.13e-02   3.00e-02   2.75e-02    2.80e-02   2.75e-02 

Regarding zeros being faster than ones: I seem to remember that, as suggested in the comments, zeros indeed uses calloc, which, being a system routine whose sole purpose is handing out blocks of zeros, is presumably good at that.
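
If you want to poke at that layer directly, here is a rough sketch comparing `calloc` against `malloc` followed by `memset` via `ctypes` (assuming a C runtime exposing these routines can be loaded this way; the `msvcrt` fallback on Windows may not behave like glibc):

import ctypes
import ctypes.util
from timeit import timeit

libc = ctypes.CDLL(ctypes.util.find_library('c') or 'msvcrt')

# Declare signatures so pointers are not truncated on 64-bit systems
libc.malloc.restype = ctypes.c_void_p
libc.malloc.argtypes = [ctypes.c_size_t]
libc.calloc.restype = ctypes.c_void_p
libc.calloc.argtypes = [ctypes.c_size_t, ctypes.c_size_t]
libc.memset.restype = ctypes.c_void_p
libc.memset.argtypes = [ctypes.c_void_p, ctypes.c_int, ctypes.c_size_t]
libc.free.restype = None
libc.free.argtypes = [ctypes.c_void_p]

NBYTES = 10_000_000
rep = 100

def with_calloc():
    # Ask the allocator for an already-zeroed block
    p = libc.calloc(NBYTES, 1)
    libc.free(p)

def with_malloc_memset():
    # Ask for an uninitialized block and zero it ourselves
    p = libc.malloc(NBYTES)
    libc.memset(p, 0, NBYTES)
    libc.free(p)

print('calloc         ', timeit(with_calloc, number=rep) / rep)
print('malloc + memset', timeit(with_malloc_memset, number=rep) / rep)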

Paul Panzer
  • Thanks. It seems there is a more or less fixed time for filling an array, one way or another, which is what applies to `np.ones` but not `np.zeros`. Interesting to see that in your case there are some significant differences between `np.empty` and `np.zeros`, but only for 8- and 16-bit types. – jdehesa Mar 13 '19 at 16:10
  • It is indeed interesting that `np.zeros` is faster than `np.empty`. `np.zeros` must initialize entries, while `np.empty` seems like just a malloc and that's all, without having to fill each memory slot with 0. – J Agustin Barrachina Oct 14 '19 at 11:09
  • Ok, according to [this](https://stackoverflow.com/questions/52262147/speed-of-np-empty-vs-np-zeros?noredirect=1&lq=1) empty is actually faster... – J Agustin Barrachina Oct 14 '19 at 11:14