From timing the creation of Nx4096x4096 arrays, it appears NumPy does it much faster when N = 2 or 3 than when N = 1:
import numpy as np
%timeit a = np.zeros((2, 4096, 4096), dtype=np.float32, order='C')
5.24 µs ± 98.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit a = np.zeros((4096, 4096), dtype=np.float32, order='C')
23.4 ms ± 401 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
The difference is shocking. Why is that, and how can I make the N = 1 case at least as fast as the N > 1 case? Could %timeit simply be wrong for timing this?
Context: I need to create another single 4096 x 4096 array with a different type (uint8), and I'm trying to get the fastest Pythonic (or NumPy-related) implementation. The Nx4096x4096 array will be populated with non-zero values from a 3-column array (read from a file), where the 1st column holds 1D coordinates and the 2nd and 3rd columns are the intensity values for the 1st and 2nd image (hence the N = 2 case); a minimal sketch of what I mean is below. Using a sparse matrix is not an option for now. There are 130 million such files, so the above happens that many times.
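For concreteness, here is a rough sketch of the fill step I have in mind (the file read and the names data/coords are placeholders, not my actual code):

import numpy as np

# Hypothetical 3-column input: column 0 holds flat (1D) indices into a
# 4096x4096 image, columns 1 and 2 hold the intensities for images 0 and 1.
data = np.loadtxt('example.txt')
coords = data[:, 0].astype(np.intp)

a = np.zeros((2, 4096, 4096), dtype=np.float32, order='C')
# Each a[i] is C-contiguous, so ravel() returns a writable view and the
# fancy-index assignment writes into the big array directly.
a[0].ravel()[coords] = data[:, 1]
a[1].ravel()[coords] = data[:, 2]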
[EDIT] This is under Python 3.6.4 and numpy 1.14 on macOS Sierra. The same versions under Windows do not reproduce this behavior: there, np.zeros() for the smaller array takes about half the time of the twice-larger array. From the comments and the mentioned duplicate question I understand this can be due to thresholds in memory allocation (a quick probe is sketched below). This does, however, defeat the purpose of %timeit.
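If the threshold explanation is right, a quick probe like the following (my own sketch, not taken from the duplicate) should show the timing jump as the requested size crosses the allocator's cutoff; the chosen sizes are guesses and the cutoff depends on the platform:

import numpy as np
from timeit import timeit

# Time np.zeros for increasing sizes to see where the behavior changes.
for n in (256, 512, 1024, 2048, 4096, 8192):
    t = timeit(lambda: np.zeros((n, n), dtype=np.float32), number=100) / 100
    print(f'{n}x{n}: {t * 1e6:.1f} µs per call')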
[EDIT 2] Given the duplicate question, the question here is now more about how to time this function properly, without having to write extra code that accesses the array so the OS actually commits the memory (see the sketch below for the kind of extra code I mean). Wouldn't that extra code bias the timing? Isn't there a simpler way to profile this?
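To illustrate, this is the kind of extra code I mean (a sketch of mine, not a proposed answer): timing the allocation alone versus allocation plus a full write that forces the OS to actually commit the pages. The write clearly dominates, which is exactly the bias I am worried about:

import numpy as np
from timeit import timeit

def alloc_only():
    np.zeros((4096, 4096), dtype=np.float32)

def alloc_and_touch():
    a = np.zeros((4096, 4096), dtype=np.float32)
    a[:] = 1.0  # touching every element forces the memory to be committed

print(timeit(alloc_only, number=100) / 100)       # allocation only
print(timeit(alloc_and_touch, number=100) / 100)  # allocation + page faults + write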