While optimizing some numerical code, I measured the performance of NumPy functions like np.cumsum over different axes.
In [51]: arr = np.arange(int(1E6)).reshape(int(1E3), -1)
In [52]: %timeit arr.cumsum(axis=1)
2.27 ms ± 10.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [53]: %timeit arr.cumsum(axis=0)
4.16 ms ± 10.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
cumsum over axis 1 is almost 2x faster than cumsum over axis 0. Why is that, and what is going on behind the scenes? It'd be nice to have a clear understanding of the reason. Thanks!
Update: After a bit of research, I realized that if an application always sums over one particular axis, the array should be allocated in the matching memory order: C order for axis=1 sums, or Fortran order for axis=0 sums, to save CPU time.
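As a sketch of why memory order matters (assuming a 64-bit integer dtype, so each element is 8 bytes): the strides attribute shows how many bytes NumPy steps to reach the next element along each axis. The axis with an 8-byte stride is the one that walks memory sequentially, which is the cache-friendly direction for cumsum.

```python
import numpy as np

# Same values in two memory layouts (dtype pinned to int64 so strides are 8-byte based).
c_arr = np.arange(int(1e6), dtype=np.int64).reshape(1000, 1000)  # C (row-major) order
f_arr = np.asfortranarray(c_arr)                                 # Fortran (column-major) order

# strides = bytes to step to the next element along each axis.
print(c_arr.strides)  # (8000, 8): axis=1 steps 8 bytes -> contiguous, fast cumsum(axis=1)
print(f_arr.strides)  # (8, 8000): axis=0 steps 8 bytes -> contiguous, fast cumsum(axis=0)

# The results are identical either way; only the traversal cost differs.
assert np.array_equal(c_arr.cumsum(axis=0), f_arr.cumsum(axis=0))
```

So for a C-ordered array, cumsum(axis=1) reads adjacent memory locations, while cumsum(axis=0) jumps a whole row (8000 bytes here) between elements; Fortran order simply flips which axis gets the cheap traversal.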
Also: this excellent answer on the difference between contiguous and non-contiguous arrays helped a lot!