
We know that numpy arrays are stored in C order, so .sum(axis=1) should be faster than .sum(axis=0). But I find that

import numpy as np

arr = np.ones([64, 64])
arr.flags
%timeit arr.sum(axis=1)  # 4.15 µs ± 97.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit arr.sum(axis=0)  # 4.67 µs ± 188 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

But when the size changes to 10000:

arr = np.ones([10000, 10000])
arr.flags
%timeit arr.sum(axis=1)  # 109 ms ± 2.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit arr.sum(axis=0)  # 57.5 ms ± 2.49 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
  • ```np.sum``` has well-optimized, vectorized code paths. That is why 1. you don't see a strong effect from non-local memory access and 2. the sum along the outer dimension can keep partial sums (one per column) and fully vectorize that summation. That's the factor of 2 you see – Homer512 Aug 07 '22 at 15:51
  • @Homer512 Still confused; could you give some reference? I thought vectorization was about registers, but when doing a column sum (axis=0), cache misses would happen frequently, which should slow things down a lot, so arr.sum(axis=0) should be slower than arr.sum(axis=1) – nan Aug 08 '22 at 01:05
  • 1
  • ```sum(axis=0)``` can be implemented as ```acc = np.zeros_like(arr[0]); for row in arr: acc += row``` (see the sketch after these comments). That can be streamed from memory efficiently, and within one loop iteration all operations are independent and vectorized. ```sum(axis=-1)``` might still be faster, but that needs more tuning, so it doesn't surprise me that it is slightly slower in practice – Homer512 Aug 08 '22 at 06:37
  • @Homer512 excellent! Thank you for your answer. It is really a complicated problem, so I guess testing could be the best way to write high-performance code. – nan Aug 08 '22 at 07:39
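To make the comment's point concrete, here is a minimal sketch of that row-streaming formulation (the acc name is mine; otherwise it just spells out the one-liner from the comment):

import numpy as np

arr = np.ones([10000, 10000])

# Accumulate a running vector of column sums. Each `acc += row` is a purely
# vertical, element-wise operation, so it vectorizes trivially, and the rows
# are streamed from memory in contiguous (C) order.
acc = np.zeros_like(arr[0])
for row in arr:
    acc += row

assert np.allclose(acc, arr.sum(axis=0))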

1 Answer


First of all, I cannot reproduce the effect with the latest version of Numpy (dev) or with version 1.21.5. I get 30.5 ms and 36.5 ms respectively (so the opposite behaviour). Thus, the result is likely dependent on the hardware.

On one hand, arr.sum(axis=1) uses a pairwise summation algorithm. This algorithm is not so simple to implement efficiently. Numpy unrolls the leaf computation to speed the algorithm up, but it is not manually vectorized. In fact, the compiler (GCC) also fails to vectorize it automatically because it is far from trivial. Numpy also takes advantage of prefetching instructions to speed the computation up and make it more cache-friendly. That being said, this is not very useful on modern processors. The cache accesses are not perfect, but the L1 cache and cache lines reduce the overhead of the non-contiguous accesses.
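For reference, here is a minimal, unoptimized Python sketch of what pairwise summation does; the block size of 128 is an illustrative stand-in for Numpy's unrolled leaf, not the exact implementation:

def pairwise_sum(x, block=128):
    # Small leaves are summed with a plain loop (Numpy unrolls this part).
    if len(x) <= block:
        s = 0.0
        for v in x:
            s += v
        return s
    # Larger inputs are split in half and the two partial sums combined,
    # which keeps the rounding error logarithmic in the input length rather
    # than linear as with a single naive accumulator.
    mid = len(x) // 2
    return pairwise_sum(x[:mid], block) + pairwise_sum(x[mid:], block)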

On the other hand, arr.sum(axis=0) is fully manually vectorized. This is because it is easier to vectorize operations vertically (e.g. array+array) than horizontally (e.g. a reduction). On a recent x86-64 architecture with AVX/AVX-2, it should make use of this SIMD instruction set, which is capable of computing 4 doubles at a time. That being said, the operation is not unrolled, and the result is accumulated in an array that can be bigger than the L1 cache. Indeed, the reduction is done row by row, and each row takes 80 KB, while, for example, my i5-9600KF processor has an L1 cache of 32 KB. This means data must be constantly reloaded from the L2 cache in this case.
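A quick back-of-the-envelope check of that claim (the 32 KB figure is the per-core L1 data cache size of the i5-9600KF mentioned above):

row_bytes = 10000 * 8        # one float64 row of the 10000x10000 array
l1_bytes = 32 * 1024         # per-core L1 data cache of an i5-9600KF
print(row_bytes / 1024)      # 78.125 -> the ~80 KB accumulator row
print(row_bytes / l1_bytes)  # ~2.4: the accumulator alone overflows the L1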

Depending on the target architecture of the processor, the cache sizes and the memory, results can differ. Theoretically, nothing prevents Numpy from implementing code that saturates the memory throughput, but as of now, the former use-case is more complex to optimize. Since the Numpy code is already quite complex and the operation tends to be memory-bound on most machines (at least for a given core), the former case has not been optimized yet.
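As a sanity check that the operation is indeed memory-bound, you can turn the timings of the question into an effective memory throughput (using the 57.5 ms figure for sum(axis=0); the absolute number depends on your machine):

n = 10000
bytes_read = n * n * 8       # the whole float64 array is read once
t = 57.5e-3                  # sum(axis=0) timing from the question
print(bytes_read / t / 1e9)  # ~13.9 GB/s effective memory throughput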


Jérôme Richard