46

Why is np.dot so much faster than np.sum? Following this answer we know that np.sum is slow and has faster alternatives.

For example:

In [20]: A = np.random.rand(1000)

In [21]: B = np.random.rand(1000)

In [22]: %timeit np.sum(A)
3.21 µs ± 270 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [23]: %timeit A.sum()
1.7 µs ± 11.5 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

In [24]: %timeit np.add.reduce(A)
1.61 µs ± 19.6 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

But all of them are slower than:

In [25]: %timeit np.dot(A,B)
1.18 µs ± 43.9 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

Given that np.dot multiplies two arrays elementwise and then sums the result, how can it be faster than just summing one array? If B were the all-ones array, np.dot would simply be summing A.

So it seems the fastest option to sum A is:

In [26]: O = np.ones(1000)
In [27]: %timeit np.dot(A,O)
1.16 µs ± 6.37 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

This can't be right, can it?

This is on Ubuntu with numpy 1.24.2 using openblas64 on Python 3.10.6.

Supported SIMD extensions in this NumPy install:

baseline = SSE,SSE2,SSE3
found = SSSE3,SSE41,POPCNT,SSE42,AVX,F16C,FMA3,AVX2
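
(For reference, this build and SIMD information can be printed with something like the following; np.show_runtime requires NumPy >= 1.24.)

import numpy as np

np.show_config()     # build info: linked BLAS/LAPACK and supported SIMD extensions
np.show_runtime()    # runtime info (NumPy >= 1.24), including detected CPU features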

Update

The order of the timings reverses if the array is much longer. That is:

In [28]: A = np.random.rand(1000000)
In [29]: O = np.ones(1000000)
In [30]: %timeit np.dot(A,O)
545 µs ± 8.87 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [31]: %timeit np.sum(A)
429 µs ± 11 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)    
In [32]: %timeit A.sum()
404 µs ± 2.95 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [33]: %timeit np.add.reduce(A)
401 µs ± 4.21 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

This suggests to me that there is some fixed-size overhead when calling np.sum(A), A.sum() and np.add.reduce(A) that doesn't exist when calling np.dot(), but that the part of the code which does the summation is in fact faster.
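
A rough way to sanity-check this overhead idea (a sketch, not something I have benchmarked here) is to time the same calls on a tiny array, where the summation work itself is negligible and mostly the per-call overhead remains:

tiny = np.random.rand(1)
one = np.ones(1)
%timeit np.sum(tiny)          # with a 1-element array the timing is mostly per-call overhead
%timeit tiny.sum()
%timeit np.add.reduce(tiny)
%timeit np.dot(tiny, one)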

---

Any speed-ups using Cython, Numba, plain Python, etc. would be great to see.

phuclv
Simd
  • Refer to this similar question [here](https://stackoverflow.com/questions/42227432/why-is-it-that-np-dot-is-so-much-faster-than-finding-the-dot-product-using-for-l) – jony-jas Feb 24 '23 at 11:59
  • @JonyJasJ: `numpy.dot` and `numpy.sum` both do their math in C code. – user2357112 Feb 24 '23 at 12:00
  • I don't get that difference. With my test locally, I get 1.54µs for `A.sum()` and 1.72µs for `np.dot(A,O)` – oskros Feb 24 '23 at 12:01
  • Have you tried timing np.dot before np.sum? Did you get the same results? – Hamzah Feb 24 '23 at 12:02
  • @oskros which OS? – Simd Feb 24 '23 at 12:02
  • @Simd Windows 10 Enterprise 64-bit – oskros Feb 24 '23 at 12:02
  • macOS M1, NumPy 1.24.2, Python 3.11.2: `A.sum` is 807 ns, reduce variant 725 ns and `np.dot` 913 ns. – 9769953 Feb 24 '23 at 12:11
  • @9769953 what does np.show_config() say you are using for BLAS library? – Simd Feb 24 '23 at 12:47
  • @oskros what does np.show_config() say you are using for BLAS library? – Simd Feb 24 '23 at 12:47
  • BLAS and macOS? Part of the define_macros is `('HAVE_CBLAS', None)`. And for relevance, SIMD extensions show `baseline = NEON,NEON_FP16,NEON_VFPV4,ASIMD; found = ASIMDHP,ASIMDDP`. That would fit the timings and the answers on BLAS usage. – 9769953 Feb 24 '23 at 12:55
  • I wonder if a cython implementation would be the fastest option? – Simd Feb 24 '23 at 16:18
  • Both Numba and Python can be fast for this. Faster than Numpy and dot because of the mentioned overheads. The Cython code needs to be compiled with advanced optimization flags (like `-O3` on GCC/Clang) or it will certainly not be auto-vectorized (auto-vectorization was only enabled in O3 on GCC). `-march=native` can help too (for GCC/Clang), or the equivalent flag with MSVC. – Jérôme Richard Feb 24 '23 at 23:10
  • By the way, the Firestorm cores of M1 processors have more SIMD units than modern x86-64 processors: 4 as opposed to 2 for x86-64. That being said, recent x86-64 processors have wider SIMD units (AVX is 256-bit wide on x86-64 while Neon on the M1/ARM is 128-bit wide). This makes scalar code like NumPy's twice as fast on the M1, but vectorized code should be about as fast at the same frequency (the M1 operates at a lower frequency than average x86-64 processors). This explains why the results are faster on the M1. – Jérôme Richard Feb 24 '23 at 23:14
  • @JérômeRichard could you show numba and python speed ups in your answer? – Simd Feb 25 '23 at 05:41
  • I have added a Numba code which is faster than `np.dot` on my machine. Defeating OpenBLAS is not easy but possible. – Jérôme Richard Feb 25 '23 at 12:55

2 Answers

39

numpy.dot delegates to a BLAS vector-vector dot product (ddot) here, while numpy.sum uses a pairwise summation routine, switching over to an 8x unrolled summation loop at a block size of 128 elements.
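
For illustration, here is a rough Python sketch of the pairwise scheme (the real routine is C with an 8x unrolled inner loop; details like the exact split points differ):

def pairwise_sum(a):
    # Rough sketch of pairwise summation: small blocks are added in a
    # plain loop; larger arrays are split in two halves that are summed
    # recursively and then combined.
    n = len(a)
    if n <= 128:
        s = 0.0
        for x in a:
            s += x
        return s
    mid = n // 2
    return pairwise_sum(a[:mid]) + pairwise_sum(a[mid:])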

I don't know what BLAS library your NumPy is using, but a good BLAS will generally take advantage of SIMD operations, while numpy.sum doesn't do that, as far as I can see. Any SIMD usage in the numpy.sum code would have to be compiler autovectorization, which is likely less efficient than what the BLAS does.

When you raise the array sizes to 1 million elements, at that point, you're probably hitting a cache threshold. The dot code is working with about 16 MB of data, and the sum code is working with about 8 MB. The dot code might be getting data bumped to a slower cache level or to RAM, or perhaps both dot and sum are working with a slower cache level and dot is performing worse because it needs to read more data. If I try raising the array sizes gradually, the timings are more consistent with some sort of threshold effect than with sum having higher per-element performance.
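
A sketch of that kind of sweep (sizes and names are just for illustration, not the exact script used here):

import numpy as np
from timeit import timeit

# Time a.sum() and np.dot(a, b) over gradually increasing sizes to look
# for a threshold effect rather than a constant per-element gap.
for n in (10_000, 100_000, 300_000, 1_000_000, 3_000_000):
    a = np.random.rand(n)
    b = np.ones(n)
    reps = 200
    t_sum = timeit(lambda: a.sum(), number=reps) / reps
    t_dot = timeit(lambda: np.dot(a, b), number=reps) / reps
    print(f"{n:>9}  sum: {t_sum * 1e6:8.1f} µs   dot: {t_dot * 1e6:8.1f} µs")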

user2357112
  • @Simd: Looks more like a cache threshold than a fixed overhead issue to me. – user2357112 Feb 25 '23 at 11:44
  • That makes sense for when the array is large. But when the array is small it looks like the slowdown might be caused by some overhead that dot doesn't have. – Simd Feb 25 '23 at 12:26
  • @Simd: `numpy.sum` does *also* have more fixed overhead, due to a bunch of Python-level pre-checks for stuff like `__array_function__` that I don't think `dot` does. This is why there's that performance gap between `numpy.sum(A)` and `A.sum()` - `A.sum()` doesn't go through those prechecks either. – user2357112 Feb 25 '23 at 12:31
  • I wonder if A.sum() also does checks that np.dot doesn't. – Simd Feb 25 '23 at 19:02
33

This answer complements the good answer of @user2357112 with additional details. Both functions are optimized. That being said, the pairwise summation is generally a bit slower while generally giving a more accurate result; it is also sub-optimal, though still relatively good. OpenBLAS, which is used by default on Windows, does not perform a pairwise summation.

Here is the assembly code for the Numpy code:

[assembly screenshot]

Here is the assembly code for the OpenBLAS code:

[assembly screenshot]

The main issue with the Numpy code is that it does not use AVX (the 256-bit SIMD instruction set) but SSE (the 128-bit SIMD instruction set), as opposed to OpenBLAS, at least up to version 1.22.4 (the one I use). Even worse: the instructions in the Numpy code are scalar ones! We recently worked on this and the latest version of Numpy should now use AVX. That being said, it may still not be as fast as OpenBLAS because of the pairwise summation (especially for big arrays).

Note that both functions spend a non-negligible time in overheads because the arrays are small. Such overheads can be removed with a hand-written implementation in Numba (see the fast implementation below).

> The order of the timings reverses if the array is much longer.

This is expected. Indeed, these functions are rather compute-bound when they operate in cache, but they become memory-bound when the array is big and only fits in the L3 cache or even in RAM. As a result, np.dot starts to be slower for bigger arrays since it needs to read twice as much data from memory. More specifically, it needs to read 8*1000000*2/1024**2 ~= 15.3 MiB, so the data likely has to come from your RAM, which has a rather limited throughput. In fact, a good dual-channel 3200 MHz DDR4 RAM like mine can reach a practical throughput close to 40 GiB/s, which gives 15.3/(40*1024) ~= 374 µs. That being said, sequential code can hardly saturate this throughput completely, so reaching 30 GiB/s sequentially is already great, not to mention that many mainstream PC RAM modules operate at a lower frequency. A 30 GiB/s throughput results in ~500 µs, which is close to your timings. Meanwhile, np.sum and np.add.reduce are more compute-bound because of their inefficient implementation, but the amount of data they read is half as large and may fit better in the L3 cache, which has a significantly higher throughput.

To prove this effect, you can simply try to run:

# L3 cache of 9 MiB

# 2 x 22.9 = 45.8 MiB
a = np.ones(3_000_000)
b = np.ones(3_000_000)
%timeit -n 100 np.dot(a, a)   #  494 µs => read from RAM
%timeit -n 100 np.dot(a, b)   # 1007 µs => read from RAM

# 2 x 7.6 = 15.2 MiB
a = np.ones(1_000_000)
b = np.ones(1_000_000)
%timeit -n 100 np.dot(a, a)   #  90 µs => read from the L3 cache
%timeit -n 100 np.dot(a, b)   # 283 µs => read from RAM

# 2 x 1.9 = 3.8 MiB
a = np.ones(250_000)
b = np.ones(250_000)
%timeit -n 100 np.dot(a, a)   # 40 µs => read from the L3 cache (quite compute-bound)
%timeit -n 100 np.dot(a, b)   # 46 µs => read from the L3 cache too (quite memory-bound)

On my machine, the L3 cache is only 9 MiB, so the second call not only needs to read twice as much data, but also has to read more of it from the slower RAM rather than from the L3 cache.

For small arrays, the L1 cache is so fast that reading the data should not be a bottleneck. On my i5-9600KF machine, the throughput of the L1 cache is huge: ~268 GiB/s. This means the optimal time to read two arrays of size 1000 is 8*1000*2/(268*1024**3) ~= 0.056 µs. In practice, the overhead of calling a Numpy function is much bigger than that.
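
Spelled out, the back-of-the-envelope estimates above look like this (a sketch; the 40 GiB/s and 268 GiB/s figures are the machine-specific throughputs mentioned above):

# Back-of-the-envelope estimates (float64 = 8 bytes per element).
big = 8 * 1_000_000 * 2            # bytes read by np.dot on two 1e6-element arrays
ram = 40 * 1024**3                 # ~40 GiB/s practical DDR4 throughput
print(big / ram * 1e6)             # ~373 µs lower bound when streaming from RAM

small = 8 * 1000 * 2               # bytes read for two 1000-element arrays
l1 = 268 * 1024**3                 # ~268 GiB/s L1 throughput
print(small / l1 * 1e6)            # ~0.056 µs: far below the per-call overhead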


Fast implementation

Here is a fast Numba implementation:

import numba as nb

# Function eagerly compiled only for 64-bit contiguous arrays
@nb.njit('float64(float64[::1],)', fastmath=True)
def fast_sum(arr):
    s = 0.0
    for i in range(arr.size):
        s += arr[i]
    return s
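
For reference, a minimal way to use and time it (since the signature is given explicitly, the function is compiled eagerly, so the first call does not pay the JIT compilation cost):

import numpy as np

A = np.random.rand(1_000_000)
fast_sum(A)            # same result as np.sum(A) up to floating-point rounding
%timeit fast_sum(A)    # to compare against %timeit np.dot(A, O) above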

Here are performance results:

 array items |    time    |  speedup (dot/numba_seq)
-------------|------------|-------------------------
 3_000_000   |   870 µs   |   x0.57
 1_000_000   |   183 µs   |   x0.49
   250_000   |    29 µs   |   x1.38

If you use the flag parallel=True and nb.prange instead of range, Numba will use multiple threads. This is good for large arrays, but it may not be for small ones on some machines (due to the overhead of creating threads and sharing the work); a sketch of this variant follows the timings below:

 array items |    time    |  speedup (dot/numba_par)
-------------|------------|-------------------------
 3_000_000   |   465 µs   |   x1.06
 1_000_000   |    66 µs   |   x1.36
   250_000   |    10 µs   |   x4.00
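
For reference, the parallel variant mentioned above could look like this sketch (the name fast_sum_par is mine; the fastmath caveats discussed below apply here too):

import numba as nb

@nb.njit('float64(float64[::1])', fastmath=True, parallel=True)
def fast_sum_par(arr):
    s = 0.0
    # nb.prange splits the iteration space across threads; Numba recognizes
    # the scalar accumulation and turns it into a parallel reduction.
    for i in nb.prange(arr.size):
        s += arr[i]
    return s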

As expected, Numba can be faster for small arrays (because the Numpy call overhead is mostly removed) and competitive with OpenBLAS for large arrays. The code generated by Numba is pretty efficient:

.LBB0_7:
        vaddpd  (%r9,%rdx,8), %ymm0, %ymm0
        vaddpd  32(%r9,%rdx,8), %ymm1, %ymm1
        vaddpd  64(%r9,%rdx,8), %ymm2, %ymm2
        vaddpd  96(%r9,%rdx,8), %ymm3, %ymm3
        vaddpd  128(%r9,%rdx,8), %ymm0, %ymm0
        vaddpd  160(%r9,%rdx,8), %ymm1, %ymm1
        vaddpd  192(%r9,%rdx,8), %ymm2, %ymm2
        vaddpd  224(%r9,%rdx,8), %ymm3, %ymm3
        vaddpd  256(%r9,%rdx,8), %ymm0, %ymm0
        vaddpd  288(%r9,%rdx,8), %ymm1, %ymm1
        vaddpd  320(%r9,%rdx,8), %ymm2, %ymm2
        vaddpd  352(%r9,%rdx,8), %ymm3, %ymm3
        vaddpd  384(%r9,%rdx,8), %ymm0, %ymm0
        vaddpd  416(%r9,%rdx,8), %ymm1, %ymm1
        vaddpd  448(%r9,%rdx,8), %ymm2, %ymm2
        vaddpd  480(%r9,%rdx,8), %ymm3, %ymm3
        addq    $64, %rdx
        addq    $-4, %r11
        jne     .LBB0_7

That being said, it is not optimal: the LLVM-Lite JIT compiler uses a 4x unrolling (4 independent accumulator registers), while an 8x unrolling should be optimal on my Intel Coffee Lake processor. Indeed, the latency of the vaddpd instruction is 4 cycles and 2 such instructions can be executed per cycle, so 8 registers are needed to avoid stalls and a latency-bound loop. On the other hand, this assembly code is optimal on Intel Alder Lake and Sapphire Rapids processors since their vaddpd latency is half as long. Saturating the FMA SIMD units is far from easy. I think the only way to write a faster function is to write native (C/C++) code using SIMD intrinsics, though it is less portable.

Note that the Numba code does not support special values like NaN or Inf because of fastmath (AFAIK OpenBLAS does). In practice, it should still work on x86-64 machines, but this is not guaranteed. Besides, the Numba code is not as numerically stable for very large arrays. The Numpy code should be the most numerically stable of the three variants (followed by the OpenBLAS code). You can compute the sum by chunks to improve numerical stability, though it makes the code more complex. There is no free lunch.
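
A chunked summation could look like the following sketch (the 4096-element chunk size is arbitrary; each chunk gets its own partial sum, which limits error growth compared to a single long running sum, though it is still not as robust as a full pairwise scheme):

import numba as nb

@nb.njit('float64(float64[::1])', fastmath=True)
def chunked_sum(arr):
    total = 0.0
    chunk = 4096
    for start in range(0, arr.size, chunk):
        end = min(start + chunk, arr.size)
        s = 0.0
        for i in range(start, end):
            s += arr[i]
        total += s
    return total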

Jérôme Richard
  • I *think* any SIMD usage in the `numpy.sum` code would have to be due to the compiler auto-vectorizing the unrolled loop - I don't see a code path with manual vectorization for this particular routine. – user2357112 Feb 24 '23 at 12:44
  • The timings I gave were for Linux with openblas64. I am not sure the same time difference exists under Windows. Would you be able to check? – Simd Feb 24 '23 at 12:46
  • @user2357112 Yes, this is not done manually AFAIR. But we tweak the compiler so as to enable AVX for some functions. The generator typically does that. I recently enabled AVX for more functions by requesting the generator to do that, but `np.sum` is a bit special because I remember it was manually unrolled and the code needs to be tweaked a bit. I planned to optimize it but have been delaying this for a while ^^" . I think a manual vectorization is certainly a good idea for this function. I forgot the most important point: the SSE instructions are scalar ones. – Jérôme Richard Feb 24 '23 at 13:04
  • @Simd The problem of the vectorization is also present on Linux. See my last related post on the topic: [No speedup when summing uint16 vs uint64 arrays with NumPy?](https://stackoverflow.com/questions/70134026/no-speedup-when-summing-uint16-vs-uint64-arrays-with-numpy/70136429#70136429) – Jérôme Richard Feb 24 '23 at 13:06
  • I have added an update that implies, at least to me, that the reason for the slow down is not the summation part of the code but some overhead that is of constant size. – Simd Feb 25 '23 at 11:08
  • I added additional details and a pretty fast Numba code :) . – Jérôme Richard Feb 25 '23 at 12:44
  • 1
    The numba timings with fastmath are very impressive! – Simd Feb 25 '23 at 14:11
  • Perhaps it has less of an impact on summation, but as someone forced to work with ARMv7-A, I cringe at enabling fastmath - it enables the broken NEON which flushes subnormals to 0 and is generally not IEEE-754 2008 compliant. Also, with what compiler and what flags was the `np.sum()` assembly you include compiled? Not asking about OpenBLAS because I'm assuming it's just handwritten assembly. – jaskij Feb 27 '23 at 10:57
  • @jaskij I did not compile Numpy myself: this is the standard package (for Windows) coming from [pypi](https://pypi.org/project/numpy/). I used version 1.22.4, which is not up to date though (hence the comment in the answer). OpenBLAS is a dependency of Numpy provided in the same package. There was a similar problem on Linux (see the above link), but I fixed the Numpy code so GCC was able to generate a significantly faster code. Still, the Numpy code needs to be improved. For Numba, the compilation parameters are the ones provided in the njit decorator. – Jérôme Richard Feb 27 '23 at 20:05
  • @jaskij I agree that fastmath is not great, but it is a simple and efficient way to break the dependency chain of such a FP loop. The JIT compiler cannot auto-vectorize the loop because of the strict IEEE-754 standard and specifically the non-associativity of FP operations. Numba does not support more precise flags unfortunately. That being said, C++ compilers like GCC/Clang often need more than `-fassociative-math` (sadly). While it is quite easy in C++ to write an IEEE-754 vectorized loop, it is unfortunately not the case in Numba (no stack-based arrays and no SIMD intrinsics by default) – Jérôme Richard Feb 27 '23 at 20:14
  • @JérômeRichard thanks for the info, it will be useful. I actually am responsible for cross-compiling Numpy for embedded devices at work. In the end, since a lot of our computationally heavy code is math, I chose to go with clang as it, to my knowledge, has the better auto-vectorizer and generally deals with math better. For OpenBLAS I just assumed assembly since it's been so heavily optimized over the years, with hand-written kernels all around. – jaskij Feb 28 '23 at 16:53
  • @jaskij The auto-vectorizer of Clang was clearly worse than GCC's 4-8 years ago, but Clang has improved over time. I still find Clang not as good as GCC in many cases involving vectorization, but it is pretty dependent on the use-case. Clang unrolls loops better than GCC, but GCC is able to do very clever optimizations in non-trivial cases. Both are relatively good in trivial cases. The Clang auto-vectorizer tends to fail when the loop is manually unrolled, which is cumbersome (I reported a bug about this but it did not get much attention so far). The best is to test. – Jérôme Richard Feb 28 '23 at 18:56
  • @jaskij If you use an Intel machine, you could try ICC, which is very good at auto-vectorizing code (it is widely used on HPC platforms and is, in practice, significantly better than both GCC and Clang for that). You could also use the MKL, which is generally a bit faster than OpenBLAS on Intel machines. For AMD machines, BLIS should be a good alternative to OpenBLAS. On AMD machines, ICC might still generate faster code than GCC/Clang; you could alternatively try AOCC on such platforms. – Jérôme Richard Feb 28 '23 at 19:00
  • @JérômeRichard good points, I will have to actually evaluate how our software performs between GCC and clang. And, for now, I'm mostly working on deploying on AArch64, although an x86-64 server is planned as well. No clue to the actual hardware as of yet though. – jaskij Feb 28 '23 at 20:44