I add a single integer to an array of 1000 integers. The addition is about 25% faster when I first cast the single integer from numpy.int64 to a python-native int.
Why? Should I, as a general rule of thumb, convert the single number to a native python type for single-number-to-array operations with arrays of roughly this size?
Note: this may be related to my previous question, Conjugating a complex number much faster if number has python-native complex type.
import numpy as np
nnu = 10418
nnu_use = 5210
a = np.random.randint(nnu, size=1000)       # array of 1000 integers
b = np.random.randint(nnu_use, size=1)[0]   # single numpy.int64 scalar
%timeit a + b # --> 3.9 µs ± 19.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit a + int(b) # --> 2.87 µs ± 8.07 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
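For completeness, the cast itself can be timed on its own (a sketch using the same b as above), to see how much of the difference is just the conversion overhead; I have not included that number here.
%timeit int(b)   # cost of the numpy.int64 -> int cast alone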
Note that the speed-up can be enormous (roughly a factor of 50) for scalar-to-scalar operations as well, as seen below:
np.random.seed(100)
a = (np.random.rand(1))[0]                                 # numpy.float64 scalar
a_native = float(a)                                        # python-native float
b = complex(np.random.rand(1) + 1j*np.random.rand(1))      # python-native complex
c = (np.random.rand(1) + 1j*np.random.rand(1))[0]          # numpy.complex128 scalar
c_native = complex(c)                                      # python-native complex
%timeit a * (b - b.conjugate() * c) # 6.48 µs ± 49.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit a_native * (b - b.conjugate() * c_native) # 283 ns ± 7.78 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit a * b # 5.07 µs ± 17.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit a_native * b # 94.5 ns ± 0.868 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
Update: Could it be that the latest numpy release fixes the speed difference? The release notes of numpy 1.23 mention that scalar operations are now much faster; see https://numpy.org/devdocs/release/1.23.0-notes.html#performance-improvements-and-changes and https://github.com/numpy/numpy/pull/21188. I am using python 3.7.6 and numpy 1.21.2.
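For anyone already on numpy >= 1.23 who wants to check, here is a minimal sketch of the same comparison (same sizes and value ranges as above), with the numpy version printed so the timings can be attributed to a release:
import numpy as np
print(np.__version__)                      # record which numpy the timings belong to
a = np.random.randint(10418, size=1000)    # array of 1000 integers
b = np.random.randint(5210, size=1)[0]     # single numpy.int64 scalar
%timeit a + b
%timeit a + int(b)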