TL;DR: `np.random.rand` is vectorized (in the NumPy sense) but its underlying code is scalar and not very efficient, while `array + 1` is both vectorized and makes use of wide SIMD instructions.
## Meaning of "vectorization"
First of all, NumPy vectorization is defined as follows:
> NumPy hands off array processing to C, where looping and computation are much faster than in Python. To exploit this, programmers using NumPy eliminate Python loops in favor of array-to-array operations. Vectorization can refer both to the C offloading and to structuring NumPy code to leverage it.
More generally in Python, vectorization means that the code calls functions that are natively compiled (typically in C/C++), not that it is particularly optimized to use CPU features efficiently (as indicated by @hpaulj in the comments).
This term has a different meaning for people working on compilers or using native languages like C/C++: in that context, vectorization means using SIMD instructions.
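To illustrate the NumPy meaning of the term, here is a minimal sketch (the array name and size are arbitrary): both snippets compute the same result, but only the second one offloads the loop to compiled C code.

```python
import numpy as np

arr = np.arange(1_000_000, dtype=np.float64)

# Pure-Python loop: every iteration goes through the interpreter.
out = np.empty_like(arr)
for i in range(arr.size):
    out[i] = arr[i] + 1.0

# "Vectorized" NumPy version: the whole loop runs in compiled C code instead.
out = arr + 1.0
```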
## `np.random.rand` uses scalar, unoptimized code
The following analysis has been done on Linux with NumPy 1.24.2 (from the standard packages) and CPython 3.11.2, running on an x86-64 Intel i5-9600KF processor.
`np.random.rand(10000)` is vectorized (it calls C code to do the actual computation), but it does not use SIMD instructions. In fact, it is currently not very well optimized in C either.
A low-level profiling shows that a significant part of the time is spent in two functions: `mt19937_gen` (~13%) and `random_standard_uniform_fill` (~6%). Most of the time seems to be lost in Cython checks, overheads and many small functions overall (~81%). `mt19937` is the well-known 32-bit Mersenne Twister generator. The code appears to come from Cython code (possibly calling the C++ standard library).
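As a quick sanity check (independent of the profiling above), you can confirm that the legacy `np.random` API is indeed backed by this generator:

```python
import numpy as np

# The legacy np.random.* functions (including np.random.rand) are backed by
# a global RandomState object whose bit generator is the 32-bit Mersenne
# Twister; its name is the first item of the legacy state tuple.
print(np.random.get_state()[0])  # prints: MT19937
```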
Nearly all of the time spent in `mt19937_gen` is located in the following assembly loop:
```
loop:
    and   esi,0x80000000
    add   rdx,0x4
    mov   ecx,esi
    mov   esi,DWORD PTR [rdx]
    mov   eax,esi
    and   eax,0x7fffffff
    or    eax,ecx
    mov   ecx,eax
    and   eax,0x1
    neg   eax
    shr   ecx,1
    xor   ecx,DWORD PTR [rdx+0x630]
    and   eax,0x9908b0df
    xor   eax,ecx
    mov   DWORD PTR [rdx-0x4],eax
    cmp   rdx,rdi
    jne   loop
```
This is a pure scalar integer loop: there are no SIMD instructions. It does not seem to be particularly optimized at first glance. Note that the processor can still execute many instructions in parallel (an IPC of up to 5 or 6 in code without bottlenecks or stalls), thanks to ILP (instruction-level parallelism, which a single CPU core can find and exploit on its own).
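For reference, this loop corresponds to the standard MT19937 state-update ("twist") step. A rough Python equivalent of one iteration could look like this (the function name and the list-based state are illustrative only; the constants are the standard MT19937 ones, matching those visible in the assembly):

```python
def mt19937_twist_step(mt: list[int], i: int) -> None:
    """One MT19937 state-update ("twist") step, roughly matching the
    assembly loop above. `mt` is the 624-word 32-bit state array and
    `i` the current index; indices are taken modulo 624."""
    # Combine the upper bit of mt[i] with the lower 31 bits of mt[i+1].
    y = (mt[i] & 0x80000000) | (mt[(i + 1) % 624] & 0x7FFFFFFF)
    # XOR with the word 397 positions ahead and the conditional constant.
    mt[i] = mt[(i + 397) % 624] ^ (y >> 1) ^ (0x9908B0DF if (y & 1) else 0)
```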
The other function spends most of its time calling another one (likely the previous one):
```
loop:
    mov   rdi,QWORD PTR [rbp+0x0]
    call  QWORD PTR [rbp+0x18]            <--- very slow
    movsd QWORD PTR [r13+rbx*8+0x0],xmm0
    add   rbx,0x1
    cmp   r12,rbx
    jne   loop                            <--- quite slow
```
This loop is clearly inefficient and not optimized at all. It is likely the one iterating over the array: it calls `mt19937_gen` 10,000 times so as to generate a random number stored in each item of the array (as indicated by @hpaulj in the comments).
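Schematically, this second loop does something like the following (a hypothetical sketch with made-up names, just to show the one-call-per-item pattern; the real code is compiled C/Cython):

```python
import numpy as np

def fill_uniform(next_double, state, out: np.ndarray) -> None:
    """Schematic equivalent of the filling loop above: one indirect call
    per element through a function pointer, with no batching and no SIMD."""
    for i in range(out.size):
        out[i] = next_double(state)
```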
Note that no other function called from the Cython code makes intensive use of SIMD instructions (this can be seen using hardware performance counters).
## Basic element-wise operations are optimized using SIMD instructions
The `array + 1` expression is optimized using SIMD instructions. More specifically, it uses AVX on x86-64 processors (if available on your machine, otherwise SSE). AVX can operate on 4 double-precision numbers per instruction, and modern mainstream processors can typically execute 2 such AVX instructions per cycle, so 8 double-precision numbers can be computed per cycle.
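If you want to check which SIMD extensions your NumPy build detected on your machine, recent NumPy versions (1.24+) provide `np.show_runtime()` (the exact output format may vary between versions):

```python
import numpy as np

# Prints, among other things, which SIMD extensions (SSE, AVX2, AVX-512, ...)
# were detected on the current CPU. Available in NumPy >= 1.24.
np.show_runtime()
```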
That being said, note that `array + 1` creates a new temporary array, which is quite expensive. It is generally better to do the operation in place if you can, using `np.add(array, 1, out=array)`.
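Here is a minimal timing sketch to compare the two on your own machine (absolute numbers depend on the CPU, the array size and whether the data fits in cache):

```python
import numpy as np
from timeit import timeit

array = np.random.rand(10_000)

# Out-of-place: allocates and fills a new temporary array on every call.
print(timeit(lambda: array + 1, number=100_000))

# In-place: reuses the existing buffer, so no temporary allocation.
print(timeit(lambda: np.add(array, 1, out=array), number=100_000))
```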