
I am using Ubuntu 22.10 on a ThinkPad X1 Carbon Gen 7 that has 8 cores.

The following code will use vectorisation:

import numpy as np
np.random.rand(10000) + 1

I understand that, of the 10000 numbers, N will be processed in parallel by a single core, then the next N, and so on. That is vectorisation: it takes a single instruction and applies it to many data points on a single core (this single-core part is what contrasts vectorisation with parallelization).

How do I find out that N for my machine?

figs_and_nuts
  • Check whether your CPU supports AVX (32-byte vectors) or AVX-512 (64-byte vector registers); a sketch of how to query this from Python follows these comments. Also keep in mind that there's [instruction-level parallelism](https://softwareengineering.stackexchange.com/a/350024/165079) within one core, e.g. two SIMD FMA units per core, so for example Intel since Haswell (2013) can start 2x 32-byte vector FMAs (2x 8 floats or 2x 4 doubles) per clock cycle. [FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2](https://stackoverflow.com/q/15655835) / [Unrolling FP loops with multiple accumulators](https://stackoverflow.com/q/45113527) – Peter Cordes Mar 18 '23 at 10:46
  • For client CPUs like Ice Lake / Tiger Lake laptops with AVX-512, they'll only be capable of one 512-bit FMA per clock or two 256-bit FMAs. So the actual total width of all the FMA units in your machine would be 512 bits (64 bytes of floats or doubles). Similarly, AMD Zen 1 and Zen 4 have half-width execution units vs. the widest SIMD instruction set they support (AVX1 and AVX-512 respectively). – Peter Cordes Mar 18 '23 at 10:50
  • In your expression, the `random` is probably run one element at a time, calling the underlying random generator sequentially. Adding `1` to the resulting array might be done in larger blocks; that, I imagine, depends on the details of the compilation, i.e. whether it's been optimized for processors such as yours. The key part of `numpy vectorization` is working with compiled code, not CPU-level parallelism. – hpaulj Mar 18 '23 at 15:24
  • @hpaulj: I'd hope that NumPy's PRNG is vectorized; that's usually pretty easy to do (running the same PRNG with a different seed in each vector element), e.g. [What's the fastest way to generate a 1 GB text file containing random digits?](https://unix.stackexchange.com/a/324520) shows an AVX2 xorshift+. In that case the entropy is chopped up into ASCII digits, but you could equally use it to generate uniform random integers or floats in 32- or 64-bit element sizes. But if NumPy has any guarantee of repeatability with the same seed, it might have to fully serialize. – Peter Cordes Mar 19 '23 at 06:23
  • @hpaulj: SIMD parallelism is potentially another factor of 2 to 8 or maybe 16 in throughput for 32-bit elements, if memory bandwidth isn't a bottleneck, on top of avoiding CPython interpreter overhead (a factor of maybe 100 or so). So yeah, interpreter overhead is the big thing that makes NumPy have usable performance *at all*, but it would be a mistake to discount SIMD, especially for medium-sized arrays that can be hot in cache. That's how NumPy can (sometimes) keep up with well-optimized C or C++ instead of still being an order of magnitude behind. – Peter Cordes Mar 19 '23 at 06:27
  • Thank you so much for your help, guys. I didn't think the rabbit hole went this deep. I expected a straightforward answer, like the one I can find for checking the number of cores in my machine. Going through your comments with the help of the internet is amazing :) – figs_and_nuts Mar 19 '23 at 13:52
  • @PeterCordes NumPy cannot easily change the random number generator algorithm, since it would break compatibility with previous versions. That being said, it seems there is big room for improvement, since Cython overheads clearly take most of the time (which is really sad for such a mainstream function). – Jérôme Richard Mar 19 '23 at 13:56
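As a concrete follow-up to the first comment, here is a minimal sketch of how to query the SIMD features NumPy detected at runtime and derive N per register. It assumes NumPy ≥ 1.20 on x86-64; note that __cpu_features__ is a private NumPy attribute that may change between versions, and, as the comments explain, multiple SIMD units per core mean the per-cycle N can be a multiple of this per-register N.

import numpy as np

# NumPy >= 1.20 records which SIMD features it detected at runtime.
# Caveat: this lives in a private module, so it may move between versions.
from numpy.core._multiarray_umath import __cpu_features__

if __cpu_features__.get("AVX512F"):
    width_bits = 512
elif __cpu_features__.get("AVX"):
    width_bits = 256
else:
    width_bits = 128  # SSE2 is the x86-64 baseline

for dtype in (np.float32, np.float64):
    n = width_bits // (8 * np.dtype(dtype).itemsize)
    print(f"{np.dtype(dtype).name}: {n} elements per SIMD register")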

1 Answer


TL;DR: np.random.rand is vectorized (it runs in compiled C code) but scalar and inefficient at the instruction level, while array + 1 is both vectorized and makes use of wide SIMD instructions.


Meaning of "vectorization"

First of all, Numpy vectorization is defined as follows:

NumPy hands off array processing to C, where looping and computation are much faster than in Python. To exploit this, programmers using NumPy eliminate Python loops in favor of array-to-array operations. Vectorization can refer both to the C offloading and to structuring NumPy code to leverage it.

More generally in Python, vectorization means that the code calls functions that are natively compiled (typically in C/C++), not that the code is particularly optimized to efficiently use CPU features (as indicated by @hpaulj in the comments).

This term has a different meaning for people working on compilers or using native languages like C/C++: in that context, vectorization means using SIMD instructions.
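To make the Numpy meaning concrete, here is a minimal illustration: the same element-wise addition written as a vectorized call and as an interpreted Python loop.

import numpy as np

a = np.random.rand(10_000)

# "Vectorized" in the Numpy sense: one call, the loop runs in compiled C.
b = a + 1

# Non-vectorized equivalent: the loop runs in the CPython interpreter,
# paying interpreter overhead on every element.
c = np.empty_like(a)
for i in range(a.shape[0]):
    c[i] = a[i] + 1

assert np.array_equal(b, c)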


np.random.rand uses scalar, unoptimized code

The following analysis was done on Linux with Numpy 1.24.2 (from standard packages) and CPython 3.11.2, running on an x86-64 i5-9600KF processor.

np.random.rand(10000) is vectorized (it calls C code to do the actual computation), but it does not use SIMD instructions. In fact, it is currently not particularly well optimized in C either.

A low-level profiling shows that a significant part of the time is spent in two functions: mt19937_gen (~13%) and random_standard_uniform_fill (~6%). Most of the time seems to be lost in Cython checks, overheads and many small functions overall (~81%). mt19937 is the well-known 32-bit Mersenne Twister generator. The code appears to come from Cython code (possibly calling the C++ standard library).

Most of the time spent in mt19937_gen is located in the following assembly loop:

loop:
    and    esi,0x80000000              # y = mt[i] & UPPER_MASK
    add    rdx,0x4
    mov    ecx,esi
    mov    esi,DWORD PTR [rdx]         # load mt[i+1]
    mov    eax,esi
    and    eax,0x7fffffff              # y |= mt[i+1] & LOWER_MASK
    or     eax,ecx
    mov    ecx,eax
    and    eax,0x1                     # 0 or -1 mask from the low bit of y
    neg    eax
    shr    ecx,1                       # y >> 1
    xor    ecx,DWORD PTR [rdx+0x630]   # ^ mt[i+397]
    and    eax,0x9908b0df              # Mersenne Twister constant MATRIX_A
    xor    eax,ecx
    mov    DWORD PTR [rdx-0x4],eax     # store new mt[i]
    cmp    rdx,rdi
    jne    loop

This is a pure scalar integer loop: there are no SIMD instructions. It does not seem to be particularly optimized at first glance. Note that the processor can still execute many instructions in parallel (an IPC of up to 5 or 6 in code without bottlenecks or stalls), thanks to ILP (instruction-level parallelism, which a single CPU core can find and exploit on its own).
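For reference, the constants in this loop (0x80000000, 0x7fffffff, 0x9908b0df, and the 0x630 byte offset, i.e. 396 words past mt[i+1]) identify it as the MT19937 "twist" that refreshes the generator's 624-word state. A pure-Python sketch of the textbook recurrence (not Numpy's exact source) makes the data flow easier to follow:

N, M = 624, 397            # MT19937 state size and shift
MATRIX_A = 0x9908B0DF      # the constant visible in the assembly
UPPER, LOWER = 0x80000000, 0x7FFFFFFF

def twist(mt):
    # Refresh all 624 32-bit state words in place, one word per
    # iteration, matching the scalar loop above.
    for i in range(N):
        y = (mt[i] & UPPER) | (mt[(i + 1) % N] & LOWER)
        mt[i] = mt[(i + M) % N] ^ (y >> 1) ^ (MATRIX_A if y & 1 else 0)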

The other function spends most of its time calling another one (likely the previous one):

loop:
    mov   rdi,QWORD PTR [rbp+0x0]
    call  QWORD PTR [rbp+0x18]            <---  very slow
    movsd QWORD PTR [r13+rbx*8+0x0],xmm0
    add   rbx,0x1
    cmp   r12,rbx
    jne   loop                            <---  quite slow

This loop is clearly inefficient and not optimized at all. It is likely the one iterating over the array: it calls mt19937_gen 10,000 times so as to generate a random number stored in each item of the array (as indicated by @hpaulj in the comments).

Note that none of the other functions called from the Cython code makes intensive use of SIMD instructions either (this can be seen using hardware performance counters).
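As noted in the comments, the legacy np.random.rand API is locked to MT19937 for backward-compatibility reasons. The newer Generator API (Numpy ≥ 1.17) uses PCG64 and fills the array in a tighter compiled loop, so it is often faster in practice. A quick sketch to compare on your machine (timings will vary):

import timeit
import numpy as np

n = 10_000
rng = np.random.default_rng()  # PCG64-backed Generator

t_legacy = timeit.timeit(lambda: np.random.rand(n), number=1_000)
t_newapi = timeit.timeit(lambda: rng.random(n), number=1_000)
print(f"legacy np.random.rand: {t_legacy * 1e3:.1f} ms")
print(f"Generator.random     : {t_newapi * 1e3:.1f} ms")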


Basic element-wise operations are optimized using SIMD instructions

The expression array + 1 is optimized using SIMD instructions. More specifically, it uses AVX on x86-64 processors (if available on your machine, otherwise SSE). AVX is able to operate on 4 double-precision numbers per instruction (a 256-bit register holds 256 / 64 = 4 doubles). Modern mainstream processors can typically run 2 AVX instructions per cycle, so 8 double-precision numbers can be computed per cycle.

That being said, note that array + 1 allocates a new temporary array, and this is quite expensive. It is generally better to do the operation in place when you can, using np.add(array, 1, out=array).
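A quick way to see the cost of the temporary is to time both forms (a sketch; absolute numbers depend on your machine and cache sizes):

import timeit
import numpy as np

a = np.random.rand(10_000)

# Allocates a fresh ~80 kB output array on every call.
t_temp = timeit.timeit(lambda: a + 1, number=10_000)

# Writes into the existing buffer instead.
t_inplace = timeit.timeit(lambda: np.add(a, 1, out=a), number=10_000)

print(f"a + 1 (temporary): {t_temp * 1e3:.1f} ms total")
print(f"in-place np.add  : {t_inplace * 1e3:.1f} ms total")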

Peter Cordes
Jérôme Richard