How many ALU operations can run in parallel on modern x86_64?

Question

How many ALU operations can run in parallel on modern x86(_64) CPUs when no instruction depends on another? This obviously depends on the specific model, but where can I find such information? The Wikipedia page on Sandy Bridge says it has 3 integer ALUs per core. Does this mean it can run 3 integer arithmetic instructions at the same time?

My question is about code like the following, for example:

add r10, r11
mov [...], r10
add r12, r13
mov [...], r12
add r14, r15
mov [...], r14

add rax, r11
mov [...], rax
add rax, r13
mov [...], rax
add rax, r15
mov [...], rax

I may be wrong, but my understanding is that the first 3 adds can run in parallel while the following adds have to wait until the previous operations have finished.
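To make that expectation concrete, here is a minimal sketch of a toy dataflow model. It is not a description of any real scheduler. The function `earliest_cycles`, the 1-cycle `add` latency, the default of 3 identical ALU ports, and the choice to ignore the stores are all assumptions made for illustration.

```python
# Toy model only: each instruction is (dest, [sources]).
# Assumptions (not from the question): every add has 1-cycle latency,
# all register inputs are ready at cycle 0, stores are ignored, and the
# core has `alu_ports` identical integer ALUs.

def earliest_cycles(instrs, alu_ports=3):
    """Return the earliest cycle in which each instruction can execute."""
    ready_at = {}   # register -> cycle in which its newest value is available
    used = {}       # cycle -> number of ALU ports already taken
    cycles = []
    for dest, srcs in instrs:
        # An instruction cannot start before all of its sources are ready.
        cycle = max((ready_at.get(r, 0) for r in srcs), default=0)
        # It also needs a free ALU port in that cycle.
        while used.get(cycle, 0) >= alu_ports:
            cycle += 1
        used[cycle] = used.get(cycle, 0) + 1
        ready_at[dest] = cycle + 1
        cycles.append(cycle)
    return cycles

# First group: three adds with no registers in common.
independent = [("r10", ["r10", "r11"]),
               ("r12", ["r12", "r13"]),
               ("r14", ["r14", "r15"])]
# Second group: three adds that all read and write rax.
chained = [("rax", ["rax", "r11"]),
           ("rax", ["rax", "r13"]),
           ("rax", ["rax", "r15"])]

print(earliest_cycles(independent))  # [0, 0, 0]
print(earliest_cycles(chained))      # [0, 1, 2]
```

In this model the independent adds all land in the same cycle, while the adds through `rax` form a dependency chain and take one cycle each.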

See https://uops.info/ for exact details on throughput and latency. (And https://agner.org/optimize/ for pipeline details relevant to getting more un-executed uops into the scheduler(s), especially on AMD CPUs where integer and FP/SIMD ALUs have separate ports and schedulers so after a cache miss, you might have 8 uops all dispatch to ports at once.) Also related: [Can x86's MOV really be "free"? Why can't I reproduce this at all?](https://stackoverflow.com/q/44169342) re: `mov` taking an execution unit or not on SnB vs. IvB where there are fewer integer ALUs than sustained front-end bandwidth — Peter Cordes, Jan 15 '22 at 20:24
Intel Alder Lake has even wider cores, like 5 ALU ports and a front-end that can keep up, on the P-cores. Even the E-cores have 4/clock throughput for `add`, like earlier mainstream performance cores such as Ice Lake and Zen3. — Peter Cordes, Jan 15 '22 at 20:29
But yes, if those 5 `add` and `mov` instructions were all in the scheduler at once on SnB when `r10..r15` became ready, those 3 `add` instructions would dispatch in parallel if there weren't older uops waiting to run ([How are x86 uops scheduled, exactly?](https://stackoverflow.com/q/40681331)). The front-end (issue/rename) on SnB is only 4-wide, so if this instruction sequence was getting issued after an I-cache miss or an `lfence` or something, only the first 4 uops could get into the back-end in the first cycle. — Peter Cordes, Jan 15 '22 at 20:31
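The front-end limit mentioned in the last comment can be illustrated with a small sketch. The helper `issue_cycle_of_adds` is hypothetical, and it assumes each instruction counts as exactly one uop for issue purposes, which may not hold for every store addressing mode. Only the 4-wide issue width is taken from the comment.

```python
# Toy front-end model: uops enter the back-end in program order,
# at most `issue_width` per cycle.
# Assumption: every instruction is a single uop at issue.

def issue_cycle_of_adds(program, issue_width=4):
    """Return the cycle in which each `add` in `program` is issued."""
    return [i // issue_width for i, op in enumerate(program) if op == "add"]

# The first group from the question: add/mov pairs, interleaved.
program = ["add", "mov", "add", "mov", "add", "mov"]

print(issue_cycle_of_adds(program))                 # [0, 0, 1]
print(issue_cycle_of_adds(program, issue_width=6))  # [0, 0, 0]
```

Under these assumptions, when the sequence is issued into an empty back-end, the third `add` arrives one cycle after the first two, so the three adds can only dispatch together if they were already waiting in the scheduler.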
