SIMD instructions on modern CPUs, like AVX `vmulps ymm1, ymm2, ymm3` or SSE2 `pmaddwd xmm0, xmm1`, execute purely within the physical core running that instruction. Each physical core has its own execution resources, including SIMD/FP execution units. (In CPU architecture, scalar FP is normally grouped with SIMD: in modern x86 you actually use scalar versions of SSE2 or AVX instructions on the low element of vector registers to do scalar FP math.)
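For example (a minimal sketch; the exact instructions depend on compiler and flags, and the function name is just for illustration), plain scalar C float math compiles to scalar SSE/AVX instructions on the low element of an XMM register:

```c
/* Compile with e.g. gcc -O2 -S and look at the asm.
   Plain scalar float math uses the low element of XMM registers:
   typically mulss/addss (SSE2), or vmulss/vaddss when compiling with -mavx. */
float scale_add(float x, float a, float b) {
    return x * a + b;   // mulss + addss (or one vfmadd* with FMA enabled)
}
```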
That's why max FLOP/s for the whole chip scales with the number of cores: see FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2.
(There isn't just one "AVX unit" even per core: e.g. Haswell and Zen 2 cores both have two 256-bit-wide FMA units, and can run bitwise boolean vector instructions on even more ports, for even higher per-clock throughput of those instructions.)
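That per-port throughput only materializes if nearby code has enough independent work in flight. A sketch (assuming AVX2+FMA, `n` a multiple of 16, names illustrative): a dot product with a single accumulator is serialized on FMA latency, while two or more accumulators create independent dependency chains that both FMA units can work on in parallel:

```c
#include <immintrin.h>
#include <stddef.h>

/* Two independent accumulators -> two FMA dependency chains that can run
   on the two FMA units in parallel. (Fully hiding the ~4-5 cycle FMA
   latency would take around 8-10 chains.) */
float dot(const float *a, const float *b, size_t n) {
    __m256 acc0 = _mm256_setzero_ps();
    __m256 acc1 = _mm256_setzero_ps();
    for (size_t i = 0; i < n; i += 16) {
        acc0 = _mm256_fmadd_ps(_mm256_loadu_ps(a+i),   _mm256_loadu_ps(b+i),   acc0);
        acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(a+i+8), _mm256_loadu_ps(b+i+8), acc1);
    }
    __m256 acc = _mm256_add_ps(acc0, acc1);
    __m128 s = _mm_add_ps(_mm256_castps256_ps128(acc),
                          _mm256_extractf128_ps(acc, 1));
    s = _mm_hadd_ps(s, s);               // horizontal sum of 4 floats
    s = _mm_hadd_ps(s, s);
    return _mm_cvtss_f32(s);
}
```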
See also Does SIMD require a multi-core CPU? for an explanation of thread-level parallelism, which is different from SIMD (more data per instruction) and instruction-level parallelism (more instructions per cycle). The product of all three of these is overall throughput.
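Putting the three together: peak FLOP/s is just the product of those factors, times clock speed. A back-of-the-envelope sketch (the core count and clock below are made-up illustration numbers, not a claim about any specific part):

```c
#include <stdio.h>

// Peak FP32 FLOP/s = TLP * ILP * SIMD width * FLOPs-per-op * clock.
// Example numbers for a hypothetical 4-core Haswell-like chip at 3 GHz:
int main(void) {
    double cores = 4;          // thread-level parallelism
    double fma_per_clock = 2;  // instruction-level parallelism: 2 FMA units
    double lanes = 8;          // SIMD: 8 floats per 256-bit vector
    double flops_per_fma = 2;  // an FMA counts as a multiply plus an add
    double ghz = 3.0;          // hypothetical clock speed
    printf("peak ~= %.0f GFLOP/s\n",
           cores * fma_per_clock * lanes * flops_per_fma * ghz);  // 384
}
```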
CPUs handle SIMD instructions (almost) exactly like integer instructions such as `add eax, ecx`. This (among other reasons) is why x86 CPUs can efficiently get data between integer and FP registers with pretty low latency, only 1 to 3 core clock cycles for instructions like `cvttss2si eax, xmm0` (float->int with truncation) or `vpmovmskb eax, ymm0` (bitmap of the high bit of each byte). https://uops.info/ and https://agner.org/optimize/ have more detail on performance numbers.
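The two instructions above map directly to intrinsics, `_mm_cvttss_si32` and `_mm256_movemask_epi8`; a minimal sketch:

```c
#include <immintrin.h>

// Both move data from a vector register into an integer register:
int truncate_to_int(__m128 v) {
    return _mm_cvttss_si32(v);       // cvttss2si: float -> int with truncation
}
int high_bit_bitmap(__m256i v) {
    return _mm256_movemask_epi8(v);  // vpmovmskb: bitmap of each byte's high bit
}
```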
See https://www.realworldtech.com/haswell-cpu/4/ for a diagram of the execution units on each execution port in Intel Haswell. Notice that the scalar integer multiplier (`imul`) is on the same port as `vaddps`, so those instructions can't both start executing in the same clock cycle on a given core. (Skylake runs `vaddps` on either of its 2 FMA units.)
For more background on how CPUs work, see Modern Microprocessors: A 90-Minute Guide!
AMD Bulldozer-family: pairs of cores share a SIMD/FP unit (and cache)
In Bulldozer/Piledriver/Steamroller/Excavator, each pair of (weak-ish) integer cores shares a SIMD/FP unit, L1i cache, and L2 cache. This is basically an alternative to SMT (e.g. Intel's Hyper-Threading): it gives somewhat more total throughput with all cores busy, but can't run a single thread as fast as a single wider core.
So it's not really two separate cores in the normal sense, given how tightly coupled they are. But it's not a single core that can run two hardware threads either. It's like Siamese twins sharing part of their body. https://www.realworldtech.com/bulldozer/2/ describes it in more detail.
Bulldozer-family represents a number of experiments in CPU architecture, many of which proved unsuccessful (like a write-through L1d cache with a small 4k write-combining buffer). AMD Zen went back to a more conventional design, a lot like Intel's: a fully separate wide core with SMT, allowing both high single-thread performance and running lots of threads with good aggregate throughput, plus a more conventional cache hierarchy with a normal write-back L1d cache.

Zen keeps AMD's separation of the SIMD/FP vs. scalar-integer parts of the pipeline, unlike Intel's more unified scheduler and execution ports. Zen 1 even kept AMD's usual technique of splitting 256-bit instructions into 2 uops, until Zen 2 widened the execution units. (Intel did this kind of splitting for SSE on early CPUs like Pentium III and Pentium-M, but hasn't since Core 2: full-width execution units for any SIMD extensions they support.)
SIMD / FP instructions on Bulldozer have higher latency (a minimum of 2 cycles even for stuff like `pxor xmm0, xmm1`), but that may be due to Bulldozer's "speed demon" approach to clocking higher. Latency to get data between integer and FP registers is especially bad, like 10 cycles. (But normally you aren't bouncing data back and forth all the time, and using integer regs in addressing modes for FP loads is fine; that latency is not the major or only reason Bulldozer-family CPUs were relatively slow.)
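For instance, an indexed FP load keeps the index in an integer register and the data in an FP register, so nothing crosses between the two register files; a sketch (function name is illustrative):

```c
#include <stddef.h>

/* The integer index stays in an integer register and is consumed by the
   load's addressing mode; the float data never passes through an integer
   register, so no int<->FP transfer penalty applies. */
float gather_sum(const float *a, const int *idx, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[idx[i]];   // e.g. addss xmm0, [rdi + rax*4]
    return sum;
}
```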
So it's not like `rdrand eax`, which has to pull data from a randomness source shared by all cores. That makes it very slow compared to normal instructions (like 200 cycles on Ivy Bridge, more like a cache-miss load) because it has to go off-core, and because it's not used often enough to justify building even more hardware to make it faster (e.g. buffering randomness in each core). (What is the latency and throughput of the RDRAND instruction on Ivy Bridge? has an answer from David Johnston, who worked on it at Intel.)
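(For completeness: from C you'd normally reach `rdrand` via the `_rdrand32_step()` intrinsic from `<immintrin.h>`, which also exposes the instruction's carry-flag success result. A minimal retry-loop sketch, compiled with `-mrdrnd`:)

```c
#include <immintrin.h>

// rdrand clears CF on (rare) failure; the intrinsic returns that flag.
// Retrying a few times is the usual pattern.
int get_hw_random(unsigned int *out) {
    for (int tries = 0; tries < 10; tries++)
        if (_rdrand32_step(out))
            return 1;   // success: *out holds 32 random bits
    return 0;           // hardware RNG kept failing
}
```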