SIMD instructions on modern CPUs, like AVX `vmulps ymm1, ymm2, ymm3` or SSE2 `pmaddwd xmm0, xmm1`, execute purely within the physical core running that instruction. Each physical core has its own execution resources, including SIMD/FP execution units. (In CPU architecture, scalar FP is normally grouped with SIMD: in modern x86 you actually use scalar versions of SSE2 or AVX instructions on the low element of vector registers to do scalar FP math.)
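For example (a minimal sketch; the exact instructions depend on compiler and flags, and the function name is just for illustration), plain scalar C float math compiles to scalar SSE/AVX instructions on the low element of an XMM register:

```c
/* Compile with e.g. gcc -O2 -S and look at the asm.
   Plain scalar float math uses the low element of XMM registers:
   typically mulss/addss (SSE2), or vmulss/vaddss when compiling with -mavx. */
float scale_add(float x, float a, float b) {
    return x * a + b;   // mulss + addss (or one vfmadd* with FMA enabled)
}
```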
That's why max FLOP/s for the whole chip scales with the number of cores: see FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2.
(There isn't just one "AVX unit" even per core: e.g. Haswell and Zen 2 cores both have two 256-bit-wide FMA units, and can run bitwise boolean vector instructions on even more ports, for even higher per-clock throughput of those instructions.)
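That per-port throughput only materializes if nearby code has enough independent work in flight. A sketch (assuming AVX2+FMA, `n` a multiple of 16, names illustrative): a dot product with a single accumulator is serialized on FMA latency, while two or more accumulators create independent dependency chains that both FMA units can work on in parallel:

```c
#include <immintrin.h>
#include <stddef.h>

/* Two independent accumulators -> two FMA dependency chains that can run
   on the two FMA units in parallel. (Fully hiding the ~4-5 cycle FMA
   latency would take around 8-10 chains.) */
float dot(const float *a, const float *b, size_t n) {
    __m256 acc0 = _mm256_setzero_ps();
    __m256 acc1 = _mm256_setzero_ps();
    for (size_t i = 0; i < n; i += 16) {
        acc0 = _mm256_fmadd_ps(_mm256_loadu_ps(a+i),   _mm256_loadu_ps(b+i),   acc0);
        acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(a+i+8), _mm256_loadu_ps(b+i+8), acc1);
    }
    __m256 acc = _mm256_add_ps(acc0, acc1);
    __m128 s = _mm_add_ps(_mm256_castps256_ps128(acc),
                          _mm256_extractf128_ps(acc, 1));
    s = _mm_hadd_ps(s, s);               // horizontal sum of 4 floats
    s = _mm_hadd_ps(s, s);
    return _mm_cvtss_f32(s);
}
```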
See also Does SIMD require a multi-core CPU? for an explanation of thread-level parallelism, which is different from SIMD (more data per instruction) and instruction-level parallelism (more instructions per cycle). The product of all three of these is overall throughput.
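Putting the three together: peak FLOP/s is just the product of those factors, times clock speed. A back-of-the-envelope sketch (the core count and clock below are made-up illustration numbers, not a claim about any specific part):

```c
#include <stdio.h>

// Peak FP32 FLOP/s = TLP * ILP * SIMD width * FLOPs-per-op * clock.
// Example numbers for a hypothetical 4-core Haswell-like chip at 3 GHz:
int main(void) {
    double cores = 4;          // thread-level parallelism
    double fma_per_clock = 2;  // instruction-level parallelism: 2 FMA units
    double lanes = 8;          // SIMD: 8 floats per 256-bit vector
    double flops_per_fma = 2;  // an FMA counts as a multiply plus an add
    double ghz = 3.0;          // hypothetical clock speed
    printf("peak ~= %.0f GFLOP/s\n",
           cores * fma_per_clock * lanes * flops_per_fma * ghz);  // 384
}
```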
CPUs handle SIMD instructions (almost) exactly like integer instructions such as `add eax, ecx`. This (among other reasons) is why x86 CPUs can efficiently get data between integer and FP registers with pretty low latency, only 1 to 3 core clock cycles for instructions like `cvttss2si eax, xmm0` (float->int with truncation) or `vpmovmskb eax, ymm0` (bitmap of the high bit of each byte). https://uops.info/ and https://agner.org/optimize/ have more detail on performance numbers.
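The two instructions above map directly to intrinsics, `_mm_cvttss_si32` and `_mm256_movemask_epi8`; a minimal sketch:

```c
#include <immintrin.h>

// Both move data from a vector register into an integer register:
int truncate_to_int(__m128 v) {
    return _mm_cvttss_si32(v);       // cvttss2si: float -> int with truncation
}
int high_bit_bitmap(__m256i v) {
    return _mm256_movemask_epi8(v);  // vpmovmskb: bitmap of each byte's high bit
}
```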
See https://www.realworldtech.com/haswell-cpu/4/ for a diagram of the execution units on each execution port in Intel Haswell. Notice that the scalar integer multiplier (`imul`) is on the same port as `vaddps`, so those instructions can't both start executing in the same clock cycle on a given core. (Skylake runs `vaddps` on either of its 2 FMA units.)
For more background on how CPUs work, see Modern Microprocessors: A 90-Minute Guide!
AMD Bulldozer-family: pairs of cores share a SIMD/FP unit (and cache)
In Bulldozer/Piledriver/Steamroller/Excavator, each pair of (weak-ish) integer cores shares a SIMD/FP unit, L1i cache, and L2 cache. This is basically an alternative to SMT (e.g. Intel's Hyper-Threading): it gives somewhat more total throughput with all cores busy, but can't run a single thread as fast as a single wider core.
So it's not really two separate cores in the normal sense, given how tightly coupled they are. But it's not a single core that can run two hardware threads either. It's like Siamese twins sharing part of their body. https://www.realworldtech.com/bulldozer/2/ describes it in more detail.
Bulldozer-family represents a number of experiments in CPU architecture, many of which proved unsuccessful (like a write-through L1d cache with a small 4k write-combining buffer). AMD Zen went back to a more conventional design, a lot like Intel's: a fully separate wide core with SMT, allowing both high single-thread performance and running lots of threads with good aggregate throughput, plus a more conventional cache hierarchy with a normal write-back L1d cache.

Zen keeps AMD's separation of the SIMD/FP vs. scalar-integer parts of the pipeline, unlike Intel's more unified scheduler and execution ports. Zen 1 even kept AMD's usual technique of splitting 256-bit instructions into 2 uops, until Zen 2 widened the execution units. (Intel did this kind of splitting for SSE on early CPUs like Pentium III and Pentium-M, but hasn't since Core 2: full-width execution units for any SIMD extensions they support.)
SIMD / FP instructions on Bulldozer have higher latency (a minimum of 2 cycles even for stuff like `pxor xmm0, xmm1`), but that may be due to Bulldozer's "speed demon" approach to clocking higher. Latency to get data between integer and FP registers is especially bad, like 10 cycles. (But normally you aren't bouncing data back and forth all the time, and using integer regs in addressing modes for FP loads is fine; that latency is not the major or only reason Bulldozer-family CPUs were relatively slow.)
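For instance, an indexed FP load keeps the index in an integer register and the data in an FP register, so nothing crosses between the two register files; a sketch (function name is illustrative):

```c
#include <stddef.h>

/* The integer index stays in an integer register and is consumed by the
   load's addressing mode; the float data never passes through an integer
   register, so no int<->FP transfer penalty applies. */
float gather_sum(const float *a, const int *idx, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[idx[i]];   // e.g. addss xmm0, [rdi + rax*4]
    return sum;
}
```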
So it's not like `rdrand eax`, which has to pull data from a randomness source shared by all cores. That makes it very slow compared to normal instructions (like 200 cycles on Ivy Bridge, more like a cache-miss load) because it has to go off-core, and because it's not used often enough to justify building even more hardware to make it faster (e.g. buffering randomness in each core). (What is the latency and throughput of the RDRAND instruction on Ivy Bridge? has an answer from David Johnston, who worked on it at Intel.)
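(For completeness: from C you'd normally reach `rdrand` via the `_rdrand32_step()` intrinsic from `<immintrin.h>`, which also exposes the instruction's carry-flag success result. A minimal retry-loop sketch, compiled with `-mrdrnd`:)

```c
#include <immintrin.h>

// rdrand clears CF on (rare) failure; the intrinsic returns that flag.
// Retrying a few times is the usual pattern.
int get_hw_random(unsigned int *out) {
    for (int tries = 0; tries < 10; tries++)
        if (_rdrand32_step(out))
            return 1;   // success: *out holds 32 random bits
    return 0;           // hardware RNG kept failing
}
```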