The category of "FP instruction" would normally include loads and stores (like x86 `movsd xmm0, [rdi]`), register copies, bitwise boolean operations, and other things that aren't FP math instructions, because they don't involve any of the hard work of handling the FP sign / exponent / mantissa, or rounding and normalizing the result.
Also, one machine instruction can do more than one FLOP (SIMD and/or FMA).
A program doing FP math will also include some integer instructions for loop overhead, and maybe for array indexing or pointer increments (especially on ISAs like classic MIPS without indexed addressing modes), or extra work if you compile in debug mode; but you asked about "floating point instructions".
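As a concrete illustration (a sketch of mine, not from the question): in a plain C dot product, the compiler has to emit integer instructions for the counter, addressing, and the loop branch alongside the actual FP math.

```c
// Sketch: a scalar dot product. Only the multiply and add are FP math;
// the i++ increment, the compare/branch, and the address arithmetic are
// integer overhead that doesn't count toward FLOPs.
double dot(const double *a, const double *b, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];   // 2 FLOPs per iteration
    return sum;
}
```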
A modern pipelined out-of-order CPU will have some limited number of FP execution units; they take a lot of transistors, unlike a scalar integer adder, so CPUs usually don't have enough back-end FP execution throughput to keep up with the front-end. (e.g. AMD Zen has a 5-instruction / 6-uop wide front-end, but only two SIMD FP add/mul/FMA execution units.)
Some FP workloads will bottleneck on FP throughput, running few enough other instructions that FP operation throughput is the limiting factor. That's true regardless of which ISA you compile for: whether it allows memory source operands for FP mul like x86, or is a load/store ISA (e.g. a RISC) that requires separate load instructions.
FLOPS (FLOPs/second) as a figure of merit tells you the theoretical max FP throughput, if you can keep other instruction overhead low enough for the CPU to actually keep its FP execution units fed with work to do.
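For example (illustrative numbers, not from the question): a hypothetical 4-core CPU at 3.5 GHz with two 256-bit FMA units per core has a theoretical peak of 4 cores × 2 FMAs/clock × 8 floats × 2 FLOPs × 3.5 GHz = 448 single-precision GFLOPS. Real code only approaches that when nearly every issued instruction is an FMA.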
Load and store instructions, and stuff like copying registers, aren't FP math operations, and don't count as FLOPS. Similarly, integer instructions for array index math and loop counters are usually just overhead in FP algorithms. (Some FP code uses sparse arrays stored compactly in data structures that have arrays of integer indices or whatever, so integer work can be part of the "real work" of a program in that case, but it's still not FP math).
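As a sketch of that sparse-array case (my example, assuming the common CSR layout): the integer loads from the index array are part of the real work, but only the multiply and add count as FLOPs.

```c
// Sketch: sparse matrix-vector product in CSR format. The col_idx[] loads
// and row_ptr[] bookkeeping are integer "real work", not FP math.
void spmv(int n, const int *row_ptr, const int *col_idx,
          const double *val, const double *x, double *y) {
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; j++)
            sum += val[j] * x[col_idx[j]];   // 2 FLOPs; indexing is integer work
        y[i] = sum;
    }
}
```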
Conversely, SIMD gets multiple math operations done with a single CPU instruction, allowing lots of work to fit through a pipeline that isn't ridiculously wide. (e.g. x86 `vmulps ymm0, ymm1, [rdi]` loads 32 bytes from memory and does 8 packed single-precision multiplies between that data and the elements of `ymm1`.)
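From C you'd usually get that instruction via auto-vectorization or intrinsics; here's a minimal sketch, assuming AVX:

```c
#include <immintrin.h>

// Sketch: with AVX, a compiler can fold the load into the multiply,
// emitting a single vmulps ymm, ymm, [mem]: one instruction, 8 FLOPs.
__m256 mul8(__m256 a, const float *p) {
    return _mm256_mul_ps(a, _mm256_loadu_ps(p));
}
```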
FMA (fused multiply-add) is normally counted as two FLOPs, although most CPUs that support it natively do it in a single execution unit. For example, Intel since Haswell can start two SIMD FMA operations per clock cycle, each operating on 32 bytes of data (8 floats or 4 doubles). That's 2 FMAs × 8 floats × 2 FLOPs = 32 single-precision FLOPs per clock cycle per core.
(And it has the front-end bandwidth and back-end execution units to also run two non-FMA uops, e.g. loop overhead, storing the FMA results, or whatever, even including SIMD bitwise OR / AND / XOR, e.g. to flip sign bits in float vectors.)
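Here's a minimal sketch of that from C, assuming AVX2+FMA intrinsics (my illustration, not part of the question):

```c
#include <immintrin.h>

// One _mm256_fmadd_ps compiles to a single FMA instruction:
// 8 multiplies + 8 adds = 16 single-precision FLOPs in one uop.
__m256 fma_step(__m256 acc, __m256 a, __m256 b) {
    return _mm256_fmadd_ps(a, b, acc);   // acc = a*b + acc
}
```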
`vxorps` doesn't count as a FLOP because it's just a bitwise operation, not math that has to handle the mantissa and exponent of the inputs and normalize the output. Neither do SIMD vector shuffles.
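For instance, a sign-bit flip like the one mentioned above compiles to a single `vxorps` and does zero FLOPs by this accounting (a hedged sketch, assuming AVX):

```c
#include <immintrin.h>

// Negate 8 floats by XORing the sign bits with -0.0f (0x80000000 pattern).
// Pure bit manipulation: no rounding, no exponent/mantissa handling.
__m256 negate8(__m256 v) {
    return _mm256_xor_ps(v, _mm256_set1_ps(-0.0f));
}
```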
You might count things like x86 `unpcklps xmm1, xmm0` as "floating point instructions". Besides having "packed single" in the mnemonic, there's a performance difference between integer and FP versions of the same shuffle or bitwise operation on some CPUs. For example, Intel Nehalem has 2 cycles of bypass-forwarding latency when an FP-domain instruction reads input from a SIMD-integer instruction like `paddq`.
See Agner Fog's microarch guide, https://agner.org/optimize/.