The category of "FP instruction" would normally include loads and stores (like x86 `movsd xmm0, [rdi]`), register copies, bitwise boolean operations, and other things that aren't FP math instructions, because they don't involve any of the hard work of handling the FP sign / exponent / mantissa, or rounding and normalizing the result.
Also, one machine instruction can do more than one FLOP (SIMD and/or FMA).
A program doing FP math will also include some integer instructions for loop overhead, and maybe for array indexing or pointer increments (especially on ISAs like classic MIPS without indexed addressing modes), or extra work if you compile in debug mode; but you asked about "floating point instructions".
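As a concrete illustration (a sketch of mine, not from the question): in a plain C dot product, the compiler has to emit integer instructions for the counter, addressing, and the loop branch alongside the actual FP math.

```c
// Sketch: a scalar dot product. Only the multiply and add are FP math;
// the i++ increment, the compare/branch, and the address arithmetic are
// integer overhead that doesn't count toward FLOPs.
double dot(const double *a, const double *b, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];   // 2 FLOPs per iteration
    return sum;
}
```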
A modern pipelined out-of-order CPU will have some limited number of FP execution units; they take a lot of transistors, unlike a scalar integer adder, so CPUs usually don't have enough back-end FP execution throughput to keep up with the front-end. (e.g. AMD Zen has a 5-instruction / 6-uop wide front-end, but only two SIMD FP add/mul/FMA execution units.)
Some FP workloads will bottleneck on FP throughput, running few enough other instructions that FP operation throughput is the limiting factor. That's true regardless of which ISA you compile for: whether it allows memory source operands for FP mul like x86, or is a load/store ISA (e.g. a RISC) that requires separate load instructions.
FLOPS (FLOPs/second) as a figure of merit tells you the theoretical max FP throughput, if you can keep other instruction overhead low enough for the CPU to actually keep its FP execution units fed with work to do.
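For example (illustrative numbers, not from the question): a hypothetical 4-core CPU at 3.5 GHz with two 256-bit FMA units per core has a theoretical peak of 4 cores × 2 FMAs/clock × 8 floats × 2 FLOPs × 3.5 GHz = 448 single-precision GFLOPS. Real code only approaches that when nearly every issued instruction is an FMA.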
Load and store instructions, and stuff like copying registers, aren't FP math operations, and don't count as FLOPS. Similarly, integer instructions for array index math and loop counters are usually just overhead in FP algorithms. (Some FP code uses sparse arrays stored compactly in data structures that have arrays of integer indices or whatever, so integer work can be part of the "real work" of a program in that case, but it's still not FP math).
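As a sketch of that sparse-array case (my example, assuming the common CSR layout): the integer loads from the index array are part of the real work, but only the multiply and add count as FLOPs.

```c
// Sketch: sparse matrix-vector product in CSR format. The col_idx[] loads
// and row_ptr[] bookkeeping are integer "real work", not FP math.
void spmv(int n, const int *row_ptr, const int *col_idx,
          const double *val, const double *x, double *y) {
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; j++)
            sum += val[j] * x[col_idx[j]];   // 2 FLOPs; indexing is integer work
        y[i] = sum;
    }
}
```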
Conversely, SIMD gets multiple math operations done with a single CPU instruction, allowing lots of work to fit through a pipeline that isn't ridiculously wide. (e.g. x86 `vmulps ymm0, ymm1, [rdi]` loads 32 bytes from memory and does 8 packed single-precision multiplies between that data and the elements of `ymm1`.)
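From C you'd usually get that instruction via auto-vectorization or intrinsics; here's a minimal sketch, assuming AVX:

```c
#include <immintrin.h>

// Sketch: with AVX, a compiler can fold the load into the multiply,
// emitting a single vmulps ymm, ymm, [mem]: one instruction, 8 FLOPs.
__m256 mul8(__m256 a, const float *p) {
    return _mm256_mul_ps(a, _mm256_loadu_ps(p));
}
```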
FMA (fused multiply-add) is normally counted as two FLOPs, although most CPUs that support it natively do it in a single execution unit. For example, Intel since Haswell can start two SIMD FMA operations per clock cycle, each operating on 32 bytes of data (8 floats or 4 doubles). That's 2 FMAs × 8 floats × 2 FLOPs = 32 single-precision FLOPs per clock cycle per core.
(And it has the front-end bandwidth and back-end execution units to also run two non-FMA uops, e.g. loop overhead, storing the FMA results, or whatever, even including SIMD bitwise OR / AND / XOR, e.g. to flip sign bits in float vectors.)
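Here's a minimal sketch of that from C, assuming AVX2+FMA intrinsics (my illustration, not part of the question):

```c
#include <immintrin.h>

// One _mm256_fmadd_ps compiles to a single FMA instruction:
// 8 multiplies + 8 adds = 16 single-precision FLOPs in one uop.
__m256 fma_step(__m256 acc, __m256 a, __m256 b) {
    return _mm256_fmadd_ps(a, b, acc);   // acc = a*b + acc
}
```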
`vxorps` doesn't count as a FLOP because it's just a bitwise operation, not math that has to handle the mantissa and exponent of the inputs and normalize the output. Neither do SIMD vector shuffles.
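For instance, a sign-bit flip like the one mentioned above compiles to a single `vxorps` and does zero FLOPs by this accounting (a hedged sketch, assuming AVX):

```c
#include <immintrin.h>

// Negate 8 floats by XORing the sign bits with -0.0f (0x80000000 pattern).
// Pure bit manipulation: no rounding, no exponent/mantissa handling.
__m256 negate8(__m256 v) {
    return _mm256_xor_ps(v, _mm256_set1_ps(-0.0f));
}
```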
You might count things like x86 `unpcklps xmm1, xmm0` as "floating point instructions". Besides having "packed single" in the mnemonic, there's a performance difference between integer and FP versions of the same shuffle or bitwise operation on some CPUs. For example, Intel Nehalem has 2 cycles of bypass-forwarding latency when an FP-domain instruction reads input from a SIMD-integer instruction like `paddq`.
See Agner Fog's microarch guide, https://agner.org/optimize/.