
I'm trying to optimize my code with SIMD (on ARM CPUs), and I want to know its arithmetic intensity (FLOPs/byte, AI) and FLOPS.

In order to calculate AI and FLOPS, I have to count the number of floating-point operations (FLOPs). However, I can't find any precise definition of FLOPs.
Of course, mul, add, sub, and div are clearly FLOPs, but what about move operations, shuffle operations (e.g. `_mm_shuffle_ps`), set operations (e.g. `_mm_set1_ps`), conversion operations (e.g. `_mm_cvtps_pi32`), etc.?
They're operations that deal with floating-point values. Should I count them as FLOPs? If not, why not?
Which operations do profilers like Intel VTune and NVIDIA's nvprof, or hardware PMUs, usually count?
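
For concreteness, here's a toy scalar kernel (saxpy, just an illustration, not my real code) showing how the answer changes the numbers I'd compute:

```c
/* saxpy: y[i] = a*x[i] + y[i]
   If only mul/add count, that's 2 FLOPs per element, and 12 bytes
   moved per element (load x, load y, store y), so AI = 2/12 ≈ 0.17
   FLOPs/byte. If other FP-touching ops counted too, both numbers
   would shift. */
void saxpy(float a, const float *x, float *y, int n) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```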

EDIT:
[What all operations does FLOPS include?](https://stackoverflow.com/questions/29428812/what-all-operations-does-flops-include)
The linked question is mainly about mathematically complex operations.
I also want to know the standard way to deal with "non-mathematical" operations that take floating-point values or vectors as inputs.

wanwan
  • `mul`, `add`, `sub` and `div` are **not** floating-point operations. They operate on integers. The FLOPs end in `ps` or `sd`, etc. – OrangeDog Sep 10 '18 at 13:36
  • Possible duplicate of [What all operations does FLOPS include?](https://stackoverflow.com/questions/29428812/what-all-operations-does-flops-include) – OrangeDog Sep 10 '18 at 13:39
  • I should have said `*`, `+`, `-`, `/`; I'm asking a more general question about FLOPs. – wanwan Sep 10 '18 at 13:46
  • Are you asking whether non-floating-point machine code, like soft-float routines, is counted as floating-point operations? – old_timer Sep 10 '18 at 15:18
  • No, my question is simple: for example, `_mm_shuffle_ps` takes two **floating point** vectors as inputs, so it's a floating-point operation. Is this right? – wanwan Sep 10 '18 at 16:28
  • Normally shuffle / blend / AND/OR/XOR on FP values are not considered FLOPs. FP absolute value using AND could be justified (but normally wouldn't be counted). Shuffle/blend are just overhead of using SIMD on not purely "vertical" problems, or problems with branching. – Peter Cordes Sep 10 '18 at 18:25
  • @OrangeDog: I think it's not a duplicate because that question is too vague and isn't considering SIMD at all. An answer that answers this wouldn't really be appropriate there. – Peter Cordes Sep 10 '18 at 18:28

2 Answers


Shuffle / blend on FP values are not considered FLOPs. They are just overhead of using SIMD on not purely "vertical" problems, or for problems with branching that you do branchlessly with a blend.
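
For example, in a horizontal sum only the adds do FP math; the shuffles just move data into place. A minimal SSE3 sketch (`hsum_ps` is an illustrative name):

```c
#include <immintrin.h>

/* Horizontal sum of a __m128: the movehdup/movehl shuffles only move
   data and are not counted as FLOPs; only the two adds are. */
static inline float hsum_ps(__m128 v) {
    __m128 shuf = _mm_movehdup_ps(v);       /* duplicate odd-index lanes */
    __m128 sums = _mm_add_ps(v, shuf);      /* FLOPs */
    shuf        = _mm_movehl_ps(shuf, sums);
    sums        = _mm_add_ss(sums, shuf);   /* FLOP */
    return _mm_cvtss_f32(sums);
}
```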

Neither are FP AND/OR/XOR. You could try to justify counting FP absolute value using `andps` (`_mm_and_ps`), but normally it's not counted. FP abs doesn't require looking at the exponent / significand, or normalizing the result, or any of the things that make FP execution units expensive. abs (AND) / sign-flip (XOR) or make negative (OR) are trivial bitwise ops.
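
For instance, the usual bit-trick for FP absolute value is a single AND that clears the sign bit (a sketch; `abs_ps` is just an illustrative name):

```c
#include <immintrin.h>

/* FP absolute value by clearing the sign bit of each element. No
   exponent/significand handling or normalization is needed, which is
   why this is usually not counted as a FLOP. */
static inline __m128 abs_ps(__m128 x) {
    const __m128 sign_mask = _mm_set1_ps(-0.0f);  /* only the sign bit set */
    return _mm_andnot_ps(sign_mask, x);           /* x & ~sign_mask */
}
```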


FMA is normally counted as two floating point ops (the mul and add), even though it's a single instruction with the same (or similar) performance to SIMD FP add or mul. The most important problem that bottlenecks on raw FLOP/s is matmul, which does need an equal mix of mul and add, and can take advantage of FMA perfectly.
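
A sketch of why matmul-style code maps perfectly onto FMA (assumes AVX2+FMA and `n` a multiple of 8; `dot` is an illustrative name):

```c
#include <immintrin.h>

/* Dot product: each _mm256_fmadd_ps is one instruction but is
   conventionally counted as 8 lanes * 2 FLOPs = 16 FLOPs, so the
   loop does about 2*n FLOPs total. */
float dot(const float *a, const float *b, int n) {
    __m256 acc = _mm256_setzero_ps();
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        acc = _mm256_fmadd_ps(va, vb, acc);   /* 16 FLOPs by convention */
    }
    /* reduce the 8 partial sums: the extract/hadd shuffling is not
       counted; only the actual adds are */
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    lo = _mm_add_ps(lo, hi);
    lo = _mm_hadd_ps(lo, lo);
    lo = _mm_hadd_ps(lo, lo);
    return _mm_cvtss_f32(lo);
}
```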

So the FLOP/s of a Haswell core is

  • its SIMD vector width (8 float elements per vector)
  • times SIMD FMA per clock (2)
  • times FLOPs per FMA (2)
  • times clock speed (max single core turbo it can sustain while maxing out both FMA units; long-term depends on cooling, short term just depends on power limits).

For a whole CPU, not just a single core: multiply by the number of cores and use the max sustained clock speed with all cores busy, which is usually lower than single-core turbo on CPUs that have turbo at all.
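
Putting the numbers together (the 3.5 GHz sustained all-FMA clock below is an assumed placeholder, not a spec-sheet figure):

```c
/* Worked example of the Haswell-core peak, with an assumed clock
   (real sustained clocks vary by model, cooling, and power limits). */
double haswell_core_peak_flops(void) {
    double lanes         = 8.0;    /* float elements per 256-bit vector */
    double fma_per_clock = 2.0;    /* two FMA execution units */
    double flops_per_fma = 2.0;    /* mul + add */
    double clock_hz      = 3.5e9;  /* assumption, not a spec number */
    return lanes * fma_per_clock * flops_per_fma * clock_hz; /* 112 GFLOP/s */
}
```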

Intel and other CPU vendors don't count the fact that their CPUs can also sustain a `vandps` in parallel with 2 `vfmadd132ps` instructions per clock, because FP abs is not a difficult operation.

See also *How do I achieve the theoretical maximum of 4 FLOPs per cycle?* (It's actually more than 4 on modern CPUs :P)


Peak FLOPS (FP ops per second, or FLOP/s) isn't achievable if you have much other overhead taking up front-end bandwidth or creating other bottlenecks. The metric is just the raw amount of math you can do when running in a straight line, not on any specific practical problem.

That said, people would think it's silly if the theoretical peak were much higher than anything a carefully hand-tuned matmul or Mandelbrot could ever achieve, even for compile-time-constant problem sizes. That would happen if the front-end couldn't keep up with issuing any stores alongside the FMAs: for example, if Haswell had four FMA execution units, it could only sustain max FLOPs if literally every instruction were an FMA. Memory source operands could micro-fuse for loads, but there'd be no room to store results without hurting throughput.

The reason Intel doesn't have even 3 FMA units is that most real code has trouble saturating 2 FMA units, especially with only 2 load ports and 1 store port. A third unit would be wasted almost all of the time, and a 256-bit FMA unit takes a lot of transistors.

(Ice Lake widens the issue/rename stage of the pipeline to 5 uops/clock, but also widens the SIMD execution units to 512-bit with AVX-512 instead of adding a 3rd 256-bit FMA unit. It has 2/clock load and 2/clock store throughput, although that store throughput is only sustainable to L1d cache for 32-byte or narrower stores, not 64-byte.)

Peter Cordes

When it comes to optimisation, it is common practice to only measure FLOPs in the hotspots of your code, for example the number of floating-point multiply-accumulate operations in a convolution. This is mainly because other operations might be insignificant or irreplaceable and therefore can't be exploited for any kind of optimization.
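
For example, a sketch of counting FLOPs for a convolution hotspot (`conv1d` and its parameters are hypothetical names; each multiply-accumulate is counted as 2 FLOPs, per the usual convention):

```c
/* 1D convolution with n outputs and a k-tap kernel: n*k
   multiply-accumulates, i.e. 2*n*k FLOPs for the whole loop nest. */
void conv1d(const float *in, const float *coef, float *out,
            int n, int k) {
    for (int i = 0; i < n; i++) {
        float acc = 0.0f;
        for (int j = 0; j < k; j++)
            acc += in[i + j] * coef[j];   /* 1 mul + 1 add = 2 FLOPs */
        out[i] = acc;
    }
}
```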

For example, all instructions listed under Vector Floating Point Instructions in section A4.13 of the ARMv7 Reference Manual count as floating-point operations, since the FLOPs/cycle of an FPU instruction is typically constant on a given processor.

Not just ARM: many microprocessors have a dedicated floating-point unit (FPU), so when you measure FLOPs, you're measuring the speed of this unit. With this and FLOPs/cycle you can more or less calculate the theoretical peak performance.

But FLOPs are to be taken with a grain of salt: they only approximately estimate the speed of your code, because they fail to take into account the other conditions your processor operates under. This is why counting FLOPs only for your hotspots (usually arithmetic ops) is more or less enough in most cases.

Having said that, FLOPs can act as a comparative metric between two compute-heavy pieces of code, but they don't say much about your code per se.