2

I was studying FP and AVX recently, and on Wikipedia (https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#Applications) I read that AVX is used for FP calculations. I can't figure out why FP operations would be processed in a parallel (SIMD) environment. Also, in this forum thread (https://forums.aida64.com/topic/1629-real-world-benefit-of-fpu-test/) the AIDA64 administrator says that the FPU test uses AVX, etc.

  • 1
    Maybe start reading about [SIMD](https://en.wikipedia.org/wiki/SIMD) before reading about AVX. And then ask questions about the specific problems you want to solve. As-is, I think the question is a bit broad ... – chtz Mar 07 '19 at 15:07
  • Related: https://stackoverflow.com/questions/3206101/extended-80-bit-double-floating-point-in-x87-not-sse2-we-dont-miss-it – chtz Mar 07 '19 at 16:37

2 Answers

6

I just want to know whether AVX helps with single FP operations, like simply adding 3.5 to 1.5.

Yes, AVX is useful for scalar math, too, because it gives you 3-operand non-destructive operations. e.g.

vaddsd xmm1, xmm0, [b]

will put the 3.5 + 1.5 result into xmm1 without destroying the value in xmm0, unlike

addsd xmm0, [b]

Compilers use AVX instead of SSE for everything if you tell them they're allowed to do so. (gcc -march=haswell or gcc -march=znver1, or whatever.)
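
For example, a trivial scalar add like the C function below should compile to the 2-operand SSE2 addsd without AVX, and to the 3-operand vaddsd form with -mavx or -march=haswell (a minimal sketch; the exact asm depends on your compiler and options):

/* scalar_add.c: one scalar double addition (a minimal sketch).
   Assumption: gcc or clang targeting x86-64.
   gcc -O2 -S scalar_add.c         -> 2-operand, destructive addsd
   gcc -O2 -mavx -S scalar_add.c   -> 3-operand, non-destructive vaddsd */
double add(double a, double b) {
    return a + b;
}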

Peter Cordes
  • Thanks, that clears my confusion, but it also raises a lot of questions, like how FP calculations are done. I've studied integer math, but how is floating-point math done in hardware with the help of AVX? – Huzama Ahmad Mar 10 '19 at 19:32
  • @HuzamaAhmad: basically the same way as it's done with SSE/SSE2 `addsd`, but with a more flexible encoding for telling the CPU how to send data from registers/memory to the FP ALUs and back to registers. I don't understand the question. Are you asking how an FP multiplier is built? Something like multiply the mantissas, add the exponents, and normalize. – Peter Cordes Mar 11 '19 at 00:07
  • My question is how an FP ALU is built in hardware (a link to any useful website would be helpful). Also, if I was not clear: I was asking whether, since a single FP number has a sign bit, exponent, and significand bits, these three parts are added separately over three clock cycles, and whether with AVX we can add them in a single cycle, boosting performance. That's my understanding; I want to confirm it. – Huzama Ahmad Mar 11 '19 at 02:58
  • @HuzamaAhmad: well hardware FP is faster than software-emulated FP, but AVX isn't the only way to use the FPU. There's also SSE2 `addsd`, and x87 `fadd`. Talking about "clock cycles" is totally wrong here, though; on Skylake for example, `[v]addsd` has 4-cycle latency and 2-per-clock throughput. There are 2 FP mul/add/fma 256-bit wide pipelines, and each is fully pipelined to accept a new input every clock cycle. See [Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables?](//stackoverflow.com/q/45113527) for more about throughput vs. latency. – Peter Cordes Mar 11 '19 at 03:14
  • Using `addsd` only uses the low element of the SIMD FPU. Fun fact: `fadd` 80-bit floating point on Skylake uses separate hardware, on port 5 instead of port 0 / port 1, with 3 cycle latency. But anyway, yes of course the HW FPUs operate on all the fields at once, but especially for add they can't be "done separately" anyway. And the FPU has single-cycle *throughput*, but each operation takes multiple cycles because it's more complex than integer add, requiring normalization of the result etc. – Peter Cordes Mar 11 '19 at 03:19
  • A quick google for `fpu hardware design` found https://www.embedded.com/design/configurable-systems/4212239/Hardware-Based-Floating-Point-Design-Flow- among other results. Or https://opencores.org/projects/fpu says it has a verilog implementation of a complete IEEE754 FPU. – Peter Cordes Mar 11 '19 at 03:20
  • Thanks a lot, that helps a lot and gives me enough direction to research more. – Huzama Ahmad Mar 11 '19 at 14:33
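
To make the latency-vs-throughput point from the comments above concrete, here is a rough C sketch (the function names and loop structure are illustrative only; the Skylake numbers quoted are the ones from the comments, the code does not measure them). It assumes -O2 without -ffast-math, so the compiler keeps the floating-point dependency chains exactly as written:

/* fp_latency_vs_throughput.c: a rough sketch, not a rigorous benchmark. */

/* One long dependency chain: each add must wait for the previous result,
   so the loop is bounded by FP-add latency (~4 cycles per element on Skylake). */
double sum_dependent(const double *a, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Four independent accumulators: several adds can be in flight at once,
   so the loop is bounded by FP-add throughput rather than latency.
   (Leftover elements when n is not a multiple of 4 are ignored for brevity.) */
double sum_independent(const double *a, int n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (int i = 0; i + 3 < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}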
2

AVX is a SIMD extension to the CPU, which provides the capability to process 8 x single-precision or 4 x double-precision operations in one instruction. For applications where you are processing arrays of data homogeneously, you can therefore potentially get a 4x or 8x throughput improvement using AVX compared to using a single (scalar) FPU.

See also: FMA
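
As a rough sketch of the 8 x single-precision case, using AVX intrinsics (assuming n is a multiple of 8 and compiling with -mavx on gcc or clang; a plain scalar loop built with -O3 -mavx will often auto-vectorize to much the same code):

#include <immintrin.h>
#include <stddef.h>

/* dst[i] = a[i] + b[i]: one vaddps performs 8 single-precision adds at a time.
   Assumption: n is a multiple of 8; unaligned loads/stores keep the sketch simple. */
void add_arrays(float *dst, const float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);                /* load 8 floats from a */
        __m256 vb = _mm256_loadu_ps(b + i);                /* load 8 floats from b */
        _mm256_storeu_ps(dst + i, _mm256_add_ps(va, vb));  /* add and store 8 results */
    }
}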

Paul R
  • Thanks, I know about SIMD and I understand exactly what you said, but because of some things I read on the internet I'm confused about whether AVX is used for operating on a single FP number, and if yes, then how. – Huzama Ahmad Mar 07 '19 at 15:17
  • @HuzamaAhmad: your question is not very clear - please give a concrete example of something that you do not understand, otherwise it's just too vague to answer. – Paul R Mar 07 '19 at 15:22
  • I just want to know whether AVX helps with single FP operations, like simply adding 3.5 to 1.5. – Huzama Ahmad Mar 07 '19 at 15:26
  • 1
    @HuzamaAhmad So you actually want to know why nowadays compilers use the SSE/AVX unit instead of the x87-FPU (even for scalar math)? – chtz Mar 07 '19 at 15:48
  • @chtz Yes, exactly. Also, in what scenarios does it use AVX, and in what scenarios SSE? – Huzama Ahmad Mar 10 '19 at 19:30