2

I was studying FP and AVX recently, and on Wikipedia (https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#Applications) I read that AVX is used for FP calculations. I can't figure out why FP operations would be processed in a parallel (SIMD) environment. Also, in this forum thread (https://forums.aida64.com/topic/1629-real-world-benefit-of-fpu-test/) the AIDA64 administrator says that the FPU test uses AVX, etc.

  • 1
    Maybe start reading about [SIMD](https://en.wikipedia.org/wiki/SIMD) before reading about AVX. And then ask questions about the specific problems you want to solve. As-is, I think the question is a bit broad ... – chtz Mar 07 '19 at 15:07
  • Related: https://stackoverflow.com/questions/3206101/extended-80-bit-double-floating-point-in-x87-not-sse2-we-dont-miss-it – chtz Mar 07 '19 at 16:37

2 Answers

6

I just want to know whether AVX helps with single FP operations, like simply adding 3.5 to 1.5.

Yes, AVX is useful for scalar math, too, because it gives you 3-operand non-destructive operations. e.g.

vaddsd xmm1, xmm0, [b]

will put the 3.5 + 1.5 result into xmm1 without destroying the value in xmm0, unlike

addsd xmm0, [b]

Compilers use AVX instead of SSE for everything if you tell them they're allowed to do so. (gcc -march=haswell or gcc -march=znver1, or whatever.)
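
For example, a trivial scalar add like the C function below should compile to the 2-operand SSE2 addsd without AVX, and to the 3-operand vaddsd form with -mavx or -march=haswell (a minimal sketch; the exact asm depends on your compiler and options):

/* scalar_add.c: one scalar double addition (a minimal sketch).
   Assumption: gcc or clang targeting x86-64.
   gcc -O2 -S scalar_add.c         -> 2-operand, destructive addsd
   gcc -O2 -mavx -S scalar_add.c   -> 3-operand, non-destructive vaddsd */
double add(double a, double b) {
    return a + b;
}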

Peter Cordes
  • Thanks, that clears my confusion, but it also raises a lot of questions, like how FP calculations are done. I've studied integer math, but how is floating-point math done in hardware with the help of AVX? – Huzama Ahmad Mar 10 '19 at 19:32
  • @HuzamaAhmad: basically the same way as it's done with SSE/SSE2 `addsd`, but with a more flexible encoding for telling the CPU how to send data from registers/memory to the FP ALUs and back to registers. I don't understand the question. Are you asking how an FP multiplier is built? Something like multiply the mantissas, add the exponents, and normalize. – Peter Cordes Mar 11 '19 at 00:07
  • My question is how an FP ALU is built in hardware (a link to any useful website would be helpful). Also, if I was not clear: I was asking whether, since a single FP number has a sign bit, exponent, and significand bits, these three parts are added separately over three clock cycles, and whether with AVX we can add them in a single cycle, boosting performance. That's my understanding; I want to confirm it. – Huzama Ahmad Mar 11 '19 at 02:58
  • @HuzamaAhmad: well hardware FP is faster than software-emulated FP, but AVX isn't the only way to use the FPU. There's also SSE2 `addsd`, and x87 `fadd`. Talking about "clock cycles" is totally wrong here, though; on Skylake for example, `[v]addsd` has 4-cycle latency and 2-per-clock throughput. There are 2 FP mul/add/fma 256-bit wide pipelines, and each is fully pipelined to accept a new input every clock cycle. See [Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables?](//stackoverflow.com/q/45113527) for more about throughput vs. latency. – Peter Cordes Mar 11 '19 at 03:14
  • Using `addsd` only uses the low element of the SIMD FPU. Fun fact: `fadd` 80-bit floating point on Skylake uses separate hardware, on port 5 instead of port 0 / port 1, with 3 cycle latency. But anyway, yes of course the HW FPUs operate on all the fields at once, but especially for add they can't be "done separately" anyway. And the FPU has single-cycle *throughput*, but each operation takes multiple cycles because it's more complex than integer add, requiring normalization of the result etc. – Peter Cordes Mar 11 '19 at 03:19
  • A quick google for `fpu hardware design` found https://www.embedded.com/design/configurable-systems/4212239/Hardware-Based-Floating-Point-Design-Flow- among other results. Or https://opencores.org/projects/fpu says it has a verilog implementation of a complete IEEE754 FPU. – Peter Cordes Mar 11 '19 at 03:20
  • Thanks a lot, that helps a lot and gives me enough direction to research more. – Huzama Ahmad Mar 11 '19 at 14:33
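
To make the latency-vs-throughput point from the comments above concrete, here is a rough C sketch (the function names and loop structure are illustrative only; the Skylake numbers quoted are the ones from the comments, the code does not measure them). It assumes -O2 without -ffast-math, so the compiler keeps the floating-point dependency chains exactly as written:

/* fp_latency_vs_throughput.c: a rough sketch, not a rigorous benchmark. */

/* One long dependency chain: each add must wait for the previous result,
   so the loop is bounded by FP-add latency (~4 cycles per element on Skylake). */
double sum_dependent(const double *a, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Four independent accumulators: several adds can be in flight at once,
   so the loop is bounded by FP-add throughput rather than latency.
   (Leftover elements when n is not a multiple of 4 are ignored for brevity.) */
double sum_independent(const double *a, int n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (int i = 0; i + 3 < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}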
2

AVX is a SIMD extension to the CPU, which provides the capability to process 8 x single-precision or 4 x double-precision operations in one instruction. For applications where you are processing arrays of data homogeneously, you can therefore potentially get a 4x or 8x throughput improvement using AVX compared to using a single (scalar) FPU.

See also: FMA
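
As a rough sketch of the 8 x single-precision case, using AVX intrinsics (assuming n is a multiple of 8 and compiling with -mavx on gcc or clang; a plain scalar loop built with -O3 -mavx will often auto-vectorize to much the same code):

#include <immintrin.h>
#include <stddef.h>

/* dst[i] = a[i] + b[i]: one vaddps performs 8 single-precision adds at a time.
   Assumption: n is a multiple of 8; unaligned loads/stores keep the sketch simple. */
void add_arrays(float *dst, const float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);                /* load 8 floats from a */
        __m256 vb = _mm256_loadu_ps(b + i);                /* load 8 floats from b */
        _mm256_storeu_ps(dst + i, _mm256_add_ps(va, vb));  /* add and store 8 results */
    }
}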

Paul R
  • Thanks, I know about SIMD and I understand exactly what you said, but because of some things I read on the internet I'm confused about whether AVX is used for operating on a single FP number, and if yes, then how. – Huzama Ahmad Mar 07 '19 at 15:17
  • @HuzamaAhmad: your question is not very clear - please give a concrete example of something that you do not understand, otherwise it's just too vague to answer. – Paul R Mar 07 '19 at 15:22
  • I just want to know whether AVX helps with single FP operations, like simply adding 3.5 to 1.5. – Huzama Ahmad Mar 07 '19 at 15:26
  • 1
    @HuzamaAhmad So you actually want to know why nowadays compilers use the SSE/AVX unit instead of the x87-FPU (even for scalar math)? – chtz Mar 07 '19 at 15:48
  • @chtz Yes, exactly. Also, in what scenarios does it use AVX, and in what scenarios SSE? – Huzama Ahmad Mar 10 '19 at 19:30