I'm learning SIMD intrinsics and parallel computing. I am not sure if Intel's definition for the x86 instruction sqrtpd says that the square root of the two numbers that are passed to it will be calculated at the same time:


Performs a SIMD computation of the square roots of the two, four, or eight packed double-precision floating-point values in the source operand (the second operand) and stores the packed double-precision floating-point results in the destination operand (the first operand).


I understand that it explicitly says SIMD computation but does this imply that for this operation the root will be calculated simultaneously for both numbers?
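To make the question concrete, here is a minimal sketch (added for illustration) of the intrinsic form: `_mm_sqrt_pd` is the SSE2 intrinsic that compilers turn into a single `sqrtpd xmm` instruction.

```c
#include <stdio.h>
#include <immintrin.h>

int main(void) {
    // Two doubles packed into one 128-bit register (high lane first in _mm_set_pd).
    __m128d in  = _mm_set_pd(2.0, 9.0);
    // One sqrtpd instruction: does this compute both square roots at once?
    __m128d out = _mm_sqrt_pd(in);

    double res[2];
    _mm_storeu_pd(res, out);
    printf("%f %f\n", res[0], res[1]);   // 3.000000 1.414214
    return 0;
}
```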

stht55
  • Is your question whether this really happens at the same time, or whether "after" the instruction you have the resulting square roots for all input elements? Especially for 256-bit AVX (I think) some CPUs will de facto compute two pairs of sqrts after each other -- but the results should be available in the same cycle. – chtz Jun 05 '22 at 17:24

3 Answers

5

For sqrtpd xmm, yes, modern CPUs do it truly in parallel, not by running the elements through a narrower execution unit one at a time. Older (especially low-power) CPUs did do the latter. For AVX vsqrtpd ymm, some CPUs do perform it in two halves.

But if you're just comparing performance numbers against narrower operations, note that some CPUs like Skylake can use different halves of their wide div/sqrt unit for separate sqrtpd/sd xmm operations, so those have twice the throughput of YMM, even though they can do a full vsqrtpd ymm in parallel.

Same for AVX-512 vsqrtpd zmm: even Ice Lake splits it into two halves, as we can see from it being 3 uops (2 for port 0, where Intel puts the div/sqrt unit, plus one that can run on other ports).

Being 3 uops is the key tell-tale on Intel for a sqrt instruction being wider than the execution unit, but you can also compare the throughput of YMM vs. XMM vs. scalar XMM to see whether the CPU can feed narrower operations to different pipes of a wide execution unit independently.


The only difference is performance; the destination x/y/zmm register definitely has the square roots of each input element. Check performance numbers (and uop counts) on https://uops.info/ (currently down but normally very good), and/or https://agner.org/optimize/.
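If uops.info is unreachable, a rough way to see the throughput-vs-latency difference yourself is to time a dependent chain of sqrts against a batch of independent ones. This is only a hedged toy sketch (the timer, loop counts, and `asm` barrier are my own choices, and `__rdtsc` counts reference cycles, not core clock cycles); the instruction tables are measured far more carefully.

```c
#include <stdio.h>
#include <immintrin.h>
#include <x86intrin.h>   // __rdtsc on GCC/Clang

#define ITERS 10000000L

int main(void) {
    __m128d x = _mm_set1_pd(1.2345);

    // Latency: each sqrt depends on the previous result, so they serialize.
    // Caveat: the chain converges toward 1.0, which can be a fast case on CPUs
    // with input-dependent sqrt timing; careful tools keep the inputs "hard".
    __m128d dep = x;
    unsigned long long t0 = __rdtsc();
    for (long i = 0; i < ITERS; i++)
        dep = _mm_sqrt_pd(dep);
    unsigned long long t1 = __rdtsc();

    // Throughput: four independent sqrts per iteration that the CPU may overlap.
    unsigned long long t2 = __rdtsc();
    for (long i = 0; i < ITERS; i++) {
        __m128d r0 = _mm_sqrt_pd(x);
        __m128d r1 = _mm_sqrt_pd(x);
        __m128d r2 = _mm_sqrt_pd(x);
        __m128d r3 = _mm_sqrt_pd(x);
        // Empty asm keeps the results "used" and x "modified", so the
        // loop-invariant sqrts can't be hoisted or deleted by the optimizer.
        asm volatile("" : "+x"(x), "+x"(r0), "+x"(r1), "+x"(r2), "+x"(r3));
    }
    unsigned long long t3 = __rdtsc();

    printf("dependent chain: ~%.1f ref cycles per sqrtpd (latency-bound)\n",
           (double)(t1 - t0) / ITERS);
    printf("independent ops: ~%.1f ref cycles per sqrtpd (throughput-bound)\n",
           (double)(t3 - t2) / (4.0 * ITERS));
    // Use dep so the latency chain isn't optimized away either.
    printf("(ignore: %f)\n", _mm_cvtsd_f64(dep));
    return 0;
}
```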

It's allowed but not guaranteed that CPUs internally have wide execution units, as wide as the widest vectors they support, and thus truly compute all results in parallel pipes.

Full-width execution units are common for instructions other than divide and square root, although AMD CPUs from Bulldozer up to (but not including) Zen 2 supported AVX/AVX2 with only 128-bit execution units, so vaddps ymm decoded to 2 uops, doing each half separately. Intel Alder Lake E-cores work the same way.

Some ancient and/or low-power CPUs (like Pentium-M, K8, and Bobcat) had only 64-bit-wide execution units, running SSE instructions in two halves (for all instructions, not just "hard" ones like div/sqrt).

So far only Intel has supported AVX-512 on any CPUs, and (other than div/sqrt) they've all had full-width execution units. And unfortunately they haven't come up with a way to expose the powerful new capabilities like masking and better shuffles for 128 and 256-bit vectors on CPUs without the full AVX-512. There's some really nice stuff in AVX-512 totally separate from wider vectors.
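As an aside (my illustration, not part of the original answer), this is the sort of thing AVX-512VL enables even on 128-bit vectors: per-lane masking, here used to take square roots only of the non-negative lanes. It needs an AVX-512 CPU and something like `gcc -O2 -mavx512vl`.

```c
#include <immintrin.h>

// Take sqrt only where the lane is >= 0; masked-off lanes keep their old value
// instead of turning into NaN. vsqrtpd xmm{k}, xmm needs AVX-512F + AVX-512VL.
__m128d sqrt_nonnegative_lanes(__m128d v) {
    __mmask8 ok = _mm_cmp_pd_mask(v, _mm_setzero_pd(), _CMP_GE_OQ);
    return _mm_mask_sqrt_pd(v, ok, v);
}
```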


The SIMD div / sqrt unit is often narrower than others

Divide and square root are inherently slow; it's not really possible to make them low-latency. They're also expensive to pipeline fully: no current CPU can start a new div/sqrt operation every clock cycle. But recent CPUs do pipeline at least part of the operation: I think they normally end with a couple of steps of Newton-Raphson refinement, and that part can be pipelined since it only involves multiply/add/FMA type operations.
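That refinement idea is the same one you can use in software: start from the coarse hardware estimate (`_mm256_rsqrt_ps`, roughly 12-bit accurate) and polish it with one Newton step built from nothing but multiplies and subtracts. A hedged sketch, not anyone's production code; accuracy and the zero-input fixup are the usual caveats, as the comments below also discuss.

```c
#include <immintrin.h>

// Approximate float sqrt via rsqrt + one Newton-Raphson step. Not a drop-in
// replacement for _mm256_sqrt_ps: roughly 22-23 bits of accuracy instead of
// correctly rounded, and x == 0.0 needs the blend at the end to avoid 0*inf = NaN.
static inline __m256 sqrt_rsqrt_nr(__m256 x) {
    __m256 y = _mm256_rsqrt_ps(x);                 // y ~= 1/sqrt(x), ~12 bits
    // One Newton step: y = y * (3 - x*y*y) / 2, built from mul/sub only.
    __m256 xyy = _mm256_mul_ps(_mm256_mul_ps(x, y), y);
    y = _mm256_mul_ps(_mm256_mul_ps(_mm256_set1_ps(0.5f), y),
                      _mm256_sub_ps(_mm256_set1_ps(3.0f), xyy));
    __m256 r = _mm256_mul_ps(x, y);                // sqrt(x) = x * 1/sqrt(x)
    // Fix up x == 0.0, where rsqrt returns +inf and x * y would be NaN.
    __m256 zero = _mm256_setzero_ps();
    return _mm256_blendv_ps(r, zero, _mm256_cmp_ps(x, zero, _CMP_EQ_OQ));
}
```

Whether something like this beats the hardware sqrt depends on what else is competing for the FMA ports versus the divider, which is exactly the trade-off discussed in the comments further down.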

Intel has supported AVX since Sandybridge, but it wasn't until Skylake that they widened the FP div/sqrt unit to 256-bit.

For example, Haswell runs vsqrtpd ymm as 3 uops, 2 for port 0 (where the div/sqrt unit is) and one for any port, presumably to recombine the results. The latency is just about a factor of 2 longer, and throughput is half. (A uop reading the result needs to wait for both halves to be ready.)

Agner Fog may have tested latency with vsqrtpd ymm reading its own result; IDK if Intel CPUs can let one half of the operation start before the other half of the input is ready, or if the merging uop (or whatever it is) would end up forcing it to wait for both halves to be ready before starting either half of another div or sqrt. Instructions other than div/sqrt have full-width execution units and would always need to wait for both halves.

I also collected divps / pd / sd / ss throughputs and latencies for YMM and XMM on various CPUs in a table on Floating point division vs floating point multiplication.

Peter Cordes
  • What about computing half width with SIMD and the other half with an equally fast SIMDified Q3_inverse_Square_root? Can ooo-execution hide the latency of SIMD-sqrt in there? I guess they will use different ports? – huseyin tugrul buyukisik Jul 14 '22 at 15:06
  • 1
    @huseyintugrulbuyukisik: That's a good idea, for some use-cases, unrolling and doing one vector with `vsqrtps`, another vector with `vrsqrtps` + Newton iteration + multiply + fixup (for the `0.0` input case which should give 0, not NaN). That would balance the front-end throughput and FMA-port bottleneck with the divider bottleneck. If you're *only* doing sqrt over an array in a loop, it might still be faster to use only the RSQRT method. – Peter Cordes Jul 14 '22 at 23:35
  • 1
    Of course, you wouldn't actually use the Q3 inverse square root! You'd use `_mm256_rsqrt_ps` as an initial approximation before your Newton iteration ([Fast vectorized rsqrt and reciprocal with SSE/AVX depending on precision](https://stackoverflow.com/q/31555260)). The Quake 3 method has been obsolete since SSE1 for code that can use SSE. ([Is it still worth using the Quake fast inverse square root algorithm nowadays on x86-64?](https://stackoverflow.com/q/71608126)). (I corrected [your answer on another Q](https://stackoverflow.com/a/72319428/224132) about that recently, @huseyin) – Peter Cordes Jul 14 '22 at 23:37
  • Thanks for that. Yes, I meant full square root calculation and nothing else. This low throughput of sqrt compared to adds & muls happens on GPUs too. – huseyin tugrul buyukisik Jul 15 '22 at 07:48
3

To complete the great answer of @PeterCordes, this is indeed dependent on the architecture. One can expect the two square roots to be computed in parallel (or efficiently pipelined at the ALU level) on most recent mainstream processors, though. Here are the latency and throughput numbers for Intel architectures (you can get them from Intel's documentation):

| Architecture | Latency (single) | Latency (packed XMM) | Throughput (single) | Throughput (packed XMM) |
|---|---|---|---|---|
| Skylake | 18 | 18 | 6 | 6 |
| Knights Landing | 40 | 38 | 33 | 10 |
| Broadwell | 20 | 20 | 7 | 13 |
| Haswell | 20 | 20 | 13 | 13 |
| Ivy Bridge | 21 | 21 | 14 | 14 |

The throughput (number of cycles per instruction) is generally what matters in SIMD code, as long as out-of-order exec can overlap the latency chains of independent iterations. As you can see, on Skylake, Haswell and Ivy Bridge the throughput is the same, meaning that sqrtsd and sqrtpd xmm are equally fast. The pd version gets twice as much work done, so it must be computing two elements in parallel. Note that Coffee Lake, Cannon Lake and Ice Lake have the same timings as Skylake for this specific instruction.

For Broadwell, sqrtpd does not execute the operation in parallel on the two lanes. Instead, it pipelines the operation and most of the computation is serialized (sqrtpd takes 1 cycle less than two sqrtsd). Or it has a parallel 2x 64-bit div/sqrt unit, but can independently use halves of it for scalar sqrt, which would explain the latency being the same but the throughput being better for scalar instructions (like how Skylake is for sqrt ymm vs. xmm).

For the KNL Xeon Phi, the results are a bit surprising, as sqrtpd xmm is much faster than sqrtsd while computing more items in parallel. Agner Fog's testing confirmed that, and that the scalar version takes many more uops. It's hard to imagine why; just merging the scalar result into the bottom of an XMM register shouldn't be much different from merging an XMM into the bottom of a ZMM, which is the same speed as a full vsqrtpd zmm. (KNL is optimized for AVX-512 with 512-bit registers, but it's also slow at div/sqrt in general; you're intended to use vrsqrt28pd on Xeon Phi CPUs to get an approximation that only needs one Newton iteration to get close to double precision. Other AVX-512 CPUs only support vrsqrt14pd/ps, lacking the AVX-512ER extension.)

PS: It turns out that Intel reports the maximum throughput cost (worst case) when it is variable (0.0 is one of the best cases, for example). The latency is a bit different from the one reported in Agner Fog's instruction tables. The overall analysis remains the same, though.

Peter Cordes
Jérôme Richard
  • 1
    I think you're drawing the wrong conclusion about Broadwell; they wouldn't narrow the execution unit *and* really heavily pipeline it; I think they're just getting clever with scheduling scalar operations to halves of it independently, boosting `sqrtsd` throughput. That's why the numbers are identical to Haswell, except for `sqrtsd` throughput. I edited your answer to mention that possibility (which I think is much more likely), but left in your guess. If you agree with my reasoning, you might want to take out that guess about a narrower unit that is used serially for `sqrtpd xmm`. – Peter Cordes Jun 05 '22 at 20:38
0

Yes, SIMD (vector) instructions on packed operands perform the same operation on all vector elements "in parallel". This follows from the fact that sqrtsd (scalar square root on one double) and sqrtpd (packed square root on two doubles in a 128-bit register) have the same latency.

vsqrtpd for 256-bit and larger vectors may have higher latency on some processors, as the operation is performed on 128-bit parts of the vector sequentially. This may be true for vdivpd as well, but not for other instructions - most of the time you can expect the latency to be the same regardless of the vector size. Consult instruction tables if you want to be sure.

Andrey Semashev