
Why is clang turning fabs(double) into vandps instead of vandpd (like GCC does)?


Example from Compiler Explorer:

#include <math.h>

double float_abs(double x) {
    return fabs(x);
}

clang 12.0.1 -std=gnu++11 -Wall -O3 -march=znver3

.LCPI0_0:
        .quad   0x7fffffffffffffff              # double NaN
        .quad   0x7fffffffffffffff              # double NaN
float_abs(double):                          # @float_abs(double)
        vandps  xmm0, xmm0, xmmword ptr [rip + .LCPI0_0]
        ret

gcc 11.2 -std=gnu++11 -Wall -O3 -march=znver3

float_abs(double):
        vandpd  xmm0, xmm0, XMMWORD PTR .LC0[rip]
        ret
.LC0:
        .long   -1
        .long   2147483647
        .long   0
        .long   0

(Ironically, GCC uses vandpd but defines the constant in 32-bit .long chunks (interestingly, with the upper 8 bytes of the 16-byte constant zeroed), while clang uses vandps but defines the constant as two identical .quad halves.)
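
Both constants put the same sign-clearing mask, 0x7FFFFFFFFFFFFFFF, in the low qword, which is the only element the scalar result uses. For reference, a minimal C sketch of the bit trick both compilers are implementing (illustrative only, not code either of them emits):

#include <stdint.h>
#include <string.h>

// fabs(double) as a bitwise operation: clear bit 63, the sign bit.
double abs_via_mask(double x) {
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);  // type-pun without strict-aliasing UB
    bits &= 0x7FFFFFFFFFFFFFFFull;   // same mask as .LCPI0_0 / .LC0
    memcpy(&x, &bits, sizeof x);
    return x;
}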

  • My question is, why doesn't either of them use scalar? – Jester Sep 07 '21 at 23:07
  • The scalar instruction works only on x87. – soc Sep 07 '21 at 23:11
  • Never mind, there is no `vandss` *facepalm* – Jester Sep 07 '21 at 23:17
  • They're exactly equivalent, right? Bitwise AND is the same operation whether you think of it as operating on two 64-bit elements or on four 32-bit elements. The instructions are the same length, too. So probably just an arbitrary choice. – Nate Eldredge Sep 07 '21 at 23:52
  • Unless there is somehow a bypass delay issue with one or the other? I'm aware of that effect when mixing integer and floating-point vector instructions, i.e. https://stackoverflow.com/questions/26942952/difference-between-the-avx-instructions-vxorpd-and-vpxor?noredirect=1&lq=1, but not sure if it applies when mixing floating-point instructions with different element sizes. Given that the type is `double`, we assume the value in `xmm0` is the result of a 64-bit load or FP arithmetic instruction, so if there's any difference `vandpd` is presumably better, but there may not be any. – Nate Eldredge Sep 07 '21 at 23:59
  • @NateEldredge: AFAIK, no x86 CPUs have ever had bypass delays between ps and pd instructions for bitwise booleans, only [between actual FP math instructions on Bulldozer-family](https://stackoverflow.com/questions/62111946/what-is-the-point-of-sse2-instructions-such-as-orpd#comment110006402_62112042). (The likely reason is the CPU keeping some FP normalization metadata associated with the vector for the shorter-latency FMA unit -> FMA unit special forwarding path.) I thought I'd mentioned that on at least one of the multiple answers I've written about this, but it seems not :/ – Peter Cordes Sep 08 '21 at 00:16
  • @NateEldredge: I did find a mention that no CPUs have bypass delays between float and double in my answer on [What is the point of SSE2 instructions such as orpd?](https://stackoverflow.com/q/62111946) though. And that probably explains why clang likes to use `ps`: the non-VEX encoding is shorter, and maybe it simplifies the compiler logic to *always* use the `ps` version instead of only doing that for legacy-SSE encodings, not for AVX. (The Bulldozer reformatting issue is only addps -> addpd or whatever, which is nonsensical, not with booleans between.) – Peter Cordes Sep 08 '21 at 00:18

1 Answer


TL:DR: Probably because it's easier for the optimizer / code-generator to always do this, instead of only with legacy-SSE instructions to save code-size. There's no performance downside, and they're architecturally equivalent (i.e. no correctness difference.)


Probably clang always "normalizes" architecturally equivalent instructions to their ps version, because those have a shorter machine-code encoding for the legacy-SSE versions.

No existing x86 CPUs have any bypass-delay latency for forwarding between ps and pd instructions¹, so it's always safe to use [v]andps between [v]mulpd or [v]fmadd...pd instructions.

As [What is the point of SSE2 instructions such as orpd?](https://stackoverflow.com/q/62111946) points out, instructions like movupd and andpd are completely useless wastes of space that only exist for decoder consistency: a 66 prefix in front of an SSE1 opcode always does the pd version of it. It might have been smarter to save some of that coding space for future extensions, but Intel didn't do that.
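
For concreteness, the four encodings compare like this (hand-assembled from the ModRM/VEX encoding rules, with arbitrary register operands):

        andps  xmm0, xmm1           # 0F 54 C1        3 bytes (SSE1)
        andpd  xmm0, xmm1           # 66 0F 54 C1     4 bytes (66 prefix + same opcode)
        vandps xmm0, xmm0, xmm1     # C5 F8 54 C1     4 bytes (VEX)
        vandpd xmm0, xmm0, xmm1     # C5 F9 54 C1     4 bytes (VEX; pd only flips the pp field)

So the legacy-SSE pd form is the only one that costs an extra byte; the VEX forms are the same length either way.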

Or perhaps the motivation was the future possibility of a CPU that did have separate SIMD-double vs. SIMD-float domains, since it was early days for Intel's FP SIMD in general when SSE2 was being designed on paper. These days we can say that's unlikely, because FMA units take a lot of transistors and can apparently be built to share some mantissa-multiplier hardware between one 53-bit mantissa per 64-bit element and two 23-bit mantissas per pair of 32-bit elements.

Having separate forwarding domains would probably only be useful if you also had separate execution units for float vs. double math, not sharing transistors, unless you had different input and output ports for different types but the same actual internals? IDK enough about that level of CPU design detail.


There's no advantage to ps for the AVX VEX-encoded versions, but also no disadvantage, so it's probably simpler for LLVM's optimizer / code generator to just always do that instead of ever caring about trying to respect the source intrinsics. (Clang / LLVM doesn't in general try to do that, e.g. it freely optimizes shuffle intrinsics into different shuffles. Often this is good, but sometimes it de-optimizes carefully crafted intrinsics when it doesn't know a trick that the author of the intrinsics did.)

e.g. LLVM probably thinks in terms of "FP-domain 128-bit bitwise AND", and knows the instruction for that is andps / vandps. There's no reason for clang to even know that vandpd exists, because there's no case where it would help to use it.
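
An easy way to see this normalization (a tiny demo of mine, checkable on Compiler Explorer): even when the source explicitly asks for the pd form via an intrinsic, clang still picks the ps instruction:

#include <immintrin.h>

// The source explicitly requests the pd bitwise AND...
__m128d and_pd_demo(__m128d a, __m128d b) {
    return _mm_and_pd(a, b);  // ...clang -O3 emits [v]andps anyway; GCC keeps [v]andpd
}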


Footnote 1: Bulldozer hidden metadata and forwarding between math instructions:
AMD Bulldozer-family has a penalty for nonsensical things like mulps -> mulpd, for actual FP math instructions that actually care about the sign/exponent/mantissa components of an FP value (not booleans or shuffles).

It basically never makes sense to treat the concatenation of two IEEE binary32 FP values as a binary64, so this isn't a problem that needs to be worked around. It's mostly just something that gives us insight into how the CPU internals might be designed.

In the Bulldozer-family section of Agner Fog's microarch guide, he explains that the bypass delay for forwarding between two math instructions that run on the FMA units is 1 cycle lower than if another instruction is in the way. e.g. addps / orps / addps has worse latency than addps / addps / orps, assuming those three instructions form a dependency chain.
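
Spelled out as register dependency chains (illustrative asm; each instruction consumes the previous one's result in xmm0):

        # worse latency on Bulldozer: orps breaks the FMA-unit fast-forwarding path
        addps   xmm0, xmm1
        orps    xmm0, xmm2
        addps   xmm0, xmm3

        # better: math -> math forwards directly, the boolean hangs off the end
        addps   xmm0, xmm1
        addps   xmm0, xmm2
        orps    xmm0, xmm3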

But for a crazy thing like addps / addpd / orps, you get extra latency. But not for addps / orps / addpd. (orps vs orpd never makes a difference here. shufps would also be equivalent.)

The likely explanation is that BD kept extra stuff with vector elements to be reused in that special forwarding case, to maybe avoid some formatting / normalization work when forwarding FMA->FMA. If it's in the wrong format, that optimistic approach has to recover and do the architecturally required thing, but again, that only happens if you actually treat the result of a float FMA/add/mul as doubles, or vice versa.

addps could forward to a shuffle like unpcklpd without delay, so it's not evidence of 3 separate bypass networks, or any justification for the use (or existence) of andpd / orpd.

Peter Cordes