
Considering this function,

float mulHalf(float x) {
    return x * 0.5f;
}

the following function produces the same result for normal inputs and outputs.

float mulHalf_opt(float x) {
    __m128i e = _mm_set1_epi32(-1u << 23);
    __asm__ ("paddd\t%0, %1" : "+x"(x) : "xm"(e));
    return x;
}

This is the assembly output with -O3 -ffast-math.

mulHalf:
        mulss   xmm0, DWORD PTR .LC0[rip]
        ret

mulHalf_opt:
        paddd   xmm0, XMMWORD PTR .LC1[rip]
        ret

-ffast-math enables -ffinite-math-only which "assumes that arguments and results are not NaNs or +-Infs" [1].

So with -ffast-math on, the compiled output of mulHalf could arguably use paddd instead, if doing so produces faster code within the tolerance -ffast-math allows.

I got the following tables from Intel Intrinsics Guide.

(MULSS)
Architecture    Latency Throughput (CPI)
Skylake         4       0.5
Broadwell       3       0.5
Haswell         5       0.5
Ivy Bridge      5       1

(PADDD)
Architecture    Latency Throughput (CPI)
Skylake         1       0.33
Broadwell       1       0.5
Haswell         1       0.5
Ivy Bridge      1       0.5

Clearly, paddd is a faster instruction. Then I thought maybe it's because of the bypass delay between integer and floating-point units.

This answer shows a table from Agner Fog.

Processor                       Bypass delay, clock cycles 
  Intel Core 2 and earlier        1 
  Intel Nehalem                   2 
  Intel Sandy Bridge and later    0-1 
  Intel Atom                      0 
  AMD                             2 
  VIA Nano                        2-3 

Seeing this, paddd still seems like a winner, especially on CPUs later than Sandy Bridge, but specifying -march for recent CPUs just changes mulss to vmulss, which has similar latency/throughput.

Why don't GCC and Clang optimize multiplication by 2^n with a float to paddd even with -ffast-math?

xiver77
    Not every possible optimization is implemented. I think you should report it as missed optimization. Maybe someone will look at it (IMO from my experience, very unlikely). You can also submit the solution which increases the chance of being seen. – 0___________ May 24 '22 at 18:45
  • @0___________ I mean, multiplication by a `2^n` constant is a very common operation in various codebases. I don't think compiler writers over many decades didn't implement this optimization simply because they missed thinking about this. – xiver77 May 24 '22 at 18:49
    When the input is `0.0`, your "optimized" version returns `-inf`... https://godbolt.org/z/vG517vs6f – Nate Eldredge May 24 '22 at 18:56
  • @NateEldredge Oh.. so embarrassing.. I used that optimized version to halve the width/height, which couldn't be `0`, so that's why it didn't break.. Thanks. – xiver77 May 24 '22 at 19:00
  • Related: [Fast multiplication/division by 2 for floats and doubles (C/C++)](https://stackoverflow.com/q/7720668) / [Why doesn't a compiler optimize floating-point \*2 into an exponent increment?](https://stackoverflow.com/q/12919184) – Peter Cordes Oct 03 '22 at 21:17

1 Answer


This fails for an input of 0.0f, which -ffast-math doesn't rule out. (Even though technically that's a special case of a subnormal that just happens to also have a zero mantissa.)

Integer subtraction would wrap to an all-ones exponent field, and flip the sign bit, so you'd get 0.0f * 0.5f producing -Inf, which is simply not acceptable.

@chtz points out that the +0.0f case can be repaired by using psubusw, but that still fails for -0.0f -> +Inf. So unfortunately that's not usable either, even with -ffast-math allowing the "wrong" sign of zero. But being fully wrong for infinities and NaNs is also undesirable even with fast-math.


Other than that, yes I think this would work, and pay for itself in bypass latency vs. ALU latency on CPUs other than Nehalem, even if used between other FP instructions.

The 0.0 behaviour is a showstopper. Besides that, the underflow behaviour is a lot less desirable than with FP multiply for other inputs, e.g. producing a subnormal even when FTZ (flush to zero on output) is set. Code that reads it with DAZ set (denormals are zero) would still handle it properly, but the FP bit-pattern might also be wrong for a number with the minimum normalized exponent (encoded as 1) and a non-zero mantissa. E.g. you could get a bit-pattern of 0x00000001 as a result of multiplying a normalized number by 0.5f.

Even if not for the 0.0f showstopper, this weirdness might be more than GCC would be willing to inflict on people. So I wouldn't expect it even for cases where GCC can prove non-zero, unless it could also prove far from FLT_MIN. That may be rare enough not to be worth looking for.

You can certainly do it manually when you know it's safe, although it's much more convenient with SIMD intrinsics. I'd expect rather bad asm from scalar type-punning, probably 2x movd around an integer sub, instead of keeping the value in an XMM register for paddd when you only want the low scalar FP element.

Godbolt for several attempts, including straightforward intrinsics which clang compiles to just a memory-source paddd like we hoped. Clang's shuffle optimizer sees that the upper elements are "dead" (_mm_cvtss_f32 only reads the bottom one), and is able to treat them as "don't care".

// clang compiles this fully efficiently
// others waste an instruction or more on _mm_set_ss to zero the upper XMM elements
float mulHalf_opt_intrinsics(float x) {
    __m128i e = _mm_set1_epi32(-1u << 23);
    __m128 vx = _mm_set_ss(x);
    vx = _mm_castsi128_ps( _mm_add_epi32(_mm_castps_si128(vx), e) );
    return _mm_cvtss_f32(vx);
}

And a plain scalar version. I haven't tested whether it can auto-vectorize, but it might conceivably do so. Without that, both GCC and clang do movd/add/movd (or sub) to bounce the value through a GP-integer register.

#include <stdint.h>
#include <string.h>

float mulHalf_opt_memcpy_scalar(float x) {
    uint32_t xi;
    memcpy(&xi, &x, sizeof(x));
    xi += -1u << 23;
    memcpy(&x, &xi, sizeof(x));
    return x;
}
Peter Cordes
  • I got a comment about the `0` case just before your answer. I've no idea why I missed that simple case, although it didn't occur in my actual code. – xiver77 May 24 '22 at 19:12
  • BTW type punning with a `union` or `memcpy` produces horrible code, especially with GCC (https://godbolt.org/z/zGcfecenK). – xiver77 May 24 '22 at 19:14
  • @xiver77: Yeah, I saw Nate's comment pop up after I had half of this answer typed. It didn't immediately occur to me either, though: at first I was thinking in terms of nasty wrap-around behaviour for sub-normal inputs or generating weird subnormals as quality-of-implementation issues (weirder than FTZ/DAZ), and only then thought about "and if the mantissa is zero... oh wait that's 0.0") – Peter Cordes May 24 '22 at 19:23
  • @xiver77: https://godbolt.org/z/bKenf5xzd compiles less badly: don't write part of an object and then read the whole thing; compilers hate that. If you're going to use a union, use `float f[4]`. Or use intrinsics like a normal person, with `_mm_set_ss(x)` and `_mm_cvtss_f32(vx)`. Clang fully optimizes that, GCC doesn't realize that the upper 3 elements are don't-care and still zeros them. GCC12.1 picks a longer strategy than GCC11.3 with `-march=haswell`, so I used that, just `vinsertps` to zero the upper 3. Or just scalar type-pun, although that confirms my guess of movd/add/movd. – Peter Cordes May 24 '22 at 19:35
  • If the input is known to be positive, you could use `psubusw` to at least guarantee that `0.f` --> `0.f` and subnormal --> subnormal (instead of NaN/Inf). Of course, a lot of cases still give wrong results, e.g. `inf*0.5` is finite and small normal numbers give wrong subnormal numbers. – chtz May 26 '22 at 10:09
  • @chtz: Good idea, but `-0.0f` can occur naturally, and that would result in `+Inf`. So that's still a showstopper even for `-ffast-math` assumptions about the lack of infinities and subnormals. – Peter Cordes May 26 '22 at 15:35