5

I'm trying to efficiently implement SHLD and SHRD instructions of x86 without using inline assembly.

uint32_t shld_UB_on_0(uint32_t a, uint32_t b, uint32_t c) {
    return a << c | b >> 32 - c;
}

seems to work, but invokes undefined behaviour when c == 0 because the second shift's operand becomes 32. The actual SHLD instruction with third operand being 0 is well defined to do nothing. (https://www.felixcloutier.com/x86/shld)

uint32_t shld_broken_on_0(uint32_t a, uint32_t b, uint32_t c) {
    return a << c | b >> (-c & 31);
}

doesn't invoke undefined behaviour, but when c == 0 the result is a | b instead of a.

uint32_t shld_safe(uint32_t a, uint32_t b, uint32_t c) {
    if (c == 0) return a;
    return a << c | b >> 32 - c;
}

does what's intended, but gcc now puts a je. clang on the other hand is smart enough to translate it to a single shld instruction.

Is there any way to implement it correctly and efficiently without inline assembly?

And why is gcc trying so much not to put shld? The shld_safe attempt is translated by gcc 11.2 -O3 as (Godbolt):

shld_safe:
        mov     eax, edi
        test    edx, edx
        je      .L1
        mov     ecx, 32
        sub     ecx, edx
        shr     esi, cl
        mov     ecx, edx
        sal     eax, cl
        or      eax, esi
.L1:
        ret

while clang does,

shld_safe:
        mov     ecx, edx
        mov     eax, edi
        shld    eax, esi, cl
        ret
Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
xiver77
  • 2,162
  • 1
  • 2
  • 12
  • 1
    Did you try using `uint64_t`? Compilers often use SHLD + SHL to implement shifts on that, (if compiling for 32-bit mode), know how to get the high / low halves in/out without actually doing any work. Otherwise in 64-bit code you'll probably get actual shift/or. (It normally only makes sense to use variable-count `shld` with max operand-size, so `unsigned __int128` for 64-bit code). – Peter Cordes Jan 10 '22 at 21:38
  • `>> ((32 - c)%32)` ? That's valid C++ and has the same meaning, right? – MSalters Jan 10 '22 at 21:42
  • @MSalters: Yeah, that was my first thought to make it well-defined on 0, like in [Best practices for circular shift (rotate) operations in C++](https://stackoverflow.com/q/776508). But that would make both shift counts 0, giving us `a|b` instead of `a`. (We want to shift no bits from b into a in that case.) It only happens to work because it apparently compiles to something that shifts out all the bits for `c=0`. It certainly would not work for runtime-variable counts, the way GCC -O3 compiles it with separate shifts / OR. (clang compiles `shld_UB_on_0` to an `shld` instruciton) – Peter Cordes Jan 10 '22 at 21:47
  • BTW, all x86 scalar shifts take their count modulo 32, or modulo 64 for 64-bit operand-size. x86 SIMD shifts like `psllq` saturate their count. – Peter Cordes Jan 10 '22 at 21:50
  • 3
    This has been filed against GCC as . Oh hey, I have seen the reporter’s name somewhere. – user3840170 Jan 10 '22 at 21:53
  • 1
    I tried my `uint64_t` idea; neither GCC nor clang use shld: https://godbolt.org/z/s5xeYTn8d. clang does for `uint64_t` inputs using qword operand-size `shld` on a `unsigned __int128`. – Peter Cordes Jan 10 '22 at 21:55
  • 1
    I wonder if this in intentional. [Seems](https://reviews.llvm.org/D2177) that once upon a time there were perf issues with SHLD. – David Wohlferd Jan 10 '22 at 22:05
  • @DavidWohlferd: Indeed, variable-count `shld r32, r32, cl` is still 4 uops on SKL, down to 3 uops on ICL. And 2 uops on Alder Lake E-cores (Gracemont). But 7 uops on Zen 1/2, down to 5 uops on Zen3. (https://uops.info/) OTOH, `shl reg, cl` is 3 uops on Intel SnB-family, so avoiding SHLD is probably not a good bet for tune=generic, although could well be for tune=znver1. If BMI2 is available for SHLX, that makes the non-shld path more attractive, especially if we could do something clever like zero-extend one input to 64-bit so a shift count of 32 can shift out all the bits. – Peter Cordes Jan 11 '22 at 02:14
  • Doh! I shoulda read the bug report. – David Wohlferd Jan 11 '22 at 02:54
  • If you want a specific instruction you should use inline assembler, modifying your c code until you get the instruction you want is brittle at best. The translation is completely up to the compiler and there are no guarantees that different versions of the compiler will generate the same instructions for a given construct. – Johan Jan 12 '22 at 10:31
  • 2
    @Johan it’s not about wanting a particular instruction, it’s about wanting good code generation. – Sneftel Jan 12 '22 at 10:40
  • see https://stackoverflow.com/questions/39281925/making-g-use-shld-shrd-instructions – Solomon Ucko Jan 29 '22 at 21:57
  • see https://github.com/gcc-mirror/gcc/blob/master/gcc/config/i386/i386-expand.cc#L5936-L6238 – Solomon Ucko Jan 30 '22 at 00:00
  • some stuff with `__int128` generates `shld`/`shrd` – Solomon Ucko Jan 30 '22 at 00:06

1 Answers1

1

As far as I have tested with gcc 9.3 (x86-64), it translates the following code to shldq and shrdq.

uint64_t shldq_x64(uint64_t low, uint64_t high, uint64_t count) {
  return (uint64_t)(((((unsigned __int128)high << 64) | (unsigned __int128)low) << (count & 63)) >> 64);
}

uint64_t shrdq_x64(uint64_t low, uint64_t high, uint64_t count) {
  return (uint64_t)((((unsigned __int128)high << 64) | (unsigned __int128)low) >> (count & 63));
}

Also, gcc -m32 -O3 translates the following code to shld and shrd. (I have not tested with gcc (i386), though.)

uint32_t shld_x86(uint32_t low, uint32_t high, uint32_t count) {
  return (uint32_t)(((((uint64_t)high << 32) | (uint64_t)low) << (count & 31)) >> 32);
}

uint32_t shrd_x86(uint32_t low, uint32_t high, uint32_t count) {
  return (uint32_t)((((uint64_t)high << 32) | (uint64_t)low) >> (count & 31));
}

(I have just read the gcc code and written the above functions, i.e. I'm not sure they are your expected ones.)

  • Looks good, compiles well with GCC and clang: https://godbolt.org/z/z6evjrE3d. (Except for the 32-bit ones with GCC11.2 -O3; those are insane vs. clang compiling to just shl/rd.) "i386 GCC" just means the default is `-m32`; there isn't a real difference. Ironically, the 32-bit functions compile to many more instructions for a 64-bit target, instead of using 32-bit operand-size `shld`. Clang with `-fsanitize=undefined` doesn't add any extra instructions, so that's a good sign that this doesn't accidentally depend on any UB; it does mask the shift count and use unsigned shifts. – Peter Cordes Apr 09 '22 at 04:13