Instruction/intrinsic for taking higher half of uint64_t in C++?

Question

Imagine following code:

uint64_t x = 0x81C6E3292A71F955ULL;
uint32_t y = (uint32_t) (x >> 32);

y receives higher 32-bit part of 64-bit integer. My question is whether there exists any intrinsic function or any CPU instruction that does this in single operation without doing move and shift?

At least CLang (linked in Try-it-online above) creates two instruction mov rax, rdi and shr rax, 32 for this, so either CLang doesn't do such optimization, or there exists no such special instruction.

Would be great if there existed imaginary single instruction like movhi dst_reg, src_reg.

Why wihtout shift? Why do you think some intrinsic will be better than shift? Shift does exactly what you want, so why there should be any special instruction? — Daniel Langr, May 19 '21 at 03:53
I see no reason to further optimize. It only takes 1 instruction even now (Not counting moving `rdi` into `rax`). Once it's shifted, the value is now in `eax`, the lower 32 bits of `rax` which is automatically the return value. — mediocrevegetable1, May 19 '21 at 03:54
@DanielLangr One reason could be due to pipelining. For example ALU is fully occupied, doing several instructions in parallel. And doing shift now is not possible. But some imaginary operation like moving higher part to lower part of register may be always available. — Arty, May 19 '21 at 03:56
If there's some secret instruction to optimize a right shift by a constant 32, I'd expect the compiler to know and apply it. — Mark Ransom, May 19 '21 at 03:58
Since you tagged your question with C, you can use union-based type punning that is legal in C99, `union { uint64_t x; uint32_t y[2]; }`. — 273K, May 19 '21 at 04:02
@S.M. I guess this union-solution will just read 4 bytes directly from memory location of 8-byte value? So I guess union-solution will always force to use memory and will deny to use register? — Arty, May 19 '21 at 04:05
" any CPU instruction that does this operation without doing shift?" --> C does not specify CPU instructions — chux - Reinstate Monica, May 19 '21 at 04:07
@chux-ReinstateMonica By tagging C/C++ I meant that I want to find some STD header like `intrin.h` or `immintrin.h` and corresponding C/C++ function inside it. Of cause C/C++ language itself doesn't specify CPU instructions, but I was interested in outer-header C/C++ functions for doing this. — Arty, May 19 '21 at 04:09
@Arty using a union like that may not involve memory at all, because compilers are already [storing small structs in registers](https://stackoverflow.com/q/42411819/995714). This is likely premature optimization, and I don't think any architectures have such instructions. Some 16-bit machines have swap instruction to swap the 2 bytes that can in some cases uses to implement this — phuclv, May 19 '21 at 04:12
Recommend for reading [Why can I access lower dword/word/byte in a register but not higher?](https://stackoverflow.com/questions/45500399/why-can-i-access-lower-dword-word-byte-in-a-register-but-not-higher). — 273K, May 19 '21 at 04:13
Is a right-shift really the performance bottleneck in your program? — Jeremy Friesner, May 19 '21 at 05:06
@JeremyFriesner I'm writing a special inline function that does 3-4 instructions in total, one of instructions is taking high half of 64-bit. And this inlined function will be used billions of time, it is like main computational function in my C++ program, which will do computation for many seconds. So if most of program's time is occupied by my 4 instructions then I wanted at least to have most optimal variant of this 4 instructions. Just a notice - I can't influence on other program and can't optimize it, also I can't change interface of my function, it is like a library function for others. — Arty, May 19 '21 at 05:39
@DanielLangr Just shift instruction is not good enough for the case when I need to do not inplace shift but to copy high part to other register. For copy variant shift-solution needs two instructions mov+shr. Would be great if there existed single instruction like `movhi dst_reg, src_reg` or at least `shr dst_arg, src_arg, 32`. — Arty, May 19 '21 at 06:29
Arty you should measure your program’s speed as-is, vs how fast it would be if you only read the data once, then increment each value and write it out again and did nothing else. Last time I did that I discovered that on my hardware it almost didn’t matter what cpu instructions I did or didn’t perform; the timing was roughly the same. That was because performance was being gated by RAM access time in all cases (ie it didn’t matter much how efficient the CPU’s math operations were, because it was spending 99.9% of its time waiting for DRAM reads and/or writes to complete anyway) — Jeremy Friesner, May 19 '21 at 09:16
@JeremyFriesner Thanks for reply! Currently users of my library function use it without much memory interaction, mostly time is spent only executing this mathematical code with few other bits transforms. So if I can't modify user's code then I want to make my code at least one cycle or one instruction shorter if possible. Of cause finally I have to measure all my improvements if they really give speedup (and not slowdown) on real use-case. — Arty, May 19 '21 at 09:21

Peter Cordes · Accepted Answer · 2021-05-19T06:15:09.757

If there was a better way to do this bitfield-extraction for an arbitrary uint64_t, compilers would already use it. (At least in theory; compilers do have missed optimizations, and their choices sometimes favour latency even if it costs more uops.)

You only need intrinsics for things that you can't express efficiently in pure C, in ways the compiler can already easily understand. (Or if your compiler is dumb and can't spot the obvious.)

You could maybe imagine cases where the input value comes from the multiply of two 32-bit values, then it might be worthwhile on some CPUs for the compiler to use widening mul r32 to already generate the result in two separate 32-bit registers, instead of imul r64, r64 + shr reg,32, if it can easily use EAX/EDX. But other than gcc -mtune=silvermont or other tuning options, you can't make the compiler do it that way.

shr reg, 32 has 1 cycle latency, and can run on more than 1 execution port on most modern x86 microarchitectures (https://uops.info/). The only thing one might wish for is that it could put the result in a different register, without overwriting the input.

Most modern non-x86 ISAs are RISC-like with 3-operand instructions, so a shift instruction can copy-and-shift, unlike x86 shifts where the compiler needs a mov in addition to shr if it also needs the original 64-bit value later, or (in the case of a tiny function) needs the return value in a different register.

And some ISAs have bitfield-extract instructions. PowerPC even has a fun rotate-and-mask instruction (rlwinm) (with the mask being a bit-range specified by immediates), and it's a different instruction from a normal shift. Compilers will use it as appropriate - no need for an intrinsic. https://devblogs.microsoft.com/oldnewthing/20180810-00/?p=99465

x86 with BMI2 has rorx rax, rdi, 32 to copy-and-rotate, instead of being stuck shifting within the same register. A function returning uint32_t could/should use that instead of mov+shr, in the stand-alone version that doesn't inline because the caller already has to ignore high garbage in RAX. (Both x86-64 System V and Windows x64 define the return value as only the register width matching the C type of the arg; e.g. returning uint32_t means that the high 32 bits of RAX are not part of the return value, and can hold anything. Usually they're zero because writing a 32-bit register implicitly zero-extends to 64, but something like return bar() where bar returns uint64_t can just leave RAX untouched without having to truncate it; in fact an optimized tailcall is possible.)

There's no intrinsic for rorx; compilers are just supposed to know when to use it. (But gcc/clang -O3 -march=haswell miss this optimization.) https://godbolt.org/z/ozjhcc8Te

If a compiler was doing this in a loop, it could have 32 in a register for shrx reg,reg,reg as a copy-and-shift. Or more silly, it could use pext with 0xffffffffULL << 32 as the mask. But that's strictly worse that shrx because of the higher latency.

AMD TBM (Bulldozer-family only, not Zen) had an immediate form of bextr (bitfield-extract), and it ran efficiently as 1 uop (https://agner.org/optimize/). https://godbolt.org/z/bn3rfxzch shows gcc11 -O3 -march=bdver4 (Excavator) uses bextr rax, rdi, 0x2020, while clang misses that optimization. gcc -march=znver1 uses mov + shr because Zen dropped Trailing Bit Manipulation along with the XOP extension.

Standard BMI1 bextr needs position/len in a register, and on Intel CPUs is 2 uops so it's garbage for this. It does have an intrinsic, but I recommend not using it. mov+shr is faster on Intel CPUs.

Can you please tell if many (especially old) CPUs have this BMI2 set supported? Approximately how many percents of all world CPUs have BMI2? — Arty, May 19 '21 at 06:32
Also for me it is strange that there exists no intrinsic for `rorx`. I thought that there exist intrinsic functions for all imaginary instructions, just for convenience, if I don't want to rely on compiler guess but to use specific instruction explicitly. — Arty, May 19 '21 at 06:47
@Arty: BMI2 was new in Haswell for Intel, and some similar generation for AMD. https://en.wikipedia.org/wiki/Bit_manipulation_instruction_set. Cloud servers will have it, but there are lots of home CPUs that don't. Some compilers have intrinsics for rotates in general ([Best practices for circular shift (rotate) operations in C++](//stackoverflow.com/q/776508)), but the only difference between RORX and ROR is not setting FLAGS, and having a separate destination. Register allocation and managing FLAGS are totally up to the compiler so it wouldn't make sense to force it to use RORX over ROR. — Peter Cordes, May 19 '21 at 06:51
"Most modern non-x86 ISAs are RISC-like with 3-operand instructions" --> Does that consider most processors these days are the billions per year of embedded ones? — chux - Reinstate Monica, May 19 '21 at 13:27
@chux-ReinstateMonica: not really. I had in mind ISAs that get any use for high-performance computing, like AArch64 and PowerPC64. If you have a 32-bit ISA like most embedded use, the halves of a `uint64_t` are already separate. Or for 16 bit halves of a uint32_t for ARM, then even in thumb mode, `lsr dst, src, 16` is I think encodeable as a single thumb instruction. Even lower-ended embedded stuff without a barrel shifter is a different beast, but often has some kind of swap instruction instead of needing a shift, which a compiler can use to implement `x >> 8` or `x >> 16` — Peter Cordes, May 19 '21 at 14:27

Instruction/intrinsic for taking higher half of uint64_t in C++?

1 Answers1