How to write inline assembly to bit rotate

Question

I was reading gcc's guide on extended ASM and I'm running into a problem where the compiler isn't interpreting the assembly the way I thought it would. I thought I'd try it with a bit rotate instruction since those aren't readily available in C.

Here's my C function:

int rotate_right(int num,int count) {
    asm (
        "rcr %[value],%[count]"
        : [value] "=r" (num)
        : [count] "r" (count)
        );

    return num;
}

And the compiled output using x86-64 gcc (trunk) -O0:

        push    rbp
        mov     rbp, rsp
        mov     DWORD PTR [rbp-4], edi
        mov     DWORD PTR [rbp-8], esi
        mov     eax, DWORD PTR [rbp-8]
        rcr eax,eax
        mov     DWORD PTR [rbp-4], eax
        mov     eax, DWORD PTR [rbp-4]
        pop     rbp
        ret

The problem I'm having is that GCC is taking my inline assembly to mean "rotate EAX by EAX" rather than by the count parameter I intended. This is what I expected to get:

        push    rbp
        mov     rbp, rsp
        mov     DWORD PTR [rbp-4], edi
        mov     DWORD PTR [rbp-8], esi
        mov     eax, DWORD PTR [rbp-8]
        mov     ecx, DWORD PTR [rbp-4]
        rcr     eax,ecx
        pop     rbp
        ret

Note that while the language might not natively support rotate your compiler very likely offers it as an intrinsic. With gcc check if your installation has a "ia32intrin.h" and the __rolb, __rolw, __rold, __rolq and __rorb, __rorw, __rord, __rorq functions. MSVC offers similar functionality, just using different headers and function names. — SoronelHaetir, Jul 01 '22 at 17:52
Did you actually want 33-bit rotate-with-carry instead of a normal 32-bit rotate? https://www.felixcloutier.com/x86/rcl:rcr:rol:ror. You haven't specified anything that would produce a carry input even in your "expected" version. But plain ROL / ROR are a solved problem without inline asm: [Best practices for circular shift (rotate) operations in C++](https://stackoverflow.com/q/776508) — Peter Cordes, Jul 02 '22 at 01:57
@PeterCordes I wanted the normal 32-bit one. I'm so used to the Z80 mnemonics ```RRA``` for 9-bit rotate-with-carry and ```RRCA``` for 8-bit rotate that when I go to most other CPUs I can't remember which one is which. — puppydrum64, Jul 17 '22 at 17:18
Are you sure that's right? A recent retrocomputing Q&A about 8080's backward mnemonic convention ([Why are the Intel 8080's rotate instructions called opposite to intuition?](https://retrocomputing.stackexchange.com/a/24786)) has an answer saying RRA was a normal 8-bit rotate-right, while the mnemonic including a C (RRCA) was the one rotating through carry. — Peter Cordes, Jul 17 '22 at 17:31

score 3 · Answer 1 · answered Jul 01 '22 at 18:27

First, let me solve your problem as stated in the title.

static inline int ror(int num, int count) {
  __asm__ ("ror\t%0, %b1" : "+r"(num) : "c"(count));
  return num;
}

ror(int, int):
        mov     eax, edi
        mov     ecx, esi
        ror     eax, cl
        ret

This is how you do it; remember to compile with -masm=intel. I'll explain some details below, but the short version is that you have to read the GCC docs carefully.

Quoting the OP,

I really find gcc's inline asm syntax much worse than Visual Studio's. It's almost as if GCC is trying to discourage users from using assembly...

It's worse in the sense that it takes more time to learn, but once you know the details, it's a powerful tool for low-level programming and optimization.

One case where I use inline assembly in an actual program is for the rcpss instruction. There is an Intel intrinsic for it, but the current version of GCC (12.1) produces quite horrible code when you use it for a single float.

static inline float float_recip(float x) {
  if (__builtin_constant_p(x)) {
    return 1 / x;
  }
  __asm__ ("rcpss\t%0, %0" : "+x"(x));
  return x;
}

This is the actual code. __builtin_constant_p lets the compiler fold the call to a constant when the value of x is known at compile time. I intentionally used the same register for both operands to avoid the false dependency problem.

See how the assembly is generated when it's called somewhere.

float f(float x) {
  return float_recip(x) + float_recip(2);
}

f(float):
        rcpss   xmm0, xmm0
        addss   xmm0, DWORD PTR .LC0[rip]
        ret
.LC0:
        .long   1056964608

You can see float_recip(2) is replaced with a 0.5f constant, and all the unnecessary copies are gone.

You cannot do this with MSVC inline assembly, which is not even supported for 64-bit targets.

score 2 · Accepted Answer · fuz · 2022-07-01T18:48:45.523

Use a +r constraint for num to indicate that num is read as well as written. Otherwise gcc will assume that the previous value of num doesn't matter and will just pick an unused register to dump the output into.

You'll also have to use a c constraint for count as the shift amount must be in cl for the ror instruction. Refer to the other answer for a more detailed explanation.
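As a minimal sketch, here are the two constraint fixes applied to the question's function, keeping its named operands. A few things here are assumptions rather than part of the original code: the operands are uint32_t instead of int, the template uses GCC's {att|intel} dialect alternatives so that it assembles with or without -masm=intel, and a plain C fallback is included for non-x86 targets.

```c
#include <stdint.h>

/* Rotate a 32-bit value right by count bits.
 * "+r" marks value as both read and written; "c" forces count into
 * ecx so that the %b modifier can name its low byte, cl. */
static inline uint32_t rotate_right(uint32_t value, uint32_t count) {
#if defined(__i386__) || defined(__x86_64__)
    /* AT&T dialect emits:  rorl %cl, %eax
     * Intel dialect emits: ror  eax, cl   */
    __asm__ ("ror{l %b[count], %[value]| %[value], %b[count]}"
             : [value] "+r" (value)
             : [count] "c" (count)
             : "cc");
    return value;
#else
    /* Portable fallback (assumption: only used on non-x86 targets). */
    return (value >> (count & 31)) | (value << (-count & 31));
#endif
}
```

The hardware masks the count in cl to 5 bits for a 32-bit operand, so a count of 32 leaves the value unchanged.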

Before doing any inline assembly programming, read the manual carefully! It is somewhat tricky to get right and there are many subtle details to pay attention to.

Also note that even if the inline assembly seems to work, it may still be incorrect, e.g. due to missing clobbers that just so happen to not affect anything relevant with this particular compiler version at this particular optimisation level for this particular version of the code. So be extra careful and try to avoid using it if possible.

For example in your case, you can just use the standard C rotation idiom. The compiler will pick it up as long as optimisations are enabled:

#include <limits.h>

int rotate_right(int num,int count) {
    return ((unsigned)num >> count | num << CHAR_BIT * sizeof num - count);
}
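The idiom above leaves some counts undefined, 0 and 32 among them. A variant that masks the count on an unsigned type is defined for every count. The sketch below is one common way to write it, not the code from this answer; the name rotr32 and the uint32_t types are assumptions made for illustration, and whether a given compiler reduces it to a single ror should be checked in its output.

```c
#include <limits.h>
#include <stdint.h>

/* Rotate a 32-bit value right; well defined for every count,
 * including 0 and counts of 32 or more (the count is taken modulo 32). */
static inline uint32_t rotr32(uint32_t value, unsigned int count) {
    const unsigned int mask = CHAR_BIT * sizeof value - 1;  /* 31 */
    count &= mask;
    /* -count is computed in unsigned arithmetic, so (-count & mask)
     * equals (32 - count) % 32 and never reaches the type width. */
    return (value >> count) | (value << (-count & mask));
}
```

Working on an unsigned type throughout also avoids shifting a negative signed value left.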

edited Jul 01 '22 at 18:48

answered Jul 01 '22 at 17:32

fuz


I really find gcc's inline asm syntax much worse than Visual Studio's. It's almost as if GCC is trying to discourage users from using assembly... ;) – puppydrum64 Jul 01 '22 at 17:59
Why cast `num` in one place and not all places? – chux - Reinstate Monica Jul 01 '22 at 18:07
Worth noting that unless count is in the `[1... CHAR_BIT * sizeof num)` range, code is UB. – chux - Reinstate Monica Jul 01 '22 at 18:08
@chux-ReinstateMonica Because it doesn't make a difference in the other case. As for the undefined behaviour, that's technically true but well known not to be a problem in practice in this particular case. – fuz Jul 01 '22 at 18:47
@puppydrum64 The main difference is that MSVC-style inline assembly was designed for convenience and for being able to write long code sequences with it. Back in the day, users would possibly not have an assembler in their toolchain, so inline assembly had to be heavily used whenever you needed any sort of assembly. Gcc's style on the other hand is designed to patch in individual instructions the compiler doesn't know about and requires you to specify all side effects the instruction has so the optimiser can do its job. It's meant for performance, nothing else. – fuz Jul 01 '22 at 18:50
Hmmm, in the case where `count==CHAR_BIT * sizeof num-1`, "it doesn't make a difference in the other case." --> `num << 31` unnecessarily incurs UB when `(unsigned) num << 31` does not. – chux - Reinstate Monica Jul 01 '22 at 19:08
@chux-ReinstateMonica Undefined behaviour remains with `count == 32`, so no need to account for the `count == 31` case separately unless you also fix the other one. – fuz Jul 01 '22 at 19:30
@puppydrum64 and fuz: for C that compiles efficiently to `ror` / `rol` with well-defined behaviour for every count, see [Best practices for circular shift (rotate) operations in C++](https://stackoverflow.com/q/776508). This is a solved problem without using inline asm, although it is somewhat tricky to avoid UB and still write something that compilers can recognize as not needing any extra instructions. – Peter Cordes Jul 02 '22 at 01:55
@PeterCordes Thank you. I can't help but wonder though, why ```ror``` and ```rol``` haven't made it to a standard library yet. – puppydrum64 Jul 17 '22 at 17:17
@puppydrum64: C++20's `<bit>` header has `std::rotr` and `std::rotl`. https://en.cppreference.com/w/cpp/numeric/rotl. It allows negative rotate counts, but apparently that Just Works on 2's complement machines. https://godbolt.org/z/7z51dxccv shows GCC compiling it to a single `ror` instruction. – Peter Cordes Jul 17 '22 at 17:23
