Inline assembly size mismatch for 8-bit rotate

Question

I am trying to write the rotate left operation in C using inline assembly, like so:

byte rotate_left(byte a) {
    __asm__("rol %0, $1": "=a" (a) : "a" (a));
    return a;
}

(where byte is a typedef for unsigned char).

This raises the error

/tmp/ccKYcEHR.s:363: Error: operand size mismatch for `rol'.

What is the problem here?

With AT&T syntax src and destination are reversed. Maybe you meant `"rol $1, %0"` — Michael Petch, Dec 27 '17 at 21:02
https://godbolt.org/g/z6Qof7 there's no need of inline assembly for this (at least for gcc and clang) — Matteo Italia, Dec 27 '17 at 21:07

score 3 · Accepted Answer · answered Dec 27 '17 at 21:03

AT&T syntax uses the opposite order from Intel syntax. The rotate count has to be first, not last: rol $1, %0.
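If you do want to keep the inline asm, a minimal sketch of the corrected statement could look like the following. The operand order is the fix described above; the explicit b suffix, the "+q" read-write constraint, the "cc" clobber and the non-x86 fallback are my own assumptions for illustration, not something the question used.

```c
typedef unsigned char byte;

byte rotate_left(byte a) {
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    /* AT&T order: count first, destination last.
       "+q" (assumed here) makes a a read-write operand in a
       byte-addressable register, so %0 expands to e.g. %al. */
    __asm__("rolb $1, %0" : "+q"(a) : : "cc");
#else
    /* Fallback for other targets so the sketch still builds. */
    a = (byte)((a << 1) | (a >> 7));
#endif
    return a;
}
```

For example, rotate_left(0x81) moves the top bit around to the bottom and yields 0x03.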

Also, you don't need and shouldn't use inline asm for this: https://gcc.gnu.org/wiki/DontUseInlineAsm

As described in Best practices for circular shift (rotate) operations in C++, GNU C has intrinsics for narrow rotates because the rotate-idiom recognition code fails to optimize away an and of the rotate count. x86 shifts and rotates mask the count with count & 31 even for 8-bit and 16-bit operands. For rotates that masking is harmless because they wrap around anyway, but it does matter for shifts.
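For reference, the rotate idiom that compilers try to recognize is usually written along these lines in portable C. This is only a sketch: the name rotl8 is hypothetical, and whether a given compiler turns it into a single rol without an extra and depends on the compiler and version.

```c
#include <stdint.h>

/* Rotate an 8-bit value left by count bits; the count is taken modulo 8. */
static inline uint8_t rotl8(uint8_t x, unsigned count) {
    count &= 7;  /* keep the count in 0..7 so neither shift is out of range */
    /* x is promoted to int, so the left shift cannot lose bits before
       the final cast truncates the result back to 8 bits. */
    return (uint8_t)((x << count) | (x >> ((8 - count) & 7)));
}
```

The masking of the count is what makes a rotate by 0 or by 8 well defined here, and it is also the and that the idiom recognizer has to prove redundant.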

gcc has a builtin function for narrow rotates that avoids any overhead. There's a __rolb wrapper for it in x86intrin.h, whereas MSVC provides its own _rotl8 / _rotr8 and so on in intrin.h. clang supports neither the __builtin nor the x86intrin.h wrappers for rotates, but gcc and ICC do.

#include <stdint.h>
uint8_t rotate_left_byte_by1(uint8_t a) {
    return __builtin_ia32_rolqi(a, 1);  // qi = quarter-integer
}

I used uint8_t from stdint.h like a normal person instead of defining a byte type.

This doesn't compile at all with clang, but it compiles as you'd hope with gcc7.2:

rotate_left_byte_by1:
    movl    %edi, %eax
    rolb    %al
    ret

This gives you a function that compiles as efficiently as your inline asm ever could but, unlike inline asm, can be optimized away completely for compile-time constants, because the compiler knows what it does and can optimize accordingly.

As said above, gcc [does seem to recognize the naive rotate by one](https://godbolt.org/g/z6Qof7), so it doesn't even seem necessary to use any intrinsic; it's curious however that both VC++ and icc fail to recognize it. — Matteo Italia, Dec 27 '17 at 21:14
@MatteoItalia: It recognizes it, but it's hard to get it to emit just `rolb` without an `and` to mask a runtime-variable count. And perhaps the builtins were added before the idiom recognizer could handle byte and 16-bit rotates; I didn't check old gcc versions. — Peter Cordes, Dec 27 '17 at 21:16
Update: VC++ [does recognize it](https://godbolt.org/g/fSn6YK) if I liberally add some casts to `uint8_t` (even just one on the left shift seems to work fine). — Matteo Italia, Dec 27 '17 at 21:16
Uh yes, a runtime-known shift is more tricky; I was just testing the "rotate by fixed one" because that's what OP wrote in his question. — Matteo Italia, Dec 27 '17 at 21:17
Still, 7.2 handles it just fine https://godbolt.org/g/9ZpN1d - no `and` in sight. From a quick binary search, it seems it started to handle it smartly from 4.9.0; 4.8.5 still generated horrible code for that. — Matteo Italia, Dec 27 '17 at 21:18
