AT&T syntax uses the opposite operand order from Intel syntax. The rotate count has to be first, not last: rol $1, %0.
Also, you don't need and shouldn't use inline asm for this: https://gcc.gnu.org/wiki/DontUseInlineAsm
As described in Best practices for circular shift (rotate) operations in C++, GNU C has intrinsics for narrow rotates, because the rotate-idiom recognition code fails to optimize away an AND of the rotate count. x86 shifts/rotates mask the count with count & 31 even for 8-bit and 16-bit operand sizes, but rotates still wrap around, so an explicit AND of the count is redundant for them. It does matter for shifts though.
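For reference, the portable rotate idiom that the compiler tries to recognize looks roughly like the sketch below. The name rotl8 is made up for illustration, and whether a given compiler turns this into a single rolb without a leftover AND is exactly the thing that isn't guaranteed.

```c
#include <stdint.h>

/* Portable 8-bit rotate-left (hypothetical helper name).
 * The count is masked to 0..7 so neither shift is out of range,
 * even though x is promoted to int before shifting. */
static inline uint8_t rotl8(uint8_t x, unsigned count) {
    count &= 7;
    /* (8 - count) & 7 maps count == 0 to a right shift by 0 */
    return (uint8_t)((x << count) | (x >> ((8 - count) & 7)));
}
```

The masking is what makes this well-defined C for any count; it's also the AND that the idiom recognition may fail to remove.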
Anyway, gcc has a builtin function for narrow rotates to avoid any overhead. There's a __rolb wrapper for it in x86intrin.h, but MSVC uses its own _rotr8 and so on from its intrin.h. clang doesn't support either the __builtin or the x86intrin.h wrappers for rotates, but gcc and ICC do.
#include <stdint.h>

uint8_t rotate_left_byte_by1(uint8_t a) {
    return __builtin_ia32_rolqi(a, 1);  // qi = quarter-integer
}
I used uint8_t from stdint.h like a normal person instead of defining a byte type.
This doesn't compile at all with clang, but it compiles as you'd hope with gcc7.2:
rotate_left_byte_by1:
    movl    %edi, %eax
    rolb    %al
    ret
This gives you a function that compiles as efficiently as your inline asm ever could, but which can optimize away completely for compile-time constants, and the compiler knows how it works / what it does and can optimize accordingly.
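If the same source also has to build with clang or MSVC, one option is a small wrapper that picks whatever the compiler offers and falls back to plain C otherwise. This is only a sketch: the name rotl8_any and the preprocessor conditions are my assumptions, not something from a standard header, so check them against the compilers you actually target.

```c
#include <stdint.h>
#if defined(_MSC_VER)
#include <intrin.h>   /* MSVC's _rotl8 */
#endif

/* Hypothetical wrapper: rotate an 8-bit value left by count (mod 8). */
static inline uint8_t rotl8_any(uint8_t x, unsigned count) {
    count &= 7;  /* keep the count in range for every path below */
#if defined(_MSC_VER)
    return _rotl8(x, (unsigned char)count);
#elif defined(__GNUC__) && !defined(__clang__) && \
      (defined(__i386__) || defined(__x86_64__))
    /* gcc's x86 builtin for 8-bit rotate-left */
    return __builtin_ia32_rolqi(x, (int)count);
#else
    /* plain C fallback: the usual rotate idiom */
    return (uint8_t)((x << count) | (x >> ((8 - count) & 7)));
#endif
}
```

With a compile-time-constant argument, any of the three paths can still fold away completely, which is the main thing inline asm can't give you.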