How to write rotate code in C to compile into the `ror` x86 instruction?

Question

I have some code that rotates my data. I know GAS syntax has a single assembly instruction that can rotate an entire byte. However, when I try to follow any of the advice on Best practices for circular shift (rotate) operations in C++, my C code compiles into at least 5 instructions, which use up three registers-- even when compiling with -O3. Maybe those are best practices in C++, and not in C?

In either case, how can I force C to use the ROR x86 instruction to rotate my data?

The precise line of code which is not getting compiled to the rotate instruction is:

value = (((y & mask) << 1 ) | (y >> (size-1))) //rotate y right 1
       ^ (((z & mask) << n ) | (z >> (size-n))) // rotate z left by n
// size can be 64 or 32, depending on whether we are rotating a long or an int, and 
// mask would be 0xff or 0xffffffff, accordingly

I do not mind using __asm__ __volatile__ to do this rotate, if that's what I must do. But I don't know how to do so correctly.

If you are *using* the result (output) of an `__asm__` statement, you do not need volatile. In fact, it may hurt program optimization. — Brett Hale, Jul 14 '17 at 17:44
There is no such thing as a "GAS processor". GAS is the name of a multiplatform _assembler_. — zwol, Jul 14 '17 at 17:46
@BrettHale That's why I asked this question-- I am not exactly sure what the best combination of those would be, and how they work. — Alex, Jul 14 '17 at 17:46
@zwol Thanks-- I didn't know. I just assumed Intel processors use Intel syntax, and GAS processors use GAS syntax. I'm just a beginner in this, sorry. — Alex, Jul 14 '17 at 17:48
The typical idiom for a 32 bit rotate is `#define ROR(x,y) ((unsigned)(x) >> (y) | (unsigned)(x) << 32 - (y))`. Is this what you use? Please show us your code. — fuz, Jul 14 '17 at 17:49
@Alex Apparently you don't or it would work. Please show us the actual code you wrote that exhibits this problem. — fuz, Jul 14 '17 at 17:49
@Alex Yeah, it can get confusing. The thing you want to say here is "I know _x86 processors_ have a single assembly instruction that can rotate..." It's true that there are two different syntaxes for the assembly language for x86 processors, "Intel syntax" and "AT&T syntax", but both can express all of the instructions supported by the processor. "GAS syntax" is a bad term to use because GAS, the program, supports _both_ AT&T and Intel syntaxes when "targeting" the x86. (It happens to _default_ to the AT&T syntax, but it was AT&T who invented that syntax, not GNU.) — zwol, Jul 14 '17 at 17:51
@fuz This typical idiom has a problem for rotating by 0, which the programmer might expect to work, but which invokes UB with this definition. Various ways exist to fix this (`(32 - y) % 32` is constant-time, for instance), but then compilers are even less likely to recognize a pattern that becomes quite convoluted. — Pascal Cuoq, Jul 14 '17 at 17:52
@Alex That's the macro I just wrote. The chance that this is the code you tested is about 0%. Please post your actual code. — fuz, Jul 14 '17 at 17:52
GCC and Clang still recognize it if you write `#define ROR(x,y) ((uint32_t)(x) >> (y) | (uint32_t)(x) << ((32 - (y)) & 31))` which is safe (I think?) — harold, Jul 14 '17 at 17:58
@Alex We need to see enough of your code that we can run it through a compiler ourselves and confirm that we're getting the same thing you are. That's not enough. — zwol, Jul 14 '17 at 17:59
@zwol OK, maybe I will delete this question then, and once I pull out enough of my code from my project to create a minimal example, then I will ask again. Thanks for your help. — Alex, Jul 14 '17 at 18:00
Remove `& mask` and watch out for that `>> size` which makes no sense. — harold, Jul 14 '17 at 18:01
@zwol-- or not. It's not allowing me to delete it. In any case, maybe I can ask a new question later, if harold's suggestion does not work. — Alex, Jul 14 '17 at 18:03
@Alex If you unaccept my answer you should be able to delete the question. — zwol, Jul 14 '17 at 18:03
@zwol Apparently not, now that it's upvoted. But anyways, thanks for your help. I just started asking these programming questions, and it's useful to try to see how I need to change my question-asking to make my questions better. — Alex, Jul 14 '17 at 18:06
@Alex [this works](https://godbolt.org/g/ATytC7), in the sense that it compiles to rotates. Of course I'm not sure whether it does what you want, it's not really clear what you want. — harold, Jul 14 '17 at 18:07
@Alex That is still not the full code. Please post the full code. So much that I can compile it and see what the compiler generates. With what you posted, I still have to fill in details with values where the choice is not obvious. — fuz, Jul 14 '17 at 20:29
How old is your compiler? As I noted in the linked question, the UB-safe variable-count rotate idiom (with extra `&` masking of the count) confuses old compilers, like gcc before 4.9. Also, your big expression is maybe confusing the compiler. Write an inline function for rotate, and call it, like `value = rotr32(y&mask, 1) ^ rotr32(z&mask, n);` Much more readable. — Peter Cordes, Jul 14 '17 at 23:26
Also, my answer on the linked question clearly says that it's the best practice for C as well as C++. Put `-xc` in the compiler options in godbolt link. Here's a version where I did that: https://godbolt.org/g/xn27eX. Like the original linked from the best-practices answer, it has a version that uses x86 intrinsics if available. (clang doesn't seem to provide any in `x86intrin.h`, but other compilers have `_rotr` for 32-bit rotates.) — Peter Cordes, Jul 15 '17 at 00:10

score 8 · Answer 1 · answered Jul 14 '17 at 17:55

Your macro compiles to a single ror instruction for me... specifically, I compiled this test file:

#define ROR(x,y) ((unsigned)(x) >> (y) | (unsigned)(x) << 32 - (y))
unsigned ror(unsigned x, unsigned y)
{
    return ROR(x, y);
}

as C, using gcc 6, with -O2 -S, and this is the assembly I got:

    .file   "test.c"
    .text
    .p2align 4,,15
    .globl  ror
    .type   ror, @function
ror:
.LFB0:
    .cfi_startproc
    movl    %edi, %eax
    movl    %esi, %ecx
    rorl    %cl, %eax
    ret
    .cfi_endproc
.LFE0:
    .size   ror, .-ror
    .ident  "GCC: (Debian 6.4.0-1) 6.4.0 20170704"
    .section    .note.GNU-stack,"",@progbits

Please try to do the same, and report the assembly you get. If your test program is substantially different from mine, please tell us how it differs. If you are using a different compiler or a different version of GCC please tell us exactly which one.

Incidentally, I get the same assembly output when I compile the code in the accepted answer for "Best practices for circular shift (rotate) operations in C++", as C.

score 3 · Answer 2 · answered Jul 15 '17 at 00:25

How old is your compiler? As I noted in the linked question, the UB-safe variable-count rotate idiom (with extra & masking of the count) confuses old compilers, like gcc before 4.9. Since you're not masking the shift count, it should be recognized with even older gcc.

Your big expression is maybe confusing the compiler. Write an inline function for rotate, and call it, like

value = rotr32(y & mask, 1) ^ rotr32(z & mask, n);

Much more readable, and may help stop the compiler from trying to do things in the wrong order and breaking the idiom before recognizing it as a rotate.

Maybe those are best practices in C++, and not in C?

My answer on the linked question clearly says that it's the best practice for C as well as C++. They are different languages, but they overlap completely for this, according to my testing.

Here's a version of the Godbolt link using -xc to compile as C, not C++. I had a couple C++isms in the link in the original question, for experimenting with integer types for the rotate count.

Like the original linked from the best-practices answer, it has a version that uses x86 intrinsics if available. clang doesn't seem to provide any in x86intrin.h, but other compilers have _rotl / _rotr for 32-bit rotates, with other sizes available.

Actually, I talked about rotate intrinsics at length in the answer on the best-practices question, not just in the godbolt link. Did you even read the answer there, apart from the code block? (If you did, your question doesn't reflect it.)

Using intrinsics, or the idiom in your own inline function, is much better than using inline asm. Asm defeats constant-propagation, among other things. Also, compilers can use BMI2 rorx dst, src, imm8 to copy-and-rotate with one instruction, if you compile with -march=haswell or -mbmi2. It's a lot harder to write an inline-asm rotate that can use rorx for immediate-count rotates but ror r32, cl for variable-count rotates. You could try with _builtin_constant_p(), but clang evaluates that before inlining, so it's basically useless for meta-programming style choice of which code to use. It works with gcc though. But it's still much better not to use inline asm unless you've exhausted all other avenues (like asking on SO) to avoid it. https://gcc.gnu.org/wiki/DontUseInlineAsm

Fun fact: the rotate functions in gcc's x86intrin.h are just pure C using the rotate idiom that gcc recognizes. Except for 16-bit rotates, where they use __builtin_ia32_rolhi.

Oh, _that's_ how you can get C out of godbolt. Why didn't I ever think of that? — zwol, Jul 15 '17 at 13:41

Brett Hale · Accepted Answer · 2017-07-14T18:17:00.710

2

You might need to be a bit more specific with what integral type / width you're rotating, and whether you have a fixed or variable rotation. ror{b,w,l,q} (8, 16, 32, 64-bit) has forms for (1), imm8, or the %cl register. As an example:

static inline uint32_t rotate_right (uint32_t u, size_t r)
{
    __asm__ ("rorl %%cl, %0" : "+r" (u) : "c" (r));
    return u;
}

I haven't tested this, it's just off the top of my head. And I'm sure multiple constraint syntax could be used to optimize cases where a constant (r) value is used, so %e/rcx is left alone.

If you're using a recent version of gcc or clang (or even icc). The intrinsics header <x86intrin.h>, may provide __ror{b|w|d|q} intrinsics. I haven't tried them.

edited Jul 14 '17 at 18:17

answered Jul 14 '17 at 18:03

Brett Hale

21,653
2
61
90

1

Yeah, you can do this…but don't. Just write the C code to make the compiler generate it. That's far superior to inline assembly for portability, maintainability, and even speed (it's much easier on the optimizer). Also, you don't need to hardcode a register there, since you're specifying it in the clobbers (though you do need to specify a byte size). And you'd be much better off with separate inputs/outputs, letting the compiler join them if it was advantageous. Well, this is why you shouldn't use inline asm! :-) – Cody Gray - on strike Jul 14 '17 at 18:34
1

@CodyGray - There are no clobbers specified in this asm statement. More to the point - the brief was "how can I force C to use the ROR x86 instruction to rotate my data?" I also don't really need a lecture on writing quality code, thanks. – Brett Hale Jul 14 '17 at 18:41
1

Sorry. My brain said "constraints", but my fingers typed "clobbers". But your paraphrase of the question is utterly inaccurate. The title asks "how to write rotate code in C to compile into the ror x86 instruction", and the question body says he doesn't *mind* using inline asm if he *has* to, which implies that he would rather not if there is an alternative. There is good reason to avoid it, and it is absolutely not necessary. I didn't mean for you to interpret my pointing this out as a lecture, but there is a *lot* of bad advice disseminated on SO about inline asm, and this doesn't help. – Cody Gray - on strike Jul 14 '17 at 18:44
1

You could use `asm ("ror %1, %0" : "+r" (u) : "cI" ((uint8_t)r));` to allow an immediate. But if it does choose a register, put it in `%cl`. (You can use operand-size overrides like `%b1` to get `%cl` instead of `%ecx`, but this may break if it chooses an immediate). I guess for rotate you might as well force the operand-size with the `l` suffix, but you could let it work for whatever register size the compiler gives it by omitting it. Tested and works for compile-time constant or variable: https://godbolt.org/g/Yuf1SM. But definitely don't recommend using inline asm for this!! – Peter Cordes Jul 14 '17 at 23:25
Your suggestion to use the rotate intrinsics is the best part of this answer, for solving problems with the compiler recognizing rotate idioms. Although IIRC, gcc has no `__builtin_ia32_rotate` except for 16-bit rotates, the "intrinsics" are just inline functions that use the normal idiom, like everyone is telling the OP to write :P Anyway, that allows constant-propagation through the rotate, and allows the compiler to use BMI2 `rorx dst, src, imm8` when available. Among other optimization wins. – Peter Cordes Jul 15 '17 at 00:00
1

I just had another look at the best-practices answer I wrote. It already covers the intrinsics situation. I'm really tempted to just close this question as a duplicate of that and say "go read the whole answer". – Peter Cordes Jul 15 '17 at 00:26

Zibri · Answer 4 · 2019-03-30T11:13:20.143

-2

Best Way:

#define rotr32(x, n) (( x>>n  ) | (x<<(64-n)))
#define rotr64(x, n) (( x>>n  ) | (x<<(32-n)))

More generic:

#define rotr(x, n) (( x>>n  ) | (x<<((sizeof(x)<<3)-n)))

And it compiles (in GCC) with exactly the same code as the asm versions below.

For 64 bit:

__asm__ __volatile__("rorq %b1, %0" : "=g" (u64) : "Jc" (cShift), "0" (u64));

or

static inline uint64_t CC_ROR64(uint64_t word, int i)
{
    __asm__("rorq %%cl,%0"
        :"=r" (word)
        :"0" (word),"c" (i));
    return word;
}

edited Mar 30 '19 at 11:13

answered Mar 29 '19 at 20:47

Zibri

9,096
3
52
44

1

You don't need or want `asm volatile`; rotate is a pure function that doesn't need to be re-run with the same inputs. (Or run at all if the result is unused). You want to let compilers do common subexpression elimination (CSE) on it. But it still defeats constant propagation, so it's still usually better to write it in pure C. – Peter Cordes Mar 29 '19 at 22:07
@PeterCordes Yes, that's why I used the latter and checking with objdump the code produced is ok. So no need for ASM at all. – Zibri Mar 30 '19 at 10:54
The first suggestion or code-block in your answer should be the *best* way, not the *worst* way. – Peter Cordes Mar 30 '19 at 10:55
1

That's an improvement, but your macros don't cast `x` to the proper type so using it on signed `x`, or 32-bit `x`, will result in garbage and/or undefined behaviour. (e.g. arithmetic right shift as one of the inputs to the OR). [Best practices for circular shift (rotate) operations in C++](//stackoverflow.com/q/776508) already has good safe versions of these; this question is about the OP confusing the compiler's idiom-recognition with a complicated expression. – Peter Cordes Mar 30 '19 at 11:14
no need for casting... my rotr is meant to work in 8,16,32 and 64 bits... depending on type of x. – Zibri Mar 30 '19 at 11:57
1

Then you need to use `sizeof(x)*CHAR_BIT` instead of hard-coding 32 or 64. But really you want the function (macro) name to set the rotation width, and promote for you. e.g. if you have an `unsigned char`, and want to rotate it to some position in a `uint64_t`, you expect `rotr64` to promote to 64 for you. Also your functions have undefined behaviour for `n=0` (because then you're doing `x<<64`). See the best-practices Q&A. – Peter Cordes Mar 30 '19 at 12:04

How to write rotate code in C to compile into the `ror` x86 instruction?

4 Answers4