How can I elegantly take advantage of ARM instructions like REV and RBIT when writing C code?

Question

I am writing C code which may be compiled for the Arm Cortex-M3 microcontroller.

This microcontroller supports several useful instructions for efficiently manipulating bits in registers, including REV*, RBIT, SXT*.

When writing C code, how can I take advantage of these instructions if I need those specific functions? For example, how can I complete this code?

#define REVERSE_BIT_ORDER(x)    {  /* what to write here? */ }

I would like to do this without using inline assembler so that this code is both portable, and readable.

Added:

In part, I am asking how to express such a function in C elegantly. For example, it's easy to express bit shifting in C, because it's built into the language. Likewise, setting or clearing bits. But bit reversal is unknown in C, and so is very hard to express. For example, this is how I would reverse bits:

unsigned int ReverseBits(unsigned int x)
{
    unsigned int ret = 0;
    for (int i=0; i<32; i++)
    {
        ret <<= 1;
        if (x & (1<<i))
            ret |= 1;
    }
    return ret;
}

Would the compiler recognise this as bit reversal, and issue the correct instruction?

Use a version of [GCC that specifically target your core](https://developer.arm.com/Tools%20and%20Software/GNU%20Toolchain). — Gerhard, Jan 09 '23 at 10:48
`sxt*` is trivial, the compiler will use those to implement casts from narrow signed types to int or unsigned. To get it to emit REV or RBIT, you might need an intrinsic. `endian.h` for REV (or [Does ARM GCC have a builtin function for the assembly 'REV' instruction?](https://stackoverflow.com/q/35133829)), but RBIT might need something ARM-specific. You wouldn't want inline asm, though; `__rbit` should be in some header if GCC has a compatible version: https://developer.arm.com/documentation/dui0472/k/Compiler-specific-Features/--rbit-intrinsic — Peter Cordes, Jan 09 '23 at 10:53
Before worrying about such details, perhaps start with the basics of embedded programming: do not use signed types for bitwise arithemtic and always use stdint.h. `1<<31` invokes undefined behavior. A function without return where the caller uses the result invokes undefined behavior. And so on. Worry about the code being a buggy mess before worrying about micro-optimizations. — Lundin, Jan 09 '23 at 11:51
@Lundin - Point taken, crappy code corrected. Question still stands. — Rocketmagnet, Jan 09 '23 at 11:53
@Lundin - I feel like we're getting lost in micro optimisations of the question here, rather than addressing the spirit of the question. This was just supposed to be a toy example to represent the difficulty of expressing "bit reversal" to the compiler in a way that might give it the opportunity to emit the RBIT instruction. If you don't know the answer to my question, that's fine, nor do I. But please let's not clutter up the comments with pedantry. — Rocketmagnet, Jan 09 '23 at 12:20
In ISO C++20, they finally added `std::byteswap`, so that covers `rev` and `revsh` (uint32_t and uint16_t). https://en.cppreference.com/w/cpp/numeric/byteswap . You'd still need an intrinsic to swap the upper 16 while leaving the low 16 unmodified. In ISO C you'd be dependent on the compiler recognizing an idiom and doing a peephole optimization to bit-reverse or byte-reverse. In a more modern language like Rust, `foo.reverse_bits()` (since Rust 1.37) and `foo.swap_bytes()` just work for any type on any ISA. https://doc.rust-lang.org/std/primitive.u32.html#method.reverse_bits — Peter Cordes, Jan 09 '23 at 13:35
I disassembled your code on an ARM target to see what it would generate, and then I noticed that it generated nonsense code with arithmetic shifts. Fixing the bugs gave different machine code. So it is highly relevant to the question. — Lundin, Jan 09 '23 at 13:36
You can see that no, neither GCC nor clang optimize your code to `rbit` for ARM or AArch64. https://godbolt.org/z/Y7noP61dE . Presumably looping over bits in the other direction isn't any better. Perhaps a bithack? GCC and clang recognize the standard bithack for popcount. — Peter Cordes, Jan 09 '23 at 13:42
@Lundin - This isn't code that I'm using in practice. Again, that was just a *toy* example to demonstrate the inelegance and difficulty of representing something like "bit swap" as opposed to the elegance and ease of representing something like "bit shift". If you know the answer to the question, I'd be happy to hear it. If you don't know the answer to the question, or can't understand it due to a bug in the toy example you are welcome to not answer it. — Rocketmagnet, Jan 09 '23 at 13:42

Lundin · Answer 1 · 2023-01-10T07:52:41.133

4

Reversing bits in a 32 bit integer is such an exotic instruction so that might be why you can't reproduce it. I was able to generate code that utilizes REV (reverse byte order) however, which is a far more common use-case:

#include <stdint.h>

uint32_t endianize (uint32_t input)
{
  return ((input >> 24) & 0x000000FF) |
         ((input >>  8) & 0x0000FF00) |
         ((input <<  8) & 0x00FF0000) |
         ((input << 24) & 0xFF000000) ;
}

With gcc -O3 -mcpu=cortex-m3 -ffreestanding (for ARM32, vers 11.2.1 "none"):

endianize:
        rev     r0, r0
        bx      lr

https://godbolt.org/z/odGqzjTGz

It works for clang armv7-a 15.0.0 too, long as you use -mcpu=cortex-m3.

So this would support the idea of avoiding manual optimizations and let the compiler worry about such.

edited Jan 10 '23 at 07:52

answered Jan 09 '23 at 14:26

Lundin

195,001
40
254
396

As a side note, I noticed that gcc "none" fails to take advantage of rev unless you use `-mcpu=cortex-m3`, whereas the same ARM gcc for 32 bit Linux uses rev as default. – Lundin Jan 09 '23 at 14:28
UV - good example. I would also add -mthumb tho the command line (most M3 uC support only thumb instructions) – 0___________ Jan 09 '23 at 14:34
@0___________ It doesn't seem to make a difference in this case. I noticed that the ARM manual says "You can use SP in ARM instructions but this is deprecated in ARMv6T2 and above. You cannot use SP in Thumb instructions." So thumb might make a lil bit of difference for some other very specific scenario? – Lundin Jan 09 '23 at 14:39
It wasn't criticism only remark. It is good example – 0___________ Jan 09 '23 at 14:52
Thanks Lundin, this is quite a good example. Actually, it also illustrates the problems with writing code and hoping that the compiler will produce the correct optimisations. In this case it works correctly with the ARM architecture, but I have implemented exactly this function on an 8-bit architecture, and it produced spectacularly inefficient code. In that case, I used a union instead. We keep talking about "manual optimisations", rather than "writing readable and portable code which the compiler can optimise correctly when it has the chance". – Rocketmagnet Jan 09 '23 at 16:03
@Rocketmagnet 8 bitter compilers will almost always produce horrible code for 32 bit arithmetic. Actually swapping my Godbolt example to gcc for AVR results in `rcall __bswapsi2` and what that internal function does, I don't know. (I did find the implementation of __bswapsi2 for some llvm lib and it looks identical to my code.) – Lundin Jan 09 '23 at 16:11
`asm("rev r0,r0")` would not be remotely safe, please don't give broken examples of inline asm. You need `asm("rev %0,%1" : "=r"(input) : "r"(input))` to tell the compiler what registers are inputs and outputs, so this works even after inlining. Or if you're using compiler-specific stuff anyway, much better to `return __builtin_bswap32(input)` as in [Does ARM GCC have a builtin function for the assembly 'REV' instruction?](https://stackoverflow.com/q/35133829), which is portable GNU C that works on any ISA. Not as portable as your pure ISO C, though. – Peter Cordes Jan 10 '23 at 07:17
@PeterCordes Of course it won't (and inline asm isn't even standardized in C). That was just added as an afterthought and I have no intention of providing working inline assembler, since that's not in the slightest related to the question. – Lundin Jan 10 '23 at 07:38
@Lundin: An unsafe / broken inline asm statement is worse than nothing for inexperienced readers who don't know any better, especially if you don't *say* that it's broken. I don't mind if you don't like my edit, but you should fix your answer to not imply that `asm("rev r0,r0")` could actually work. – Peter Cordes Jan 10 '23 at 07:48
@PeterCordes Fine, I removed that part. – Lundin Jan 10 '23 at 07:53
Thanks, I think that's an important improvement. Inline asm seems to be hard enough for many beginners without incorrect examples floating around. (Or *because* of incorrect examples people have seen that happened to work in some case.) – Peter Cordes Jan 10 '23 at 07:54

score 1 · Answer 2 · answered Jan 09 '23 at 11:18

1

It would be best if you used CMSIS intrinsic.

__REV, __REV16 etc. Those CMSIS header files contain much much more.

You can get them from here:

https://github.com/ARM-software/CMSIS_5

and you are looking for cmsis_gcc.h file (or similar if you use another compiler).

answered Jan 09 '23 at 11:18

0___________

60,014
4
34
74

Thank you for your answer. What I was asking was if there was an elegant, readable and *portable* way to take advantage of these instructions without using hardware specific code (like inline assembler or intrinsics). – Rocketmagnet Jan 09 '23 at 11:24
1

@Rocketmagnet there is no portable way of forcing the compiler to use specific machine code instructions. But do not worry - you have limited choice of compilers when you target Cortex derived. And most of them are covered by ARM. So it is portable way as you can use any ARM Cortex targeting compiler – 0___________ Jan 09 '23 at 11:26
"Force" is a strong word. But if my application involved a lot of bit reversal in time critical code, and my hardware has an instruction to do that, will the compiler naturally take advantage of that? And is there some way to express the desire for bit reversal which will issue the instruction if it's available? – Rocketmagnet Jan 09 '23 at 11:30
For example, GCC can actually recognise code to count the number of bits in a word, and replace that with the popcount instruction if it exists: https://youtu.be/bSkpMdDe4g4?t=2382. But would the same happen for these Cortex-M3 instructions? – Rocketmagnet Jan 09 '23 at 11:34
Use `#if` s to make it portble – 0___________ Jan 09 '23 at 11:50
It's a very interesting video if you're interested in code performance. If not, here's a little article written by Matt Godbolt about Compiler Explorer: https://xania.org/202206/happy-birthday-ce – Rocketmagnet Jan 09 '23 at 11:57

Peter Cordes · Accepted Answer · 2023-01-10T11:25:16.587

@Lundin's answer shows a pure-C shift/mask bithack that clang recognizes and compiles to a single rev instruction. (Or presumably to x86 bswap if targeting x86, or equivalent instructions on other ISAs that have them.)

In portable ISO C, hoping for pattern-recognition is unfortunately the best you can do, because they haven't added portable ways to expose CPU functionality; even C++ took until C++20 to add the <bit> header for things like std::popcount and C++23 std::byteswap.

(Some fairly-portable C libraries / headers have byte-reversal, e.g. as part of networking there's ntohl net-to-host which is an endian-swap on little-endian machines. Or there's GCC's (or glibc's?) endian.h, with htobe32 being host to big-endian 32-bit. Man page. These are usually implemented with intrinsics that compile to a single instruction in good-quality implementations.

Of course, if you definitely want a byte swap regardless of host endianness, you could do htole32(be32toh(x)) because one of them's a no-op and the other's a byte-swap, since ARM is either big or little endian. (It's still a byte-swap even if neither of them are NOPs, even on PDP or other mixed-endian machines, but there might be more efficient ways to do it.)

There are also some "collections of useful functions" headers with intrinsics for different compilers, with functions like byte swap. These can be of varying quality in terms of efficiency and maybe even correctness.

You can see that no, neither GCC nor clang optimize your code to rbit for ARM or AArch64. https://godbolt.org/z/Y7noP61dE . Presumably looping over bits in the other direction isn't any better. Perhaps a bithack as in In C/C++ what's the simplest way to reverse the order of bits in a byte? or Efficient Algorithm for Bit Reversal (from MSB->LSB to LSB->MSB) in C .

CC and clang recognize the standard bithack for popcount, but I didn't check any of the answers on the bit-reverse questions.

Some languages, notably Rust, do care more about making it possible to portably express what modern CPUs can do. foo.reverse_bits() (since Rust 1.37) and foo.swap_bytes() just work for any type on any ISA. For u32 specifically, https://doc.rust-lang.org/std/primitive.u32.html#method.reverse_bits (That's Rust's equivalent of C uint32_t.)

Most mainstream C implementations have portable (across ISAs) builtins or (target-specific) intrinsics (like __REV() or __REV16() for stuff like this.

The GNU dialect of C (GCC/clang/ICC and some others) includes __builtin_bswap32(input). See Does ARM GCC have a builtin function for the assembly 'REV' instruction?. It's named after the x86 bswap instruction, but it's just a byte-reverse that GCC / clang compile to whatever instructions can do it efficiently on the target ISA.

There's also a __builtin_bswap16(uint16_t) for swapping the bytes of a 16-bit integer, like revsh except the C semantics don't include preserving the upper 16 bits of a 32-bit integer. (Because normally you don't care about that part.) See the GCC manual for the available GNU C builtins that aren't target-specific.

There isn't a GNU C builtin or intrinsic for bitwise reverse that I could find in the manual or GCC arm-none-eabi 12.2 headers.

ARM documents an __rbit() intrinsic for their own compiler, but I think that's Keil's ARMCC, so there might not be any equivalent of that for GCC/clang.

@0___________ suggests https://github.com/ARM-software/CMSIS_5 for headers that define a function for that.

If worst comes to worst, GNU C inline asm is possible for GCC/clang, given appropriate #ifdefs. You might also want if (__builtin_constant_p(x)) to use a pure-C bit-reversal so constant-propagation can happen on compile-time constants, only using inline asm on runtime-variable values.

   uint32_t output, input=...;
#if defined(__arm__) || defined (__aarch64__)
   // same instruction is valid for both
   asm("rbit %0,%1" : "=r"(output) : "r"(input));
#else 
 ...  // pure C fallback or something
#endif

Note that it doesn't need to be volatile because rbit is a pure function of the input operand. It's a good thing if GCC/clang are able to hoist this out of a loop. And it's a single asm instruction so we don't need an early-clobber.

This has the downside that the compiler can't fold a shift into it, e.g. if you wanted a byte-reverse, __rbit(x) >> 24 equals __rbit(x<<24), which could be done with rbit r0, r1, lsl #24. (I think).

With inline asm I don't think there's a way to tell the compiler that a r1, lsl #24 is a valid expansion for the %1 input operand. Hmm, unless there's a machine-specific constraint for that? https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html - no, no mention of "shifted" or "flexible" source operand in the ARM section.

Efficient Algorithm for Bit Reversal (from MSB->LSB to LSB->MSB) in C shows an #ifdefed version with a working fallback that uses a bithack to reverse bits within a byte, then __builtin_bswap32 or MSVC _byteswap_ulong to reverse bytes.

This is an excellent answer, and just what I was looking for. Many thanks. — Rocketmagnet, Jan 10 '23 at 10:48

score 1 · Answer 4 · answered Jan 12 '23 at 20:47

Interestingly, ARM gcc seems to have improved its detection of byte order reversing recently. With version 11, it would detect byte reversal if done by bit shifting, or by byte swapping through a pointer. However, from version 10 and backwards, the pointer method failed to issue the REV instruction.

    uint32_t endianize1 (uint32_t input)
    {
      return ((input >> 24) & 0x000000FF) |
             ((input >>  8) & 0x0000FF00) |
             ((input <<  8) & 0x00FF0000) |
             ((input << 24) & 0xFF000000) ;
    }

    uint32_t endianize2 (uint32_t input)
    {
        uint32_t output;
        uint8_t *in8  = (uint8_t*)&input;
        uint8_t *out8 = (uint8_t*)&output;

        out8[0] = in8[3];
        out8[1] = in8[2];
        out8[2] = in8[1];
        out8[3] = in8[0];
        return output;
    }

endianize1:
    rev     r0, r0
    bx      lr
endianize2:
    mov     r3, r0
    movs    r0, #0
    lsrs    r2, r3, #24
    bfi     r0, r2, #0, #8
    ubfx    r2, r3, #16, #8
    bfi     r0, r2, #8, #8
    ubfx    r2, r3, #8, #8
    bfi     r0, r2, #16, #8
    bfi     r0, r3, #24, #8
    bx      lr

https://godbolt.org/z/E3xGvG9qq

So, as we wait for optimisers to improve, there are certainly ways you can help the compiler understand your intent and take good advantage of the instruction set (without resorting to micro optimisations or inline assembler). But it's likely that this will involve a good understanding of the architecture by the programmer, and examination of the output assembler.

Take advantage of http://godbolt.org to help examine the compiler output, and see what produces the best output.

How can I elegantly take advantage of ARM instructions like REV and RBIT when writing C code?

4 Answers4

Linked