8

For the following function...

uint16_t swap(const uint16_t value)
{
    return value << 8 | value >> 8;
}

...why does ARM gcc 6.3.0 with -O2 yield the following assembly?

swap(unsigned short):
  lsr r3, r0, #8
  orr r0, r3, r0, lsl #8
  lsl r0, r0, #16         # shift left
  lsr r0, r0, #16         # shift right
  bx lr

It appears the compiler is using two shifts to mask off the unwanted bytes, instead of using a logical AND. Could the compiler instead use and r0, r0, #4294901760?

Jim V
  • 1,131
  • 12
  • 22
  • 4
    Not an expert, but I guess loading the constant would take more time than working with values already in registers. Like using xor r0, r0 instead of loading 0 – ZeroUltimax Dec 12 '17 at 20:16
  • There exist many situations where a compiler *could* sometimes do something differently and better, but either it cannot see that the optimization can be applied or it has just not been implemented or it doesn't actually matter in real life so it doesn't bother or it's a speed/space tradeoff, etc. Compilers aren't magic. If you think it could do better; get involved and submit a patch for your compiler, for the benefit of us all :) – Jesper Juhl Dec 12 '17 at 20:20
  • On ARMv6 or later, the compiler should just emit `REV16 r0,r0` for this. – fuz Dec 12 '17 at 20:23
  • @fuz Your comment shows that you have not understood the problem. – EOF Dec 12 '17 at 20:51
  • Older ARM asm cannot create constants easily. Instead, they are loaded into literal pools and then read in via a memory load. The `and` you suggest can only take, I believe, an 8-bit literal with a shift. Your 0xFFFF0000 requires 16 bits, so it can't be done in one instruction. So, we can load from memory and `and` (slow), take 2 instructions to create the value and 1 to `and` (longer), or just shift twice cheaply. – Michael Dorgan Dec 12 '17 at 20:54
  • @fuz: Besides that, with `-march=armv6`, gcc will use `uxth r0, r0` (https://godbolt.org/g/P3ASv3) to truncate to 16 bits, in case the caller left high garbage in `r0`. So ARMv6 does answer the actual question as well. – Peter Cordes Dec 13 '17 at 00:08
  • @EOF: Don't be so sure about that: The calling convention guarantees that inputs / outputs have their upper bits zeroed, and `rev16` doesn't disturb that. So using `rev16` instead of shift/or avoids needing zero-extension. (GCC is dumb and use `uxth` anyway though.) So Fuz is correct: a compiler *should* just emit `rev16 r0,r0`. – Peter Cordes Dec 13 '17 at 01:32

4 Answers

8

Older ARM assembly cannot create constants easily. Instead, they are loaded into literal pools and then read in via a memory load. The and you suggest can only take, I believe, an 8-bit literal with a shift. Your 0xFFFF0000 requires 16 significant bits, so it cannot be encoded in a single instruction.

So, we can load the constant from memory and do an and (slow), take 2 instructions to create the value plus 1 to and (longer), or just shift twice cheaply and call it good.

The compiler chose the shifts and honestly, it is plenty fast.

Now for a reality check:

Worrying about a single shift, unless this is a 100%-for-sure bottleneck, is a waste of time. Even if the compiler were sub-optimal, you would almost never feel it. Worry about "hot" loops in code instead for micro-optimizations like this. Looking at this out of curiosity is awesome. Worrying about this exact code for performance in your app, not so much.


Edit:

It has been noted by others here that newer versions of the ARM architecture allow this sort of thing to be done more efficiently. This shows that it is important, when talking at this level, to specify the chip, or at least the exact ARM architecture version, we are dealing with. I was assuming ancient ARM given the lack of "newer" instructions in your output. If we are tracking compiler bugs, then this assumption may not hold, and knowing the target is even more important. For a swap like this, there are indeed simpler instructions to handle it in later versions.


Edit 2

One thing that could possibly make this faster is to have it inlined. In that case, the compiler could interleave these operations with other work. Depending on the CPU, this could double the throughput here, as many ARM CPUs have two integer instruction pipelines. Spread the instructions out enough so that there are no hazards, and away it goes. This has to be weighed against I-cache usage, but in a case where it mattered, you could see something better.
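
As a rough sketch of what that could look like (the function names here are illustrative, not from the question): putting the swap where the compiler can see it, e.g. as a static inline function in a header, lets it schedule the shifts around neighbouring work.

#include <stdint.h>

// Visible to the caller, so the compiler can inline and interleave it.
static inline uint16_t swap_inline(uint16_t value)
{
    return (uint16_t)(value << 8 | value >> 8);
}

// Hypothetical caller: the two swaps are independent, so a dual-issue
// core can overlap them once they are inlined.
uint32_t pack_two(uint16_t a, uint16_t b)
{
    return (uint32_t)swap_inline(a) << 16 | swap_inline(b);
}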

Michael Dorgan
  • 12,453
  • 3
  • 31
  • 61
  • 2
    The reason to worry is not one specific app, it's whether you should report a [missed-optimization gcc bug](https://gcc.gnu.org/bugzilla/buglist.cgi?quicksearch=missed-optimization&list_id=195134) so the compiler can generate slightly faster and/or smaller code for *everyone* in the future. – Peter Cordes Dec 13 '17 at 00:04
  • Why does the compiler bother with the upper half at all (as I understand it)? The uint16_t part of the register surely is already set correctly. Does the EABI prescribe that return values be given back in whole registers, not only in part, if sizeof(return val) < sizeof(register)? – Vroomfondel Dec 13 '17 at 14:25
  • ARM does not have 16-bit registers, so it needs to clear the top bits away. I am pretty certain that returning "garbage" in the upper half of the register would be against the rules, 16-bit width specified or no. – Michael Dorgan Dec 14 '17 at 17:41
  • Inlining such a tiny function is probably a win most of the time, although ARM is uniquely able to save / restore more call-preserved registers without extra code size (more set bits in push/pop). Still, a function call takes instructions at each call site, too, and clobbers R0-R3, so it's close even purely for code-size. The compiler can probably save an instruction if it doesn't need to zero the high 16, e.g. because it knows it's using `strh` on the resulting value. (@Vroomfondel: yes, see my answer, testing with simpler functions shows the ABI requires zero-extended inputs/outputs) – Peter Cordes Dec 15 '17 at 18:52
3

There is a missed-optimization here, but and isn't the missing piece. Generating a 16-bit constant isn't cheap. For a loop, yes it would be a win to generate a constant outside the loop and use just and inside the loop. (TODO: call swap in a loop over an array and see what kind of code we get.)
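
Here is a minimal sketch of that experiment (the helper name swap_buf is just for illustration); compiling it at -O2 shows whether gcc hoists a mask constant out of the loop:

#include <stdint.h>
#include <stddef.h>

// Byte-swap each 16-bit element in place; inspect the generated loop body.
void swap_buf(uint16_t *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        buf[i] = (uint16_t)(buf[i] << 8 | buf[i] >> 8);
}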

For an out-of-order CPU, it could also be worth using multiple instructions off the critical path to build a constant, then you only have one AND on the critical path instead of two shifts. But that's probably rare, and not what gcc chooses.


AFAICT (from looking at compiler output for simple functions), the ARM calling convention guarantees there's no high garbage in input registers, and doesn't allow leaving high garbage in return values. i.e. on input, it can assume that the upper 16 bits of r0 are all zero, but must leave them zero on return. The value << 8 left shift is thus a problem, but the value >> 8 isn't (it doesn't have to worry about shifting garbage down into the low 16).

(Note that x86 calling conventions aren't like this: return values are allowed to have high garbage. (Maybe because the caller can simply use the 16-bit or 8-bit partial register). So are input values, except as an undocumented part of the x86-64 System V ABI: clang depends on input values being sign/zero extended to 32-bit. GCC provides this when calling, but doesn't assume as a callee.)
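
For reference, the simple-function probes referred to above can be as small as the following (a sketch; the names are illustrative, not the exact functions tested). Whether gcc emits an extension in each one tells you what it assumes about the ABI:

#include <stdint.h>

// If no uxth shows up here, gcc trusts the caller to pass the uint16_t
// argument already zero-extended in r0.
uint32_t widen_arg(uint16_t x) { return x + 1u; }

// The sum can carry into bit 16, so if a uxth shows up before the return,
// gcc is obliged to hand back a zero-extended 16-bit result.
uint16_t inc16(uint16_t x) { return (uint16_t)(x + 1); }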


ARMv6 has a rev16 instruction which byte-swaps the two 16-bit halves of a register. If the upper 16 bits are already zeroed, they don't need to be re-zeroed, so gcc -march=armv6 should compile the function to just rev16. But in fact it emits a uxth to extract and zero-extend the low half-word. (i.e. exactly the same thing as and with 0x0000FFFF, but without needing a large constant). I believe this is pure missed optimization; presumably gcc's rotate idiom, or its internal definition for using rev16 that way, doesn't include enough info to let it realize the top half stays zeroed.

swap:                @@ gcc6.3 -O3 -march=armv6 -marm
    rev16   r0, r0
    uxth    r0, r0     @ not needed
    bx      lr

For ARM pre v6, a shorter sequence is possible. GCC only finds it if we hand-hold it towards the asm we want:

// better on pre-v6, worse on ARMv6 (defeats rev16 optimization)
uint16_t swap_prev6(const uint16_t value)
{
    uint32_t high = value;
    high <<= 24;            // knock off the high bits
    high >>= 16;            // and place the low8 where we want it
    uint8_t low = value >> 8;
    return high | low;
    //return value << 8 | value >> 8;
}


swap_prev6:            @ gcc6.3 -O3 -marm.   (Or armv7 -mthumb for thumb2)
    lsl     r3, r0, #24
    lsr     r3, r3, #16
    orr     r0, r3, r0, lsr #8
    bx      lr

But this defeats gcc's rotate-idiom recognition, so it compiles to this same code even with -march=armv6, where the simple version compiles to rev16 / uxth.

All source + asm on the Godbolt compiler explorer

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
2

ARM is a RISC machine (Advanced RISC Machine), and thus all instructions are encoded in the same size, capped at 32 bits.

Immediate values in instructions are given a certain number of bits, and the AND instruction simply doesn't have enough bits in its immediate field to express an arbitrary 16-bit value.

That's the reason for the compiler resorting to two shift instructions instead.

However, if your target CPU is ARMv6 (ARM11) or higher, the compiler takes advantage of the new REV16 instruction, and then zero-extends the lower 16 bits with a UXTH instruction, which is unnecessary and stupid, but there is simply no conventional way to persuade the compiler not to do it.

If you think that you would be served well by GCC intrinsic __builtin_bswap16, you are dead wrong.

uint16_t swap(const uint16_t value)
{
    return __builtin_bswap16(value);
}

The function above generates exactly the same machine code as your original C code did.

Using inline assembly doesn't help either:

uint16_t swap(const uint16_t value)
{
    uint16_t result;
    __asm__ __volatile__ ("rev16 %[out], %[in]" : [out] "=r" (result) : [in] "r" (value));
    return result;
}

Again, exactly the same. You cannot get rid of the pesky UXTH as long as you use GCC; it simply cannot tell from the context that the upper 16 bits are already all zeros, and thus that the UXTH is unnecessary.

Write the whole function in assembly; that's the only option.
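
For illustration, one sketch of doing that from within GCC (the name and the naked-function approach are illustrative, assuming an ARMv6+ target and a caller that obeys the AAPCS zero-extension rule):

#include <stdint.h>

// No prologue, no epilogue, and no uxth: the body is exactly the asm we want.
// The value arrives zero-extended in r0 per the calling convention.
__attribute__((naked)) uint16_t swap_asm(uint16_t value)
{
    __asm__ ("rev16 r0, r0\n\t"
             "bx    lr");
}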

Jake 'Alquimista' LEE
  • 6,197
  • 2
  • 17
  • 25
  • You don't need or want `volatile` on that asm statement: it's a pure function, and you don't need the compiler to re-run `rev16` if it already has an output value for the same input. i.e. you want to let the compiler CSE multiple `swap(x)` for the same `x`. You can get rid of the `uxth` if you write the function with `uint32_t` args, but presumably that just offloads the `uxth` onto the caller in most cases. I tried to use `uint32_t result` and `if (result > 0xFFFFu) __builtin_unreachable();` to promise the compiler that the upper 2 bytes stay zero, but no luck. https://godbolt.org/g/NEidyW – Peter Cordes Dec 17 '17 at 11:34
  • If you're on ARMv6, it's probably better to let rev16/uxth inline than to actually write the function in asm. Especially if constant-propagation through a swap can ever happen. Or maybe write a C wrapper that uses `__builtin_constant_p` to decide whether to call the asm version or use a pure C swap. – Peter Cordes Dec 17 '17 at 11:36
  • What I meant was the whole function, including the caller, and the caller of the caller, and the caller of the caller....... :-) – Jake 'Alquimista' LEE Dec 17 '17 at 23:14
  • `__builtin_bswap16(value)` will generate 'rev16' if you have a recent gcc version and give it proper compiler options (`-march=armv7-a` for instance). Maybe [godbolt](https://godbolt.org/g/9KqJ7b) or [Does ARM have a builtin_rev](https://stackoverflow.com/questions/35133829/does-arm-gcc-have-a-builtin-function-for-the-assembly-rev-instruction) are relevant. It is useful to get working as the code will be able to update over ARM CPUs and/or could be used with different CPUs and GCC. – artless noise Dec 18 '17 at 15:23
  • @artlessnoise So it took almost 20 years for GCC to finally be able to compile it properly, then. – Jake 'Alquimista' LEE Dec 19 '17 at 05:15
0

This is the optimal solution: the AND would require at least two more instructions, possibly having to stop and wait for a load of the mask value from memory. So it is worse in a couple of ways.

00000000 <swap>:
   0:   e1a03420    lsr r3, r0, #8
   4:   e1830400    orr r0, r3, r0, lsl #8
   8:   e1a00800    lsl r0, r0, #16
   c:   e1a00820    lsr r0, r0, #16
  10:   e12fff1e    bx  lr

00000000 <swap>:
   0:   ba40        rev16   r0, r0
   2:   b280        uxth    r0, r0
   4:   4770        bx  lr

The latter is armv7, and it is shorter because they added instructions to support exactly this kind of work.

Fixed-length RISC instructions by definition have a problem with constants. MIPS chose one way, ARM chose another. Constants are a problem on CISC as well, just a different problem. It is not difficult to create something that takes advantage of ARM's barrel shifter and shows a disadvantage of the MIPS solution, and vice versa.
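
As a concrete illustration (a sketch, not from the disassembly above): the classic ARM data-processing immediate is an 8-bit value rotated by an even amount, so some masks fit in a single instruction and others do not.

#include <stdint.h>

// 0xFF000000 is 0xFF rotated, so it encodes as a single AND-immediate.
uint32_t keep_top_byte(uint32_t x) { return x & 0xFF000000u; }

// 0xFFFF0000 has 16 significant bits, so it cannot be an AND-immediate;
// the compiler falls back to another sequence (a shift pair, a literal-pool
// load, or movw/movt on newer cores).
uint32_t keep_top_half(uint32_t x) { return x & 0xFFFF0000u; }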

The solution actually has a bit of elegance to it.

Part of this as well is the overall design of the target.

unsigned short fun ( unsigned short x )
{
    return(x+1);
}

0000000000000010 <fun>:
  10:   8d 47 01                lea    0x1(%rdi),%eax
  13:   c3                      retq   

gcc chooses not to return the 16-bit value you asked for; it returns a 32-bit one, so it doesn't strictly implement the function I asked for with my code. But that is okay if, when the user of the data gets that result or uses it, the masking happens there, or if on this architecture ax is used instead of eax, for example.

unsigned short fun ( unsigned short x )
{
    return(x+1);
}

unsigned int fun2 ( unsigned short x )
{
    return(fun(x));
}


0000000000000010 <fun>:
  10:   8d 47 01                lea    0x1(%rdi),%eax
  13:   c3                      retq   

0000000000000020 <fun2>:
  20:   8d 47 01                lea    0x1(%rdi),%eax
  23:   0f b7 c0                movzwl %ax,%eax
  26:   c3                      retq   

A compiler design choice (likely based on the architecture), not an implementation bug.

Note that for a sufficiently sized project, it is easy to find missed optimization opportunities. There is no reason to expect an optimizer to be perfect (it isn't and can't be). It just needs to be more efficient, on average, than a human doing it by hand for a project of that size.

This is why it is commonly said that for performance tuning you don't pre-optimize or jump straight to asm: you use the high-level language and the compiler, profile your way through to find the performance problems, and then hand-code those. Why hand-code them? Because we know we can at times outperform the compiler, which implies the compiler output can be improved upon.

This isn't a missed optimization opportunity; it is instead a very elegant solution for the instruction set. Masking a byte is simpler:

unsigned char fun ( unsigned char x )
{
    return((x<<4)|(x>>4));
}

00000000 <fun>:
   0:   e1a03220    lsr r3, r0, #4
   4:   e1830200    orr r0, r3, r0, lsl #4
   8:   e20000ff    and r0, r0, #255    ; 0xff
   c:   e12fff1e    bx  lr

00000000 <fun>:
   0:   e1a03220    lsr r3, r0, #4
   4:   e1830200    orr r0, r3, r0, lsl #4
   8:   e6ef0070    uxtb    r0, r0
   c:   e12fff1e    bx  lr

the latter being armv7; with armv7 they recognized and solved these issues. You can't expect the programmer to always use natural-sized variables; some feel the need to use less optimally sized variables, and sometimes you still have to mask to a certain size.

old_timer
  • 69,149
  • 8
  • 89
  • 168
  • The calling conventions are different on ARM vs. x86. ARM apparently requires inputs and outputs to be zero extended, while x86 allows high garbage in inputs and outputs. So `lea 0x1(%rdi),%eax` isn't "wrong" or doing something other than what you asked for. Anyway, isn't `rev16 r0, r0` / `uxth r0,r0` a missed optimization? The upper half stays zero, so all you need is `rev16`. – Peter Cordes Dec 13 '17 at 01:01
  • understood it isnt generating a 16 bit result is the point, and as demonstrated it does the right thing overall (obviously, gcc would be unusable on x86 otherwise). No reason why the ARM or any other backend couldn't have done the same thing to save code overall, was a design decision not to. – old_timer Dec 13 '17 at 11:07
  • But that's the thing with assembly language: it *is* generating a 16-bit result, because that's how binary / 2's complement addition works ([and several other operations where high garbage doesn't affect the low bits](https://stackoverflow.com/questions/34377711/which-2s-complement-integer-operations-can-be-used-without-zeroing-high-bits-in)). The fact that it's also generating an 8-bit and 32-bit result, and not truncating it to 16 bits, is not important. (I know what you mean from your comment, now I'm just nit-picking over wording I guess.) – Peter Cordes Dec 13 '17 at 11:12
  • But as far as the ARM backend skipping the truncation step, no, it can't, because the ARM calling convention requires the return value to be truncated to 16 bits. *This* (the calling convention) is the relevant design decision, which was probably taken to reduce overall code-size. (More call sites than return paths for most functions, I guess). – Peter Cordes Dec 13 '17 at 11:14
  • I understand the nit pick...the entire point was to show that it was returning some garbage as you say, an incorrect result as I say because it can overflow the 16 bit result. – old_timer Dec 13 '17 at 11:14
  • But on x86 with the System V calling convention (as written, not the unofficial extend-to-32 convention that gcc and clang follow, and clang depends on) the *inputs* are also allowed to contain high garbage. So the high bits are doubly incorrect even without overflow. – Peter Cordes Dec 13 '17 at 11:16
  • the calling convention a compiler chooses to use is up to the compiler authors, nobody else. they happened to choose one solution, with a x86 style solution they could save instructions at the higher level as x86 does. – old_timer Dec 13 '17 at 11:16
  • you are getting hung up in existing compiler conventions and I am talking about something else. – old_timer Dec 13 '17 at 11:17
  • For x86-64 System V, [yes gcc devs were involved in designing the calling convention](https://stackoverflow.com/questions/4429398/why-does-windows64-use-a-different-calling-convention-from-all-other-oses-on-x86/35619528#35619528). For 32-bit, the i386 System V ABI was inherited from earlier systems, I think before gcc was ported to that target. For gcc targeting the Windows x86-64 calling convention, gcc has zero choice (except for private `static` functions with no callers outside a compilation unit. It's a minor missed optimization when they don't customize the calling convention there). – Peter Cordes Dec 13 '17 at 11:19
  • I don't think a C implementation that wasn't compatible with others on the same platform would get very much use. Maybe it would be an option on some embedded systems, which is why you're arguing that compilers could just make up whatever calling convention they want. It's usually not that simple. – Peter Cordes Dec 13 '17 at 11:21