This is the optimal solution. The AND would require at least two more instructions, possibly having to stop and wait for a load of the mask value to happen. So it is worse in a couple of ways.
00000000 <swap>:
0: e1a03420 lsr r3, r0, #8
4: e1830400 orr r0, r3, r0, lsl #8
8: e1a00800 lsl r0, r0, #16
c: e1a00820 lsr r0, r0, #16
10: e12fff1e bx lr
00000000 <swap>:
0: ba40 rev16 r0, r0
2: b280 uxth r0, r0
4: 4770 bx lr
The latter is armv7, but it is shorter precisely because they added instructions to support this kind of work.
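To make the first listing easier to follow, here is a minimal C sketch of the two ways to mask to 16 bits in a 32 bit register. The function names are mine, not anything from the compiler, and it assumes a target where int is 32 bits wide.

```c
#include <stdint.h>

/* Mask to 16 bits the obvious way: AND with a constant.
   0xFFFF does not fit an ARM (A32) immediate, so the constant
   has to be built or loaded before the AND can happen. */
uint32_t mask16_and(uint32_t v)
{
    return v & 0xFFFFu;
}

/* Mask to 16 bits the way the first listing does it:
   shift the upper half out the top, then shift zeros back in. */
uint32_t mask16_shift(uint32_t v)
{
    return (v << 16) >> 16;
}

/* The byte swap done at register width, as in the first listing.
   Hypothetical name; x is assumed to arrive with bits 16..31 clear. */
uint32_t swap16(uint32_t x)
{
    return mask16_shift((x >> 8) | (x << 8));
}
```

Both mask functions give the same result for every input; the shift version just needs no constant at all.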
Fixed length RISC instructions have, by definition, a problem with constants. MIPS chose one way, ARM chose another. Constants are a problem on CISC as well, just a different problem. It is not difficult to create something that takes advantage of ARM's barrel shifter and shows a disadvantage of the MIPS solution, and vice versa.
The solution actually has a bit of elegance to it.
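A rough way to see the constant problem is to check which values fit in one instruction. This sketch assumes the classic A32 data-processing immediate (an 8 bit value rotated right by an even amount) and the MIPS andi/ori immediate (16 bits, zero extended). The function names are hypothetical and it does not model Thumb2 or movw/movt.

```c
#include <stdint.h>

/* Rotate left by n bits, n taken modulo 32. */
static uint32_t rotl32(uint32_t v, unsigned n)
{
    n &= 31u;
    return n ? (v << n) | (v >> (32u - n)) : v;
}

/* A32 data-processing immediate: an 8 bit value rotated right
   by an even amount. Returns 1 if v can be encoded, else 0. */
int arm_a32_imm_ok(uint32_t v)
{
    for (unsigned rot = 0; rot < 32; rot += 2) {
        /* Undo a rotate right by rot; what is left must fit in 8 bits. */
        if (rotl32(v, rot) <= 0xFFu)
            return 1;
    }
    return 0;
}

/* MIPS andi/ori immediate: 16 bits, zero extended. */
int mips_logical_imm_ok(uint32_t v)
{
    return v <= 0xFFFFu;
}
```

With these rules 0xFFFF is a one instruction mask on MIPS but not on ARM, while 0xFF000000 is the other way around.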
Part of this as well is the overall design of the target.
unsigned short fun ( unsigned short x )
{
return(x+1);
}
0000000000000010 <fun>:
10: 8d 47 01 lea 0x1(%rdi),%eax
13: c3 retq
gcc chooses not to return the 16 bit value you asked for; it returns a 32 bit one, so strictly speaking it doesn't implement the function I asked for with my code. But that is okay if the mask happens where the user of the data gets or uses that result, or, with this architecture, if ax is used instead of eax. For example:
unsigned short fun ( unsigned short x )
{
return(x+1);
}
unsigned int fun2 ( unsigned short x )
{
return(fun(x));
}
0000000000000010 <fun>:
10: 8d 47 01 lea 0x1(%rdi),%eax
13: c3 retq
0000000000000020 <fun2>:
20: 8d 47 01 lea 0x1(%rdi),%eax
23: 0f b7 c0 movzwl %ax,%eax
26: c3 retq
This is a compiler design choice (likely based on the architecture), not an implementation bug.
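As a sketch of why deferring the mask is safe, the following models the untruncated register value next to what the C source promises. fun_raw is a hypothetical helper that only stands in for what the lea leaves in eax, and the code assumes a 16 bit unsigned short and a 32 bit unsigned int.

```c
/* Models what the lea leaves in eax: a 32 bit sum, not yet truncated.
   Hypothetical helper, not part of the original code. */
unsigned int fun_raw(unsigned int x)
{
    return x + 1u;
}

/* What the C source promises: the sum converted to unsigned short. */
unsigned short fun(unsigned short x)
{
    return (unsigned short)(x + 1);
}

/* The caller widens the result; this is where the movzwl shows up. */
unsigned int fun2(unsigned short x)
{
    return fun(x);
}
```

For an input of 0xFFFF the raw sum is 0x10000, but anything that reads the result as 16 bits, or widens it as fun2 does, sees 0.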
Note that for a sufficiently sized project, it is easy to find missed optimization opportunities. There is no reason to expect an optimizer to be perfect (it isn't and can't be). It just needs to be more efficient, on average, than a human doing it by hand for a project of that size.
This is why it is commonly said that for performance tuning you don't pre-optimize or jump to asm immediately. You use the high level language and the compiler, profile your way through to find the performance problems, then hand code those. Why hand code them? Because we know we can at times outperform the compiler, which implies the compiler output can be improved upon.
This isn't a missed optimization opportunity; it is instead a very elegant solution for the instruction set. Masking a byte is simpler:
unsigned char fun ( unsigned char x )
{
return((x<<4)|(x>>4));
}
00000000 <fun>:
0: e1a03220 lsr r3, r0, #4
4: e1830200 orr r0, r3, r0, lsl #4
8: e20000ff and r0, r0, #255 ; 0xff
c: e12fff1e bx lr
00000000 <fun>:
0: e1a03220 lsr r3, r0, #4
4: e1830200 orr r0, r3, r0, lsl #4
8: e6ef0070 uxtb r0, r0
c: e12fff1e bx lr
The latter is armv7; with armv7 they recognized and solved these issues. You can't expect the programmer to always use natural sized variables; some feel the need to use less optimal sized variables, and sometimes you still have to mask to a certain size.
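To show why the mask is still needed for the byte case, this sketch keeps the intermediate value at register width so the stray bits are visible. The helper names are hypothetical, and it assumes an 8 bit unsigned char.

```c
/* Same expression as fun, but kept at register width so the
   bits above bit 7 stay visible. Hypothetical helper name. */
unsigned int nibble_swap_raw(unsigned char x)
{
    return ((unsigned int)x << 4) | ((unsigned int)x >> 4);
}

/* The and #255 / uxtb step: throw away everything above bit 7. */
unsigned char nibble_swap(unsigned char x)
{
    return (unsigned char)(nibble_swap_raw(x) & 0xFFu);
}
```

For 0xAB the raw value is 0xABA, and the mask brings it back to 0xBA. Because 0xFF fits an ARM immediate, that mask is a single instruction, unlike the 16 bit case.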