10

When ARM gcc 9.2.1 is given command line options -O3 -xc++ -mcpu=cortex-m0 [compile as C++] and the following code:

unsigned short adjust(unsigned short *p)
{
    unsigned short temp = *p;
    temp -= temp>>15;
    return temp;
}

it produces the reasonable machine code:

    ldrh    r0, [r0]
    lsrs    r3, r0, #15
    subs    r0, r0, r3
    uxth    r0, r0
    bx      lr

which is equivalent to:

unsigned short adjust(unsigned short *p)
{
    unsigned r0,r3;
    r0 = *p;
    r3 = temp >> 15;
    r0 -= r3;
    r0 &= 0xFFFFu;   // Returning an unsigned short requires...
    return r0;       //  computing a 32-bit unsigned value 0-65535.
}

Very reasonable. The last "uxtw" could actually be omitted in this particular case, but it's better for a compiler that can't prove the safety of such optimizations to err on the side of caution than risk returning a value outside the range 0-65535, which could totally sink downstream code.

When using -O3 -xc -mcpu=cortex-m0 [identical options, except compiling as C rather than C++], however, the code changes:

    ldrh    r3, [r0]
    movs    r2, #0
    ldrsh   r0, [r0, r2]
    asrs    r0, r0, #15
    adds    r0, r0, r3
    uxth    r0, r0
    bx      lr

unsigned short adjust(unsigned short *p)
{
    unsigned r0,r2,r3;
    r3 = *p;
    r2 = 0;
    r0 = ((unsigned short*)p)[r2];
    r0 = ((int)r0) >> 15;  // Effectively computes -((*p)>>15) with redundant load
    r0 += r3
    r0 &= 0xFFFFu;     // Returning an unsigned short requires...
    return temp;       //  computing a 32-bit unsigned value 0-65535.
}

I know that the defined corner cases for left-shift are different in C and C++, but I thought right shifts were the same. Is there something different about the way right-shifts work in C and C++ that would cause the compiler to use different code to process them? Versions prior to 9.2.1 generate slightly less bad code in C mode:

    ldrh    r3, [r0]
    sxth    r0, r3
    asrs    r0, r0, #15
    adds    r0, r0, r3
    uxth    r0, r0
    bx      lr

equivalent to:

unsigned short adjust(unsigned short *p)
{
    unsigned r0,r3;
    r3 = *p;
    r0 = (short)r3;
    r0 = ((int)r0) >> 15; // Effectively computes -(temp>>15)
    r0 += r3
    r0 &= 0xFFFFu;     // Returning an unsigned short requires...
    return temp;       //  computing a 32-bit unsigned value 0-65535.
}

Not as bad as the 9.2.1 version, but still an instruction longer than a straightforward translation of the code would have been. When using 9.2.1, declaring the argument as unsigned short volatile *p would eliminate the redundant load of p, but I'm curious why gcc 9.2.1 would need a volatile qualifier to help it avoid the redundant load, or why such a bizarre "optimization" only happens in C mode and not C++ mode. I'm also somewhat curious why gcc would even consider adding ((short)temp) >> 15 instead of subtracting temp >> 15. Is there some stage in the optimization where that would seem to make sense?

supercat
  • 77,689
  • 9
  • 166
  • 211
  • I find it really odd that the C code is adding the shifted value rather than subtracting it as the source intends. – Mark Ransom Jun 19 '20 at 15:40
  • @MarkRansom: The way the shift is performed in the optimized version yields 0 or -1 instead of 0 or 1, so changing the subtract to an add is necessary for correctness. Unlike most of the gcc quirks I find, this one merely produces code that is slower than a straightforward translation, instead of yielding broken code. Though adding a redundant load of `*p` introduces needless opportunities for things to go wrong, there are some cases where, if redundant loads are semantically acceptable, they can improve code efficiency. This doesn't seem like one of them, though. – supercat Jun 19 '20 at 16:03
  • My brain must be unusually slow this morning, it didn't occur to me that shifting a 16-bit quantity by 15 bits would result in only two outcomes. – Mark Ransom Jun 19 '20 at 16:18
  • @supercat: Not entirely sure on the why, but if you change `temp` to `unsigned int` the compiler produces the same "reasonable machine code." – clyne Jun 19 '20 at 16:47
  • @clyne: Certainly there are alternative ways of writing the function that will yield better machine code, but what is being asked of the compiler here is hardly difficult. If an "optimization" would make a function bigger and slower than a straightforward translation of the code, don't try to apply it. – supercat Jun 19 '20 at 17:09
  • @supercat: Looking at this [in Compiler Explorer](https://godbolt.org/z/FeNER7) using the Tree/RTL Viewer, it appears that under C++ `temp` is promoted to an `int` for the right-shift, while under C `temp` is only promoted to a `signed short`. Perhaps the C-compiled version fails to optimize as nicely because of this (could there be differences in integer promotion/conversion between C and C++)? – clyne Jun 19 '20 at 21:03
  • @clyne: Interesting. The promotion rules of C specify that the promotion is to `int`, but gcc seems to think this is a transformation (use `signed short` instead of `int`, and invert the result) that should be applied at the tree stage; is gcc even thinking about the target processor at that point? – supercat Jun 19 '20 at 21:18
  • @clyne: Note that if the shift amount is changed to 14, the substitution no longer occurs. I'm not sure whether I should be impressed or horrified at the number of passes gcc's optimizer seems to be doing, but it's certainly interesting to take a piece of source which gcc processes in broken fashion and see that it's processed correctly up to a certain layer, whereupon the wheels fall off. – supercat Jun 19 '20 at 21:36
  • 1
    You might try some of the [`-fopt-info` options](https://gcc.gnu.org/onlinedocs/gcc/Developer-Options.html) to get a more in-depth look at what exactly the optimizer is doing instead of speculating. – chris Jun 19 '20 at 21:50
  • @chris: I just tried that and didn't get much useful for this example. Using `-fopt-info-all-internals` with a snippet where gcc makes a totally bogus aliasing "optimizaton" [ignoring the fact two consecutive accesses to an object using pointers *of the same type*, that were observed to be equal, might be affecting the same object] yielded an interesting notification that it was "skipping" the optimization, despite the fact that it ended up being erroneously applied anyway. – supercat Jun 19 '20 at 22:12
  • @clyne: You might want to write up your observation about the tree as an answer. – supercat Jun 19 '20 at 22:16
  • Is there any incorrect observable behaviour, or is this a performance question? – M.M Jun 20 '20 at 02:12
  • @M.M: It's mostly a performance issue, but it makes me question whether gcc is doing regression testing on both C and C++ modes. Further, code which expects that most implementations would extend the language by making "ordinary" accesses of things that might externally change will at worst either have reads yield stale values or have writes get deferred (which might sometimes be tolerable) could instead yield behavior inconsistent with any value that could have been observed. – supercat Jun 20 '20 at 03:09

1 Answers1

3

The difference appears to be due to a difference in integral promotion of temp between GCC's C and C++ compilation modes.

Using the "Tree/RTL Viewer" on Compiler Explorer, one can observe that when the code is compiled as C++, GCC promotes temp to an int for the right-shift operation. However, when compiled as C temp is only promoted to a signed short (On godbolt):

GCC tree with -xc++:

{
  short unsigned int temp = *p;

  # DEBUG BEGIN STMT;
    short unsigned int temp = *p;
  # DEBUG BEGIN STMT;
  <<cleanup_point <<< Unknown tree: expr_stmt
  (void) (temp = temp - (short unsigned int) ((int) temp >> 15)) >>>>>;
  # DEBUG BEGIN STMT;
  return <retval> = temp;
}

with -xc:

{
  short unsigned int temp = *p;

  # DEBUG BEGIN STMT;
    short unsigned int temp = *p;
  # DEBUG BEGIN STMT;
  temp = (short unsigned int) ((signed short) temp >> 15) + temp;
  # DEBUG BEGIN STMT;
  return temp;
}

The cast to signed short is only made explicit when shifting temp by one bit less than its 16-bit size; when shifting by less than 15 bits, the cast disappears and the code compiles to match the "reasonable" instructions -xc++ produced. The unexpected behavior also occurs when using unsigned chars and shifting by 7 bits.

Interestingly, armv7-a clang does not produce the same behavior; both -xc and -xc++ produce a "reasonable" result:

    ldrh    r0, [r0]
    sxth    r0, r0
    lsrs    r1, r0, #15
    adds    r0, r1, r0
    uxth    r0, r0
    bx      lr

Update: So it seems this "optimization" is due to either the literal 15, or to the use of subtraction (or unary -) with the right-shift:

  • Placing the literal 15 in an unsigned short variable causes both -xc and -xc++ to produce reasonable instructions.
  • Replacing temp>>15 with temp/(1<<15) also causes both options to produce reasonable instructions.
  • Changing the shift to temp>>(-65521) causes both options to produce the longer arithmetic-shift version, with -xc++ also casting temp to signed short within the shift.
  • Moving the negative away from the shift operation (temp = -temp + temp>>15; return -temp;) causes both options to produce reasonable instructions.

See examples these on Godbolt. I would agree with @supercat that this may just be an odd case of the as-if rule. The takeaways I see from this are to either avoid unsigned subtraction with non-constants, or per this SO post about int promotion, maybe don't try to force the arithmetic into smaller-than-int storage types.

clyne
  • 682
  • 4
  • 11
  • Is this a bug (of the optimizer) of GCC? I see nothing in the [C standard](http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1548.pdf) allowing a cast to `(signed short)`. Note that the same problem seems to occur when `15` is also explicitly casted to an unsigned short. – Jérôme Richard Jun 20 '20 at 07:56
  • @JérômeRichard: The substitution is replacing `(temp>>15)` with `-((signed short)(temp >> 15))` [note the negative sign]. This substitution is allowed under the as-if rule in C, and probably would be in C++ as well, but the need to accommodate operator overloading may make it more difficult ti apply the same logic there. – supercat Jun 20 '20 at 12:32
  • I just discovered something else interesting: The x86-64 version of gcc 10.1, even at -O0, converts the operation into a subtract of an arithmetic-right-shifted value regardless of whether it's in C or C++ mode, but the previous version, 9.3, only makes the change when in C mode. Which still leaves me puzzled as to why the transform is being applied in the first place at that point in the compilation, since it would strike me as sometimes degrading performance, and not being likely enough to improve performance as to be not worth the trouble implementing it. – supercat Jun 20 '20 at 17:33
  • Perhaps there are some platforms where turning the MSB of a number into a 0 or 1 would be faster than doing a right shift, but such a transformation should be applied as part of platform-specific optimization. – supercat Jun 20 '20 at 17:35