Why does this code break when -O2 or higher is enabled?

Question

I tried to fit an implementation of NSA's SPECK in a 8-bit PIC microcontroller. The free version of their compiler (based on CLANG) won't enable optimizations so I ran out of memory. I tried the "trial" version that enables -O2, -O3 and -Os (optimize for size). With -Os It managed to fit my code in the 2K program memory space.

Here's the code:

#include <stdint.h>
#include <string.h>

#define ROR(x, r) ((x >> r) | (x << (32 - r)))
#define ROL(x, r) ((x << r) | (x >> (32 - r)))
#define R(x, y, k) (x = ROR(x, 8), x += y, x ^= k, y = ROL(y, 3), y ^= x)
#define ROUNDS 27

void encrypt_block(uint32_t ct[2],
        uint32_t const pt[2],
        uint32_t const K[4]) {
    uint32_t x = pt[0], y = pt[1];
    uint32_t a = K[0], b = K[1], c = K[2], d = K[3];

    R(y, x, a);
    for (int i = 0; i < ROUNDS - 3; i += 3) {
        R(b, a, i);
        R(y, x, a);
        R(c, a, i + 1);
        R(y, x, a);
        R(d, a, i + 2);
        R(y, x, a);
    }
    R(b, a, ROUNDS - 3);
    R(y, x, a);
    R(c, a, ROUNDS - 2);
    R(y, x, a);

    ct[0] = x;
    ct[1] = y;
}

Unfortunately, when debugging it line by line, comparing it to the test vectors in the implementation guide, from page 32, "15 SPECK64/128 Test Vectors", the results difer from the expected results.

Here's a way to call this function:

uint32_t out[2];
uint32_t in[] = { 0x7475432d, 0x3b726574 };
uint32_t key[] = { 0x3020100, 0xb0a0908, 0x13121110, 0x1b1a1918 };

encrypt_block(out, in, key);

assert(out[0] == 0x454e028b);
assert(out[1] == 0x8c6fa548);

The expected value for "out", according to the guide, should be 0x454e028b, 0x8c6fa548. The result I'm getting with -O2 is 0x8FA3FED7 0x53D8CEA8. With -O1, I get 0x454e028b, 0x8c6fa548, which is the correct result.

Step Debugging

The implentation guide includes all the intermediate key schedule other values, so I stepped through the code line by line, comparing the results to the guide.

The expected results for "x" are: 03020100, 131d0309, bbd80d53, 0d334df3. I start step debugging, but when reaching the 4th result, 0d334df3, the debugger window shows 0d334df0 instead. By the next round, the expected 7fa43565 value is 7FA43578 and only gets worse with every iteration.

This only happens when -O2 or greater is enabled. With no optimizations, or with -O1, the code works as expected.

TL;DR: undefined behaviour in your code. start by protecting your macros arguments by parentheses. when using `i + 2` not sure it does what you want. — Jean-François Fabre, Mar 08 '19 at 18:44
I see the problem. When OP posts the inputs used and a [MCVE], it will be obvious. — chux - Reinstate Monica, Mar 08 '19 at 19:02
@chux is there other than the "macro argument not parenthesized"? — Antti Haapala -- Слава Україні, Mar 08 '19 at 19:04
@AnttiHaapala "macro argument parenthesized" would be better form, yet I think OP escaped that pitfall luckily here. Yet without `[MCVE]` OP is making this harder than it needs to be. — chux - Reinstate Monica, Mar 08 '19 at 19:07
because the only argument which has 2 terms is used for the eoring `x ^= k` so it works. But leave the parentheses from now on. — Jean-François Fabre, Mar 08 '19 at 19:09
but but, your inputs are `char` pointers and your function accepts pointers on integers. Don't you have warnings? it's an alignment issue. — Jean-François Fabre, Mar 08 '19 at 19:11
and endianness :) compilers tend to move/align variables according to optimization levels. I'm not going to answer this to get "UB points" as Antti likes to call them :) — Jean-François Fabre, Mar 08 '19 at 19:12
casting to uint32_t* suppresses the errors but the results are the same. Endianness is not an issue here, because I manually checked the endianness, and if it was actually an endianness issue, it would also fail with -O0 or -O1 — hjf, Mar 08 '19 at 19:14
it's not an issue with endianness, but don't write code like that as it is not portable on other processors with different endianness — Jean-François Fabre, Mar 08 '19 at 19:15
casting to `uint32_t *` does not fix the error. In any case, I get it behave in "expected" manner with *that* hideous call without casts and anything with -O3 on my x86-64... `-fsanitize=undefined`, valgrind, etc... result `8b 02 4e 45 48 a5 6f 8c` — Antti Haapala -- Слава Україні, Mar 08 '19 at 19:16
@Jean-FrançoisFabre I'm working with 256 BYTES of RAM. Not 256K, but 2^8 bytes. For my whole program. Certain concessions need to be made. — hjf, Mar 08 '19 at 19:17
try with little-endian `uint32_t`s: `0x7475432d, 0x3b726574`; `0x3020100, 0xb0a0908, 0x13121110, 0x1b1a1918`; the result should be `0x454e028b, 0x8c6fa548` — Antti Haapala -- Слава Україні, Mar 08 '19 at 19:26
you just have to read the comments properly to know what the issue is: don't declare your data as pointers on `char`. The compiler will assume that it can put them in unaligned memory locations. And please be nice. There's no elitism here. — Jean-François Fabre, Mar 08 '19 at 19:26
`uint32_t out[2]; uint32_t in[] = { 0x7475432d, 0x3b726574 }; uint32_t key[] = { 0x3020100, 0xb0a0908, 0x13121110, 0x1b1a1918 };` — Antti Haapala -- Слава Україні, Mar 08 '19 at 19:32
@AnttiHaapala with your suggestions, the results are the same as mine. Fine with -O1, but breaks with -Os. The results are the same. The bad result with uint32_t is 0x8FA3FED7 0x53D8CEA8 and with a char array is 0xD7 0xFE 0xA3 0x8F 0xA8 0xCE 0xD8 0x53. At least the results are consistent... — hjf, Mar 08 '19 at 19:39
Minor: use `unsigned i` to quiet a distracting warning in `for (int i = 0; i < ROUNDS - 3; i += 3) R(b, a, i); ....` as the `R()` then mixes sign-ness. — chux - Reinstate Monica, Mar 08 '19 at 19:42
To rule out a bug I think my only other option is to find their previous version of the compiler. This is 2.0.5, based on Clang. The 1.45 version was a different compiler. Unfortunately the company has pulled 1.45 from their website. — hjf, Mar 08 '19 at 19:49
It does not appear `"speck.h"` is needed here. If it has something _interesting_, let us see it, else better to remove it. — chux - Reinstate Monica, Mar 08 '19 at 19:52
@chux, you mean the rewrite of R? I just tried it. Same wrong result with -Os. I'm really starting to think this is a compiler bug. — hjf, Mar 08 '19 at 19:54
I am leaning towards a compiler bug here if the code as it is now in the question with all `uint32_t`s etc still produce the same behaviour. No other plausible explanation comes to mind... — Antti Haapala -- Слава Україні, Mar 08 '19 at 19:55
UB can occur from _other_ un-posted code. We do not see how OP fed `encrypt_block(out, in, key);` nor reported its output. Hopefully the missing code is benign, yet it is remains unknown. — chux - Reinstate Monica, Mar 08 '19 at 20:04
IIUC, the output comes from the debugger. But then again, the *debugger might be wrong*. — Antti Haapala -- Слава Україні, Mar 08 '19 at 20:05
@hjf can you do asserts in the controller? Or sth like set a variable for `int flag = (out[0] == 0x454e028b && out[1] == 0x8c6fa548)` and act on `flag` to ensure that it is not the debugger that's the issue? — Antti Haapala -- Слава Україні, Mar 08 '19 at 20:07
@AnttiHaapala Agreed, With optimizations, results needs to be truly _used_ as output, else it may be optimized away and the debugger can report garbage - perhaps consistent garbage. — chux - Reinstate Monica, Mar 08 '19 at 20:08
Updates: I managed to find the previous version of the compiler... deeplinked to the manufacturer's website from a dodgy chinese site. Anyways, same error. I also did volatile int flag = (out[0] == 0x454e028b && out[1] == 0x8c6fa548) (the compiler never optimizes volatile). The result was zero. Also volatile uint32_t r1=out[0]; volatile uint32_t r2=out[1];. r1 and r2 are the usual "right with O1, wrong with O3". — hjf, Mar 08 '19 at 20:14
@hjf Re: "I'm really starting to think this is a compiler bug" Could be. Yet UB cased by the calling code remains a possibility. For the record, what is `int/unsigned` bit size here? `UINT_MAX`? — chux - Reinstate Monica, Mar 08 '19 at 20:16
@chux how could be the reason here? 27 should fit into an `int` and it should be correctly promoted anyway.. — Antti Haapala -- Слава Україні, Mar 08 '19 at 20:18
@chux see here for the compiler manual on int types https://i.imgur.com/2MsC05s.png — hjf, Mar 08 '19 at 20:20
@AnttiHaapala The reason for `int/unsigned` inquiry is not due to the `for()` loop. As `unsigned` is 16-bit (much code is faulty with 16-bit) and prior code was loose on pointer usage, these 2 hint to calling code woes. Also I wanted to review this code with non-32-bit `int` in mind if warranted. — chux - Reinstate Monica, Mar 08 '19 at 20:24
@hjf Since, under some conditions, code get the expected result, try `#define ROUNDS 0`, `#define ROUNDS 3` and/or maybe `uint32_t in[] = { 0 }; uint32_t key[] = { 0} ` to close in on the culprit. Good luck! — chux - Reinstate Monica, Mar 08 '19 at 20:33
Not related to the bug, but your size problem could probably be fixed by getting rid of the awful manual unrolling and writing the loop directly. It would also deobfuscate it quite a bit. — R.. GitHub STOP HELPING ICE, Mar 08 '19 at 23:42
Please [edit] your question to include disassembly of the compiler output for this function at both -O1 and -O2 optimization settings. Looking at what machine code the compiler is actually generating is by far the easiest way to figure out where the problem is in your C code. I'd obtain the disassembly myself, but this is a proprietary compiler that you're using. — Cody Gray - on strike, Mar 09 '19 at 02:28
Update: I have asked the question in the manufacturer's website. Other people have reproduced the issue and confirmed it is actually a bug. It happens with PIC16 parts but not with PIC18. As a workaround, I replaced the macro with an actual function, and did the (x >> r) | (x << (32 - r)) operation in two steps: x = x >> r; x |= x<<(32-r); This seems to work. — hjf, Mar 11 '19 at 20:32

score 4 · Accepted Answer · answered Mar 11 '19 at 20:41

It was a bug in the compiler.

I posted the question in the manufacturer's forum. Other people have indeed reproduced the issue, which happens when compiling for certain parts. Other parts are unaffected.

As a workaround, I changed the macros into real functions, and split the operation in two lines:

uint32_t ROL(uint32_t x, uint8_t r) {
    uint32_t intermedio;
    intermedio = x << r;
    intermedio |= x >> (32 - r);
    return intermedio;
}

This gives the correct result.

score 2 · Answer 2 · answered Mar 08 '19 at 20:00

Posting compilable test code as a reference.

#include <stdint.h>
#include <string.h>
//#include "speck.h"

#define ROR(x, r) ((x >> r) | (x << (32 - r)))
#define ROL(x, r) ((x << r) | (x >> (32 - r)))
#define R(x, y, k) (x = ROR(x, 8), x += y, x ^= k, y = ROL(y, 3), y ^= x)
#define ROUNDS 27

void encrypt_block(uint32_t ct[2], uint32_t const pt[2], uint32_t const K[4]) {
  uint32_t x = pt[0], y = pt[1];
  uint32_t a = K[0], b = K[1], c = K[2], d = K[3];

  R(y, x, a);
  // for (int i = 0; i < ROUNDS - 3; i += 3) {
  for (uint32_t i = 0; i < ROUNDS - 3; i += 3) {
    R(b, a, i);
    R(y, x, a);
    R(c, a, i + 1);
    R(y, x, a);
    R(d, a, i + 2);
    R(y, x, a);
  }
  R(b, a, ROUNDS - 3);
  R(y, x, a);
  R(c, a, ROUNDS - 2);
  R(y, x, a);

  ct[0] = x;
  ct[1] = y;
}

int main(void) {
  uint32_t out[2];
  uint32_t in[] = {0x7475432d, 0x3b726574};
  uint32_t key[] = {0x03020100, 0x0b0a0908, 0x13121110, 0x1b1a1918};
  encrypt_block(out, in, key);

  printf("%8lx %8lx\n", (unsigned long) out[0], 0x454e028bLU);
  printf("%8lx %8lx\n", (unsigned long) out[1], 0x8c6fa548LU);
}

Output

454e028b 454e028b
8c6fa548 8c6fa548

Unexpected output

0x8FA3FED7
0x53D8CEA8

score 1 · Answer 3 · answered Mar 08 '19 at 23:37

I don't see any indication of undefined behavior in your code, unless it's in some aspect of the setup/call point that you haven't shown. As such, it should be impossible for behavior to differ according to optimization level. Normally I would not be quick to blame compiler bugs for something like this, but forks of FOSS compilers for embedded stuff, and especially forks that redefine int to 16-bit in a compiler that wasn't designed to target 16-bit int, and especially proprietary forks where their code is so bad that they don't even want to let you see it, a compiler bug is very very likely.

Why does this code break when -O2 or higher is enabled?

3 Answers3