
Environment: GCC 4.7.3 (arm-none-eabi-gcc) for ARM Cortex-M4F. Bare-metal (actually MQX RTOS, but that's irrelevant here). The CPU is in Thumb state.

Here's a disassembler listing of some code I'm looking at:

//.label flash_command
// ...
while(!(FTFE_FSTAT & FTFE_FSTAT_CCIF_MASK)) {}
// Compiles to:
12: bf00        nop
14: f04f 0300   mov.w   r3, #0
18: f2c4 0302   movt    r3, #16386  ; 0x4002
1c: 781b        ldrb    r3, [r3, #0]
1e: b2db        uxtb    r3, r3
20: b2db        uxtb    r3, r3
22: b25b        sxtb    r3, r3
24: 2b00        cmp     r3, #0
26: daf5        bge.n   14 <flash_command+0x14>

The constants (after expanding macros, etc.) are:

address of FTFE_FSTAT is 0x40020000u
FTFE_FSTAT_CCIF_MASK is 0x80u

This is compiled with NO optimization (-O0), so GCC shouldn't be doing anything fancy... and yet, I don't get this code. Post-answer edit: Never assume this. My problem was getting a false sense of security from turning off optimization.

I've read that "uxtb r3,r3" is a common way of truncating a 32-bit value. Why would you want to truncate it twice and then sign-extend? And how in the world is this equivalent to the bit-masking operation in the C code?

What am I missing here?

Edit: Types of the thing involved: So the actual macro expansion of FTFE_FSTAT comes down to

((((FTFE_MemMapPtr)0x40020000u))->FSTAT)

where the struct is defined as

/** FTFE - Peripheral register structure */
typedef struct FTFE_MemMap {
    uint8_t FSTAT; /**< Flash Status Register, offset: 0x0 */
    uint8_t FCNFG; /**< Flash Configuration Register, offset: 0x1 */
    //... a bunch of other uint8_t fields
} volatile *FTFE_MemMapPtr;
Dmitri
  • Don't disassemble your code, just use `-S` to produce the assembler directly. If you are lucky it may even be a bit more comprehensive. – Jens Gustedt Dec 09 '15 at 21:49
  • I think you're missing a `*` somewhere. Otherwise, your code is just `while(1)`. – user3386109 Dec 09 '15 at 21:54
  • @JensGustedt Thanks for the tip. Tried it, but the compile-time assembly listing says the same exact thing. So does reading back the memory directly through the IDE's built-in disassembler. :/ – Dmitri Dec 09 '15 at 22:02
  • @user3386109 Does look like an infinite loop. And yet it seems to run as intended in the C code (this is a write-flash function and the flash does get written without getting stuck in an infinite loop waiting for the write-finished flag). :O – Dmitri Dec 09 '15 at 22:04
  • I am missing the C code. – Jongware Dec 09 '15 at 22:06
  • the C code is "while(!(FTFE_FSTAT & FTFE_FSTAT_CCIF_MASK)) {}". That is all. – Dmitri Dec 09 '15 at 22:07
  • Either that's not the actual `while` statement, or those aren't the actual `#define`s. – user3386109 Dec 09 '15 at 22:08
  • sorry. those are not actual defines. edited to clarify. the mask is an actual define, but FTFE_FSTAT is a control register and goes through a bunch of macros to get to a pointer dereference (as is customary in MCUs). so 14: and 18: are loading the address of that register into r3 and the ldrb is dereferencing it and loading its contents into r3. that part is straightforward and working as expected. what is happening after that I have no clue. – Dmitri Dec 09 '15 at 22:11
  • What are the _types_ of the things involved? Are there any other expressions hidden in these unknown macros? It's very hard to reason about code generation without having the same information that the compiler had at the time... – Notlikethat Dec 09 '15 at 22:14
  • The behaviour you've noticed is expected. Without optimization enabled GCC generates bad code that does a lot of unnecessary things. The assembly code is correct. It will loop so long as the eighth bit (counting from one) of the byte at 0x40020000 is not set. – Ross Ridge Dec 09 '15 at 22:20
  • @RossRidge What indicates that it checks the 0x80 bit specifically? That's the big thing I am not seeing here. As far as inefficiency, you are right that is to be expected, so it should not be surprising if some of those uxtb/sxtb don't actually do anything. – Dmitri Dec 09 '15 at 22:24
  • The sign extension does something. It's the zero extensions that are unnecessary as their effect is undone by the sign extension. – Ross Ridge Dec 09 '15 at 22:28
  • @RossRidge I was starting to go in the right direction based on your hint and then I saw user3386109's answer. Thank you! Case closed. – Dmitri Dec 09 '15 at 22:33
  • Two lessons I learned here: 1. The compiler need not be straightforward about anything even with -O0. 2. Code generated with -O0 is a lot worse than I would have guessed. – Dmitri Dec 09 '15 at 22:38
  • Yup, unoptimized code can be pretty bad. The compiler typically wastes a lot of time loading/storing variables on the stack, and it finds other ways to waste time too :) – user3386109 Dec 09 '15 at 22:46
  • If you want to look at code that isn't horrible, and mostly does what the source code says (without aggressive loop transformations or optimizing stuff away), use `-Og` (optimize for debugging). `-O0` spills everything to memory after every C statement. (I guess for the benefit of stupid debuggers / debug-info formats that can't find variable values in registers) – Peter Cordes Dec 10 '15 at 21:20

1 Answer

The two uxtb instructions are the compiler being stupid; they should be optimized out if you turn on optimization. The sxtb is the compiler being brilliant, using a trick that you wouldn't expect in unoptimized code.

The first uxtb is due to the fact that you loaded a byte from memory. The compiler is zeroing the other 24 bits of register r3, so that the byte value fills the entire register.

The second uxtb is due to the fact that you're ANDing with an 8-bit value. The compiler realizes that the upper 24 bits of the result will always be zero, so it's using uxtb to clear the upper 24 bits.

Neither of the uxtb instructions does anything useful, because the sxtb instruction overwrites the upper 24 bits of r3 anyway. The optimizer should realize that and remove them when you compile with optimizations enabled.

The sxtb instruction takes the one bit you care about, 0x80, and moves it into the sign bit of register r3. That way, if bit 0x80 is set, then r3 becomes a negative number. So now the compiler can compare with 0 to determine whether the bit was set. If the bit was not set, then the bge instruction branches back to the top of the while loop.

user3386109
  • AAH! So when the condition is true the number in r3 is negative and therefore less than zero! That's genius! Thank you, I can sleep tonight! – Dmitri Dec 09 '15 at 22:31
  • It's perhaps noteworthy that a `uxtb` for the load is trebly stupid since `ldrb` already zero-extends the value, and that it makes more sense when you know `ge` ("greater than or equal") is specifically a _signed_ interpretation of the flags (as opposed to `hs`, which is _unsigned_ "higher or same"). – Notlikethat Dec 09 '15 at 22:32
  • gcc does pretty much all of its optimization on an intermediate representation of what the code does, not on a sequence of anything like target-machine instructions (ARM in this case). As http://stackoverflow.com/a/33284629/224132 explains, producing an executable from its internal representation after parsing requires some transformations of the internal representation. gcc doesn't have a mode that tries to literally translate C to ARM. So you get some "clever" stuff even at `-O0`, because that's the only way gcc knows how to do some things. It doesn't have the dumb way programmed in. – Peter Cordes Dec 10 '15 at 21:28