Why O2 optimization level breaks struct array initialization in 32-bit bare metal environment

Question

I was writing a function which initializes an array of two structs. This function works perfectly with O0 and O1 optimization levels, but it breaks with O2, causing an Invalid opcode exception on the line I indicated with an arrow.

This code runs inside QEMU in a 32-bit bare metal setup. I can't understand why it crashes after executing the movdqa instruction. What could be causing this error? Could it be a compiler error? Or are we maybe missing something (e.g. configuring the FPU or something like that)?

C code

OSS: This is not the real code, but only a reproducible example which compiles to the same assembly.

/**
 * gcc version 12.2
 * CFLAGS: -ffreestanding -m32 -O2 -std=gnu99 -Wall 
 * -Wextra -Werror -fno-pie -fno-stack-protector -g
 */

#include <stdint.h>
#include <string.h>

#define BUTOS_START         0
#define BUTOS_SIZE          0x25

#define FSTAB_SECTOR        0x24
#define FILESYSTEM_START    0x25
#define FILESYSTEM_SIZE     0x10

#define FS_BLOCK(START,     \
                 SIZE)      (struct fs_block){.start=START, .size=SIZE}

struct fs_block
{
    uint32_t start; // LBA
    uint32_t size;  // Size in sectors
};

void fstab_write(uint8_t *buff_sector)
{
    struct fs_block fstab[] = {
        FS_BLOCK(BUTOS_START, BUTOS_SIZE),
        FS_BLOCK(FILESYSTEM_START, FILESYSTEM_SIZE)
    };

    memcpy(buff_sector, fstab, sizeof(fstab));
}

Assembly output

O1:

fstab_write:
        sub     esp, 32
        mov     DWORD PTR [esp+4], 0
        mov     DWORD PTR [esp+8], 37
        mov     DWORD PTR [esp+12], 37
        mov     DWORD PTR [esp+16], 16
        push    16
        lea     eax, [esp+8]
        push    eax
        push    DWORD PTR [esp+44]
        call    memcpy
        add     esp, 44
        ret

O2:

fstab_write:
        sub     esp, 32
    ==> movdqa  xmm0, XMMWORD PTR .LC0  /*point of crash */
        movaps  XMMWORD PTR [esp+4], xmm0
        push    16
        lea     eax, [esp+8]
        push    eax
        push    DWORD PTR [esp+44]
        call    memcpy
        add     esp, 44
        ret
.LC0:
        .long   0
        .long   37
        .long   37
        .long   16

`movdqa` is for aligned data and I'd think it shouldn't be used to replace `memcpy()` in the C code you've posted, but as that's not the real code it's hard to tell. What happens if you align `uint8_t *buff_sector` to a 16-byte boundary? — Andrew Henle, Dec 29 '22 at 14:58
Probably because you haven't enabled SSE support and as a result #UD (invalid opcode) is raised: https://wiki.osdev.org/SSE#Adding_support — Michael Petch, Dec 29 '22 at 15:17
@AndrewHenle : It isn't replacing `memcpy` it is using SSE instructions to copy the value to the stack. The unaligned access won't generate an invalid opcode. You have to set up the processor to support SSE or you get an invalid opcode exception (#UD). As well unaligned access (with aligned instructions) only get triggered at ring 3 and only if unaligned checks are enabled in the processor. — Michael Petch, Dec 29 '22 at 15:47
If you don't want to use any SSE or AVX instructions you have the option of using `-mgeneral-regs-only` (*if* using `libgcc` then you'd have to ensure it is generated without SSE/AVX instructions as well) — Michael Petch, Dec 29 '22 at 15:52
I should point out that although I said `memcpy` isn't being replaced in this case, it could have been if the OP wasn't using `-ffreestanding`. Without freestanding the compiler could have generated a `movdqa` and `movups` instruction to perform the copy. — Michael Petch, Dec 29 '22 at 16:04
@MichaelPetch That makes sense, but the question does currently state, "**C code** OSS: This is not the real code, but only a reproducible example which compiles to the same assembly." That means `memcpy()` would have to have been replaced with the posted assembly. Which might happen if the compiler assumes all your other conditions for the use of `movdqa` will be met at run time. Which does not appear to be a safe assumption. — Andrew Henle, Dec 29 '22 at 17:20
@AndrewHenle : Alignment Check Error (#AC) is not being raised. Invalid Opcode is (#UD). It is not an alignment error at this point (In theory it could be but that's not the fault he's dealing with) — Michael Petch, Dec 29 '22 at 17:36
Regarding alignment: What the OP isn't showing is that GCC 12.2 generates an `.align 16` directive just before label `.LC0`, so that will be properly aligned for the `movdqa` instruction. The code being shown isn't replacing `memcpy`; it is setting up the stack so it can `call memcpy` near the bottom. It hasn't even got that far. Another assumption (for alignment purposes) the compiler is making is that the stack (per the ABI) was aligned on a 16-byte boundary before the function call. If it was, then ESP+4 in the `movaps` instruction is 16-byte aligned as well. — Michael Petch, Dec 29 '22 at 17:38
Note: In cases where alignment checking isn't applicable or alignment checking isn't enabled an #AC won't be raised but a #GP fault will be. Again, #UD (invalid opcode) is saying the SSE2 instruction itself is the issue (not an alignment issue - yet), and that error will occur if the processor hasn't enabled SSE instructions. It can be enabled per these instructions: https://wiki.osdev.org/SSE#Adding_support . The OP is running on baremetal not inside Linux so they need to setup the processor correctly so that SSE instructions won't fault. — Michael Petch, Dec 29 '22 at 17:44
Another note: While the OP hasn't shown their `memcpy` implementation it is currently not relevant because it is failing in setting up the parameters to call `memcpy`. — Michael Petch, Dec 29 '22 at 17:52
OP: By any chance did you have an OS that worked previously and then when you started using GCC 12.0+ it stopped? I noticed that GCC 12.x started generating code like this where previous versions didn't. Now, unless you tell GCC not to generate SSE/SIMD instructions, it will do so for efficiency in this case. That requires your OS to set up the processor registers to handle SSE instructions. — Michael Petch, Dec 29 '22 at 19:08
@MichaelPetch first of all thanks for the answer, it really helped. Talking about the last thing you asked, I've been working on this project for quite a lot of time, but it never occurred to me that a gcc update broke my code. The previous function is indeed a new one. I think this is the first time gcc tried to optimize the code with an SSE instruction, even though I use O2 in the whole project. — Giovanni Zaccaria, Dec 30 '22 at 00:55
Just for the record, `movdqa` always `#GP` faults on unaligned access, even in ring 0, so this code is also relying on the default 16-byte stack alignment that current versions of the i386 System V ABI requires. (If SSE is enabled so it doesn't `#UD` first). Michael Petch brought up `#AC` faults a few times, and only later corrected that to `#GP` - SSE instructions with 16-byte memory operands in general won't `#AC` fault, only ones like `movd` or `pextrw` with 2, 4, or 8-byte memory. https://xem.github.io/minix86/manual/intel-x86-and-64-manual-vol1/o_7281d5ea06a5b67a-362.html quotes Intel — Peter Cordes, Dec 30 '22 at 02:18
This probably has nothing to do with your problem but I once messed up my stack in assembly and had no idea what the problem was. This problem reminds me of it. I always mix up if it needs to be 16byte aligned before or after a call. I think I needed it to be off by 8 because I was using CALL instead of JMP. I think clang let me forget about it while gcc disliked it immediately — , Dec 30 '22 at 05:27
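
For reference, the SSE setup the comments point to (https://wiki.osdev.org/SSE#Adding_support) could look roughly like the sketch below. It is only a minimal illustration, not the OP's code: it assumes GCC-style inline assembly, that the routine runs in ring 0 before the first SSE instruction, and that the CPU actually supports SSE (the CPUID check is omitted). The names `sse_cr0_value`, `sse_cr4_value`, `sse_enable` and the `KERNEL_RING0` macro are hypothetical. The bit computations are kept in plain helper functions so they can be checked on a normal host.

```c
#include <stdint.h>

/* Control-register bits involved in enabling SSE (Intel SDM names). */
#define CR0_MP          (1u << 1)   /* monitor coprocessor */
#define CR0_EM          (1u << 2)   /* x87 emulation; must be 0 for SSE */
#define CR4_OSFXSR      (1u << 9)   /* OS supports FXSAVE/FXRSTOR */
#define CR4_OSXMMEXCPT  (1u << 10)  /* OS handles SIMD FP exceptions */

/* Pure helpers: compute the new register values from the current ones. */
static inline uint32_t sse_cr0_value(uint32_t cr0)
{
    return (cr0 & ~CR0_EM) | CR0_MP;   /* clear EM, set MP */
}

static inline uint32_t sse_cr4_value(uint32_t cr4)
{
    return cr4 | CR4_OSFXSR | CR4_OSXMMEXCPT;
}

#ifdef KERNEL_RING0   /* hypothetical macro: define only in the kernel build */
/* Privileged part: must run in ring 0, 32-bit mode assumed. */
static inline void sse_enable(void)
{
    uint32_t reg;

    __asm__ volatile ("mov %%cr0, %0" : "=r"(reg));
    reg = sse_cr0_value(reg);
    __asm__ volatile ("mov %0, %%cr0" : : "r"(reg) : "memory");

    __asm__ volatile ("mov %%cr4, %0" : "=r"(reg));
    reg = sse_cr4_value(reg);
    __asm__ volatile ("mov %0, %%cr4" : : "r"(reg) : "memory");
}
#endif
```

Calling something like `sse_enable()` early in kernel initialization, before `fstab_write()` or any other O2-compiled code runs, should keep the `movdqa` from raising #UD. The alternative mentioned in the comments is to build with `-mgeneral-regs-only` so GCC does not emit SSE instructions in the first place.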
