17

I have this small snippet of code (this is a minimal working example of the problem I have):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

void xorBuffer(unsigned char* dst, unsigned char* src, int len)
{
    while (len != 0)
    {
        *dst ^= *src;
        dst++;
        src++;
        len--;
    }
}

int main()
{
    unsigned char* a = malloc(32);
    unsigned char* b = malloc(32);
    int t;

    memset(a, 0xAA, 32);
    memset(b, 0xBB, 32);

    xorBuffer(a, b, 32);

    printf("result = ");
    for (t = 0; t < 32; t++) printf("%.2x", a[t]);
    printf("\n");

    return 0;
}

This code is supposed to perform the exclusive-or of two 32-byte memory buffers (conceptually, this should do a = a ^ b). Since 0xAA ^ 0xBB = 0x11, it should print "11" thirty-two times.

My problem is, when I compile this under MinGW-GCC (Windows), this works perfectly in debug mode (no optimizations) but crashes with a SIGILL midway through the xorBuffer loop when optimizations starting from -O3 are enabled. Also, if I put a printf in the offending loop, it'll work perfectly again. I suspect stack corruption but I just don't see what I'm doing wrong here.

Trying to debug with GDB with optimizations enabled is a lost cause as all GDB shows me is "variable optimized out" for every variable (and, of course, if I try and printf a variable out, it'll suddenly work).

Does anybody know what the heck is going on here? I have spent too long dwelling on this issue, and I really need to fix it properly to move on. My guess is I am missing some fundamental C pointer knowledge, but to me the code looks correct. It could be from the buffer incrementation, but as far as I know, sizeof(unsigned char) == 1, so it should be going through each byte one by one.

For what it's worth, the code works even with optimizations on GCC on my Linux box.

So... what's the deal here? Thanks!

As requested, the assembly output of the whole program:

With -O2: clicky

With -O3: clicky

I observe this behavior on GCC 4.6.2 (running with MinGW)

Thomas
  • 3,321
  • 1
  • 21
  • 44
  • 3
    What about using `O2`? `O3` is pretty risky. SIGILL means - signal, caused by illegal instruction. Looks like a compiler bug to me. Or I'm missing something. – Kiril Kirov Aug 22 '12 at 09:56
  • 1
    This works perfect on my gcc 4.4.3. If nobody notice anything, you may try to show us the assembly code, produced by gcc, in the both cases - with `O3` and `O2` (if it works fine; otherwise - lower level of optimization) – Kiril Kirov Aug 22 '12 at 10:02
  • @KirilKirov This bug was originally discovered in a high-performance library so going down to O2 is not really the best solution for us - of course if it's a compiler bug we will drop down to O2 until a fix is issued. I will post the assembly produced with no optimizations and with -O3. – Thomas Aug 22 '12 at 10:09
  • 2
    @Thomas: Post with -O2 and with -O3. The smallest change that breaks it will be the most likely to help us find the bug quickly. – David Schwartz Aug 22 '12 at 10:09
  • @Thomas - +1 for the comment of David, because there will be too many differences between no optimizations and -O2. That's why I asked for -O2 :) – Kiril Kirov Aug 22 '12 at 10:10
  • @KirilKirov What do you mean by "risky"? I thought optimizations were guaranteed to preserve correctness, assuming the code is correct and the compiler is bug-free. Or did you mean it was more likely to reveal a compiler bug? Noted on the optimization assembly output. – Thomas Aug 22 '12 at 10:11
  • Install a signal handler with `SA_SIGINFO` and print out the `siginfo_t` parameter, in particular the signal code and whether it's `ILL_ILLOPC` etc. – ecatmur Aug 22 '12 at 10:15
  • @Thomas - as far as I know, this is guaranteed for `-O2`. This is one of the reasons, most of the widely used libraries are compiled with `-O2`, not higher. But I may be wrong, of course. – Kiril Kirov Aug 22 '12 at 10:20
  • @KirilKirov assembly output added. Ecatmur wow, heh, I will have to read up on how to do this, I've never used signal handlers before. And wow, I didn't know that about -O2, I will definitely keep that in the back of my mind. – Thomas Aug 22 '12 at 10:21
  • @Thomas - unfortunately, I'm not that good with assembly. I just asked for it as an idea and a way to see what could be wrong. Maybe someone else, who knows assembly very well will be able to help. Good luck, I'll keep following what's going on here, you got my attention :D By the way, what's your compilers version? – Kiril Kirov Aug 22 '12 at 10:23
  • @KirilKirov It's MinGW running GCC 4.6.2 (outdated, I know, but it's bundled with a portable-app - for mobility reasons) – Thomas Aug 22 '12 at 10:30
  • 5
    I'm certainly no expert with x86 assembly, but it seems the O3 version is a) unrolling the loop and b) using SIMD instructions. Are you certain about all the `-march` and/or/whatever the options are to properly specify the exact processor you're building for? Getting instructions for the wrong ISA might explain the crash ... – unwind Aug 22 '12 at 10:39
  • I would recommend to cast the malloc (unsigned char*), And i know that it is not the answer to your question, and malloc certainly returns the allocated byte count which is also a char *, but with a pedantic compilation this is considered as a warning or an error (i don't remember). – Michael Aug 22 '12 at 10:45
  • 1
    @Michael casting the return of malloc is incorrect. See http://stackoverflow.com/questions/605845/do-i-cast-the-result-of-malloc – ecatmur Aug 22 '12 at 10:48
  • 2
    @unwind I am using -march=native on a Sandy Bridge i5-2500k. It definitely looks as if GCC is producing an instruction which isn't compatible with my CPU's instruction set, but then it means it must be misdetecting my processor as I gave it native. I think you have hit the nail on the head here, since -O3 works using the -march=core2 flag. – Thomas Aug 22 '12 at 10:53
  • @ ecatmur: Thanks , i wasn't aware of it, i am writing mostly c++. – Michael Aug 22 '12 at 10:54

2 Answers2

8

From my comment:

Make sure the compiler has the right information about the target architecture. It seems, from reading the -O3 output, that the compiler is setting you up the SIMD optimization, it's in effect making the code more parallel by using vector instructions (such as movdqa). If the target processor doesn't match 100% what the compiler is emitting code for, you might end up with illegal instructions.

unwind
  • 391,730
  • 64
  • 469
  • 606
8

I am adding this as an extension of Unwind's answer (which I accept as it got me on the right track).

After sifting through the optimized code, I noticed AVX instructions. At first, I thought that it shouldn't cause an issue, considering my processor supports the AVX instruction set. However, it turns out that there are two distinct AVX versions: AVX1 and AVX2. And, while my processor only supports AVX1, gcc indiscriminately uses AVX2 opcodes as long as the processor supports any of the two versions (llvm made the same mistake, there are bug reports on that). This is, as far as I can conceive, incorrect operation, and a compiler bug.

The result is AVX2 code on an AVX1 system, which obviously leads to an illegal instruction. It explains many things, from the code not failing on inputs smaller than 32 bytes (because of the 256-bit register width), to the code working on my Linux box, which happens to be a virtual machine with CPU support limited to SSE3.

The fix is either to disable -O3 and go back to -O2, where gcc won't resort to the most hardcore SIMD instructions to optimize simple code, or to use the volatile keyword which will force it to go through the buffers byte per byte, painstakingly, like so:

*(unsigned char volatile *)dst ^= *(unsigned char volatile *)src;

This is of course very slow and probably worse than just using -O2 (ignoring whole-program repercussions), but it can be worked around by going through the buffer int by int instead and padding at the end, which is good enough in terms of speed.

Another good fix is to upgrade to a version of gcc which does not have this bug (this version may not exist yet, I have not checked).

EDIT: the ultimate fix is to throw the -mno-avx flag at GCC, thereby disabling any and all AVX opcodes, completely negating the bug with no code modifications (and can easily be removed once a patched compiler version is available).

What a perverse compiler bug.

Thomas
  • 3,321
  • 1
  • 21
  • 44