Output of __m256i changes with optimization level

Question

I have the following code, which I compiled with g++ 6.1 on Windows using the MinGW compiler.

unsigned char test_uc[8] = {0, 0xaa, 0xbb, 0xcc, 0x11, 0x22, 0x33, 0x44};

uint64_t* p64 = (uint64_t*)test_uc;
__m256i res = _mm256_cvtepu8_epi32 (_mm_cvtsi64_si128(*p64));
uint32_t* u32 = (uint32_t*)&res;
for(int i = 0; i < 8; i++)
    printf("%d.0x%x\n", i, u32[i]);

When I run the program with optimization level -O1, I get the expected output as shown below.

0.0x0
1.0xaa
2.0xbb
3.0xcc
4.0x11
5.0x22
6.0x33
7.0x44

However, if I switch to optimization level -O3, I get this strange output.

0.0x0
1.0x0
2.0x0
3.0x0
4.0x8
5.0x0
6.0x41027f
7.0x0

What is going on here?

Because of a strict aliasing violation. Compile with `-O3 -fno-strict-aliasing` and it should solve the problem. — Lundin, Dec 14 '16 at 14:50
Try with `-fno-strict-aliasing` - all that nasty type-punning may be causing problems. — Paul R, Dec 14 '16 at 14:50
Seriously? **undefined behaviour**. Something every C programmer should know about right after his very first `hello world`. If you don't know, you definitely should research it. And don't mess with compiler options here. — too honest for this site, Dec 14 '16 at 14:51
@pythonic UB is C slang for "undefined behavior", meaning that you have a bug in your program where you did things beyond what the C standard guarantees, which in turn can cause anything to happen. — Lundin, Dec 14 '16 at 14:51
@Olaf Using `-fno-strict-aliasing` is an acceptable solution. GCC has taken optimizations to the extreme in this case. More useful compilers implement a non-standard extension instead, and the C standard may be damned. Trying to write, for example, embedded systems code without violating strict aliasing is a major pain, which is why embedded compilers implement deterministic non-standard behavior. The strict aliasing rule is mostly a bug in the C standard, caused by clueless PC programmers having too much influence in the C committee. — Lundin, Dec 14 '16 at 14:57
In fact, `-fno-strict-aliasing` may be your only reasonable choice here, unless I'm missing something. "Legal" type punning using a union won't work; the optimizer is too smart for it and still elides all of the intrinsics. Basically everything else I can think of that you could do would result in an explosion of code, which is precisely what you *don't* want when you're trying to write hand-optimized code using intrinsics. @lundin I'm not sure that this is actually a duplicate. Certainly the duplicate does not contain a good answer to the question, how do I make this work! — Cody Gray - on strike, Dec 14 '16 at 15:06
@Lundin: How come I manage to write large embedded projects with gcc without needing to resort to such options? What's the problem with using a `union` here? gcc just follows the standard and targets optimal code, and it provides enough extensions where necessary to support even special constructs, e.g. compiler barriers. Before changing anything, I would inspect the machine code generated at both optimisation levels. Casting the `char []` to `uint64_t *` is also problematic because of alignment. If not UB, it can introduce a major performance penalty. — too honest for this site, Dec 14 '16 at 15:12
@CodyGray If you put it inside a union, then it satisfies the strict aliasing rule. As for whether this is a duplicate or not, I think the first step here is to explain to the OP what this is about. We don't need yet another thread explaining that. If the canonical duplicate is bad for some reason (I think not), then anyone is free to improve it. Once the OP grasps what strict aliasing is, they may of course ask a new question "how to fix strict aliasing violations in this code". — Lundin, Dec 14 '16 at 15:15
Yes, a union fixes the strict aliasing rule. **It does *not*, however, fix the problem.** The object code emitted by GCC is identical, whether you use pythonic's original code or rewrite it to use a union. As such, it does not answer the question in a useful way. But sure, fine, I guess you can always ask it as a new question. — Cody Gray - on strike, Dec 14 '16 at 15:18
@Olaf Example 1: write a generic EEPROM driver. The driver must write 32 bits to memory at a time because of hardware requirements, but it must be able to accept a pointer to any form of data. Example 2: receive data from a serial bus, in the form of `uint8_t` bytes. Now convert this data to the correct, larger types. You may not use struct or union because they have padding and the code needs to be portable. And so on. You need to do a lot of fiddling and ugly patches to sort out such situations, if you care about strict aliasing. — Lundin, Dec 14 '16 at 15:21
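
For the generic-driver case described in the comment above, one option that stays within the aliasing rules is to assemble each 32-bit word with `memcpy`. This is only a sketch: `pack_words` is a hypothetical helper, and padding a partial last word with `0xFF` is an assumption about what such a driver would want.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: copies `len` bytes of any object into 32-bit words.
// Returns the number of words produced; returns 0 if `words` is too small
// (or if `len` is 0).
std::size_t pack_words(const void* data, std::size_t len,
                       std::uint32_t* words, std::size_t max_words) {
    assert(data != nullptr && words != nullptr);
    const std::size_t needed = (len + 3) / 4;
    if (needed > max_words) return 0;

    // Reading any object through unsigned char is always permitted.
    const unsigned char* bytes = static_cast<const unsigned char*>(data);
    for (std::size_t i = 0; i < needed; ++i) {
        std::uint32_t word = 0xFFFFFFFFu;  // assumed padding for a partial last word
        const std::size_t remaining = len - i * 4;
        std::memcpy(&word, bytes + i * 4, remaining < 4 ? remaining : 4);
        words[i] = word;
    }
    return needed;
}
```

Each word keeps the source bytes in memory order, so the numeric value of a word depends on the host's endianness.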
@Lundin: The interface of such a driver is broken by design already. And typical 32 bit embedded systems support aligned accesses only, so you either pass a `uint32_t []` or (better) a `uint8_t []` and use byte-accesses. EEPROM accesses are not really time-critical. An alternative would be to leave the transfer to DMA, which might require additional measures anyway (compiler/memory barriers). For the serial transfer (very familiar to me), you use proper marshaling with shifts and leave optimisation to the compiler. ... — too honest for this site, Dec 14 '16 at 15:24
... After all, the major reason gcc exploits UB that much is because it tries to optimise aggressively. I prefer this attitude over compilers which tolerate careless programmers. And no, there is nothing ugly about my code - the opposite is true. — too honest for this site, Dec 14 '16 at 15:26
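
The "marshaling with shifts" approach mentioned above could look roughly like this. It is a sketch only: `read_u32_le` is a hypothetical name, and little-endian byte order on the wire is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: builds a uint32_t from 4 received bytes,
// assuming the least significant byte arrives first.
std::uint32_t read_u32_le(const std::uint8_t* bytes) {
    assert(bytes != nullptr);
    // Only byte-sized reads are performed, so there is no aliasing or
    // alignment concern, and the result does not depend on host endianness.
    return  static_cast<std::uint32_t>(bytes[0])
         | (static_cast<std::uint32_t>(bytes[1]) << 8)
         | (static_cast<std::uint32_t>(bytes[2]) << 16)
         | (static_cast<std::uint32_t>(bytes[3]) << 24);
}
```

For example, the bytes `{0x44, 0x33, 0x22, 0x11}` yield `0x11223344`.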
@Olaf The alignment requirements of an EEPROM/data flash memory are not necessarily the same as the CPU's! Often you might have to write a whole segment at the same time, because you had to erase it before writing to it. EEPROM writes can certainly be time critical, take for example the trip meter in your car. Anyway, the point here is that the aliasing rules and effective type were introduced in C99 because of some PC compiler vendor lobbying. The committee could have made the decision to fix known cases of undefined behavior instead of adding more, but didn't. Thus this is a C standard bug IMO. — Lundin, Dec 14 '16 at 15:30
@Lundin: Depends on how the EEPROM is connected to the CPU. If that is through the normal bus-interface, of course you have to follow the CPU restrictions. For typical (external) EEPROMs you use some buffer-register anyway, so there is no direct relation between the data in memory and the EEPROM, and you can use bitops to merge/split bytes to words. And IIRC, the aliasing was an issue with C90 anyway. I'm fine with that, and with proper marshalling my code is neither slowed down nor ugly. And C11 provides measures to support proper alignment of buffers very well. — too honest for this site, Dec 14 '16 at 15:33
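
Returning to the snippet in the question, here is one possible way to avoid both pointer casts without `-fno-strict-aliasing`. It is a sketch under stated assumptions, not a confirmed fix from the thread: `widen_u8_to_u32` is a hypothetical helper, and the code assumes an x86-64 target, GCC or Clang, and a CPU with AVX2.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <immintrin.h>

// Hypothetical helper (not from the original post): zero-extends 8 bytes
// into 8 uint32_t values without any pointer-cast type punning.
// The target attribute is a GCC/Clang extension; compiling the whole file
// with -mavx2 would work as well.
__attribute__((target("avx2")))
void widen_u8_to_u32(const unsigned char* src, std::uint32_t* dst) {
    assert(src != nullptr && dst != nullptr);

    // memcpy reads the 8 bytes as a uint64_t with no alignment or
    // aliasing requirements on src.
    std::uint64_t packed;
    std::memcpy(&packed, src, sizeof packed);

    __m256i wide = _mm256_cvtepu8_epi32(
        _mm_cvtsi64_si128(static_cast<long long>(packed)));

    // Store through the intrinsic instead of reading `wide` via a uint32_t*.
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(dst), wide);
}
```

Calling `widen_u8_to_u32(test_uc, out)` with a `uint32_t out[8]` and printing `out[i]` in the question's loop should give the same values as the `-O1` run, regardless of optimization level.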
