Bug in VC++ 14.0 (2015) compiler?

Question

I've been running into some issues that only occurred during Release x86 mode and not during Release x64 or any Debug mode. I managed to reproduce the bug using the following code:

#include <stdio.h>
#include <iostream>
using namespace std;

struct WMatrix {
    float _11, _12, _13, _14;
    float _21, _22, _23, _24;
    float _31, _32, _33, _34;
    float _41, _42, _43, _44;

    WMatrix(float f11, float f12, float f13, float f14,
            float f21, float f22, float f23, float f24,
            float f31, float f32, float f33, float f34,
            float f41, float f42, float f43, float f44) :
        _11(f11), _12(f12), _13(f13), _14(f14),
        _21(f21), _22(f22), _23(f23), _24(f24),
        _31(f31), _32(f32), _33(f33), _34(f34),
        _41(f41), _42(f42), _43(f43), _44(f44) {
    }
};

void printmtx(WMatrix m1) {
    char str[256];
    sprintf_s(str, 256, "%.3f, %.3f, %.3f, %.3f", m1._11, m1._12, m1._13, m1._14);
    cout << str << "\n";
    sprintf_s(str, 256, "%.3f, %.3f, %.3f, %.3f", m1._21, m1._22, m1._23, m1._24);
    cout << str << "\n";
    sprintf_s(str, 256, "%.3f, %.3f, %.3f, %.3f", m1._31, m1._32, m1._33, m1._34);
    cout << str << "\n";
    sprintf_s(str, 256, "%.3f, %.3f, %.3f, %.3f", m1._41, m1._42, m1._43, m1._44);
    cout << str << "\n";
}

WMatrix mul1(WMatrix m, float f) {
    WMatrix out = m;
    for (unsigned int i = 0; i < 4; i++) {
        for (unsigned int j = 0; j < 4; j++) {
            unsigned int idx = i * 4 + j; // critical code
            *(&out._11 + idx) *= f; // critical code
        }
    }
    return out;
}

WMatrix mul2(WMatrix m, float f) {
    WMatrix out = m;
    unsigned int idx2 = 0;
    for (unsigned int i = 0; i < 4; i++) {
        for (unsigned int j = 0; j < 4; j++) {
            unsigned int idx = i * 4 + j; // critical code
            bool b = idx == idx2; // critical code
            *(&out._11 + idx) *= f; // critical code
            idx2++;
        }
    }
    return out;
}


int main() {
    WMatrix m1(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16);
    WMatrix m2 = mul1(m1, 0.5f);
    WMatrix m3 = mul2(m1, 0.5f);

    printmtx(m1);
    cout << "\n";
    printmtx(m2);
    cout << "\n";
    printmtx(m3);

    int x;
    cin >> x;
}

In the above code, mul2 works, but mul1 does not. mul1 and mul2 are simply trying to iterate over the floats in the WMatrix and multiply them by f, but the way mul1 indexes (i*4+j) somehow evaluates to incorrect results. All mul2 does different is it checks the index before using it and then it works (there are many other ways of tinkering with the index to make it work). Notice if you remove the line "bool b = idx == idx2" then mul2 also breaks...

Here is the output:

1.000, 2.000, 3.000, 4.000
5.000, 6.000, 7.000, 8.000
9.000, 10.000, 11.000, 12.000
13.000, 14.000, 15.000, 16.000

0.500, 0.500, 0.375, 0.250
0.625, 1.500, 3.500, 8.000
9.000, 10.000, 11.000, 12.000
13.000, 14.000, 15.000, 16.000

0.500, 1.000, 1.500, 2.000
2.500, 3.000, 3.500, 4.000
4.500, 5.000, 5.500, 6.000
6.500, 7.000, 7.500, 8.000

Correct output should be...

1.000, 2.000, 3.000, 4.000
5.000, 6.000, 7.000, 8.000
9.000, 10.000, 11.000, 12.000
13.000, 14.000, 15.000, 16.000

0.500, 1.000, 1.500, 2.000
2.500, 3.000, 3.500, 4.000
4.500, 5.000, 5.500, 6.000
6.500, 7.000, 7.500, 8.000

0.500, 1.000, 1.500, 2.000
2.500, 3.000, 3.500, 4.000
4.500, 5.000, 5.500, 6.000
6.500, 7.000, 7.500, 8.000

Am I missing something? Or is it actually a bug in the compiler?

`*(&out._11 + idx)` is undefined behavior. You can't make that assumption about the layout of the struct. And because the compiler expects you not to make such assumptions, it's free to optimize around it. Though this is something I'd expect out of GCC rather than MSVC. Does the compiler give you any warnings? — Mysticial, Aug 29 '16 at 15:20
It might not matter in this case, but you should add `#pragma pack(push, 1)` before and `#pragma pack(pop)` after the struct. — wimh, Aug 29 '16 at 15:23
You might want to check the disassembly to be sure of what's happening. — Banex, Aug 29 '16 at 15:25
Is there any reason why you don't use an array/4 arrays/an array of 4 arrays? — Simon Kraemer, Aug 29 '16 at 15:25
Instead of having separate member variables why not use a single dimensional array like all of the matrix libraries do? — NathanOliver, Aug 29 '16 at 15:25
*Bug in VC++ 14.0 (2015) compiler* -- You really shouldn't write title's like this until you confirm it is a bug. Titles like this gives the impression that the VS engineers are incompetent or just plain sloppy. IMO, the only persons who can truly state "bug in VC++" or any other compilers are vastly experienced C++ programmers who have a simple program to duplicate the error, and would also produce assembly language listing showing that the compiler is not producing the correct code. — PaulMcKenzie, Aug 29 '16 at 15:27
To continue -- also such folks would provide the line and verse from the ANSI standard stating what the behavior should be, and what the compiler is actually doing. Anything short of this, then it isn't a "bug", except for the bug in your code. — PaulMcKenzie, Aug 29 '16 at 15:35
@PaulMcKenzie Couldn't agree more. Before you can say it is a compiler bug you need to make sure you have conformant code. — NathanOliver, Aug 29 '16 at 15:38
Thanks for the replies. No it does not generate any warnings. I've seen implementations do this behavior to structs repeatedly, I didn't know it was undefined. I don't use an array for compatibility reasons. — Hasan Al-Jawahiri, Aug 29 '16 at 15:51
If anyone is interested, the optimization that is "breaking" this is most likely Scalar Replacement of Aggregates. The compiler is able to prove that matrices don't alias with anything and that nothing (except for `_11`) has its address taken. So the compiler is "pulling" the rest of the fields out of the struct and optimizing them as scalars rather than as members with a specific location in memory. That's why `*(&out._11 + idx)` doesn't work because the fields are no longer where you expect them to be. — Mysticial, Aug 29 '16 at 16:10

score 4 · Answer 1 · edited May 23 '17 at 12:32

This afflicts only the 32-bit compiler; x86-64 builds are not affected, regardless of optimization settings. However, you see the problem manifest in 32-bit builds whether optimizing for speed (/O2) or size (/O1). As you mentioned, it works as expected in debugging builds with optimization disabled.

Wimmel's suggestion of changing the packing, accurate though it is, does not change the behavior. (The code below assumes the packing is correctly set to 1 for WMatrix.)

I can't reproduce it in VS 2010, but I can in VS 2013 and 2015. I don't have 2012 installed. That's good enough, though, to allow us to analyze the difference between the object code produced by the two compilers.

Here is the code for mul1 from VS 2010 (the "working" code):
_{(Actually, in many cases, the compiler inlined the code from this function at the call site. But the compiler will still output disassembly files containing the code it generated for the individual functions prior to inlining. That's what we're looking at here, because it is more cluttered. The behavior of the code is entirely equivalent whether it's been inlined or not.)}

PUBLIC  mul1
_TEXT   SEGMENT
_m$ = 8                     ; size = 64
_f$ = 72                        ; size = 4
mul1 PROC
 ___$ReturnUdt$ = eax

    push    esi
    push    edi

    ; WMatrix out = m;

    mov ecx, 16                 ; 00000010H
    lea esi, DWORD PTR _m$[esp+4]
    mov edi, eax
    rep movsd

    ; for (unsigned int i = 0; i < 4; i++)
    ; {
    ;    for (unsigned int j = 0; j < 4; j++)
    ;    {
    ;       unsigned int idx = i * 4 + j; // critical code
    ;       *(&out._11 + idx) *= f; // critical code

    movss   xmm0, DWORD PTR [eax]
    cvtps2pd xmm1, xmm0
    movss   xmm0, DWORD PTR _f$[esp+4]
    cvtps2pd xmm2, xmm0
    mulsd   xmm1, xmm2
    cvtpd2ps xmm1, xmm1
    movss   DWORD PTR [eax], xmm1
    movss   xmm1, DWORD PTR [eax+4]
    cvtps2pd xmm1, xmm1
    cvtps2pd xmm2, xmm0
    mulsd   xmm1, xmm2
    cvtpd2ps xmm1, xmm1
    movss   DWORD PTR [eax+4], xmm1
    movss   xmm1, DWORD PTR [eax+8]
    cvtps2pd xmm1, xmm1
    cvtps2pd xmm2, xmm0
    mulsd   xmm1, xmm2
    cvtpd2ps xmm1, xmm1
    movss   DWORD PTR [eax+8], xmm1
    movss   xmm1, DWORD PTR [eax+12]
    cvtps2pd xmm1, xmm1
    cvtps2pd xmm2, xmm0
    mulsd   xmm1, xmm2
    cvtpd2ps xmm1, xmm1
    movss   DWORD PTR [eax+12], xmm1
    movss   xmm2, DWORD PTR [eax+16]
    cvtps2pd xmm2, xmm2
    cvtps2pd xmm1, xmm0
    mulsd   xmm1, xmm2
    cvtpd2ps xmm1, xmm1
    movss   DWORD PTR [eax+16], xmm1
    movss   xmm1, DWORD PTR [eax+20]
    cvtps2pd xmm1, xmm1
    cvtps2pd xmm2, xmm0
    mulsd   xmm1, xmm2
    cvtpd2ps xmm1, xmm1
    movss   DWORD PTR [eax+20], xmm1
    movss   xmm1, DWORD PTR [eax+24]
    cvtps2pd xmm1, xmm1
    cvtps2pd xmm2, xmm0
    mulsd   xmm1, xmm2
    cvtpd2ps xmm1, xmm1
    movss   DWORD PTR [eax+24], xmm1
    movss   xmm1, DWORD PTR [eax+28]
    cvtps2pd xmm1, xmm1
    cvtps2pd xmm2, xmm0
    mulsd   xmm1, xmm2
    cvtpd2ps xmm1, xmm1
    movss   DWORD PTR [eax+28], xmm1
    movss   xmm1, DWORD PTR [eax+32]
    cvtps2pd xmm1, xmm1
    cvtps2pd xmm2, xmm0
    mulsd   xmm1, xmm2
    cvtpd2ps xmm1, xmm1
    movss   DWORD PTR [eax+32], xmm1
    movss   xmm1, DWORD PTR [eax+36]
    cvtps2pd xmm1, xmm1
    cvtps2pd xmm2, xmm0
    mulsd   xmm1, xmm2
    cvtpd2ps xmm1, xmm1
    movss   DWORD PTR [eax+36], xmm1
    movss   xmm2, DWORD PTR [eax+40]
    cvtps2pd xmm2, xmm2
    cvtps2pd xmm1, xmm0
    mulsd   xmm1, xmm2
    cvtpd2ps xmm1, xmm1
    movss   DWORD PTR [eax+40], xmm1
    movss   xmm1, DWORD PTR [eax+44]
    cvtps2pd xmm1, xmm1
    cvtps2pd xmm2, xmm0
    mulsd   xmm1, xmm2
    cvtpd2ps xmm1, xmm1
    movss   DWORD PTR [eax+44], xmm1
    movss   xmm2, DWORD PTR [eax+48]
    cvtps2pd xmm1, xmm0
    cvtps2pd xmm2, xmm2
    mulsd   xmm1, xmm2
    cvtpd2ps xmm1, xmm1
    movss   DWORD PTR [eax+48], xmm1
    movss   xmm1, DWORD PTR [eax+52]
    cvtps2pd xmm1, xmm1
    cvtps2pd xmm2, xmm0
    mulsd   xmm1, xmm2
    cvtpd2ps xmm1, xmm1
    movss   DWORD PTR [eax+52], xmm1
    movss   xmm1, DWORD PTR [eax+56]
    cvtps2pd xmm1, xmm1
    cvtps2pd xmm2, xmm0
    mulsd   xmm1, xmm2
    cvtpd2ps xmm1, xmm1
    cvtps2pd xmm0, xmm0
    movss   DWORD PTR [eax+56], xmm1
    movss   xmm1, DWORD PTR [eax+60]
    cvtps2pd xmm1, xmm1
    mulsd   xmm1, xmm0
    pop edi
    cvtpd2ps xmm0, xmm1
    movss   DWORD PTR [eax+60], xmm0
    pop esi

    ; return out;
    ret 0
mul1 ENDP

Compare that to the code for mul1 generated by VS 2015:

mul1 PROC
_m$ = 8                         ; size = 64
; ___$ReturnUdt$ = ecx
; _f$ = xmm2s

    ; WMatrix out = m;

    movups  xmm0, XMMWORD PTR _m$[esp-4]

    ; for (unsigned int i = 0; i < 4; i++)

    xor eax, eax
    movaps  xmm1, xmm2
    movups  XMMWORD PTR [ecx], xmm0
    movups  xmm0, XMMWORD PTR _m$[esp+12]
    shufps  xmm1, xmm1, 0
    movups  XMMWORD PTR [ecx+16], xmm0
    movups  xmm0, XMMWORD PTR _m$[esp+28]
    movups  XMMWORD PTR [ecx+32], xmm0
    movups  xmm0, XMMWORD PTR _m$[esp+44]
    movups  XMMWORD PTR [ecx+48], xmm0
    npad    4
$LL4@mul1:

    ; for (unsigned int j = 0; j < 4; j++)
    ; {
    ;    unsigned int idx = i * 4 + j; // critical code
    ;    *(&out._11 + idx) *= f; // critical code

    movups  xmm0, XMMWORD PTR [ecx+eax*4]
    mulps   xmm0, xmm1
    movups  XMMWORD PTR [ecx+eax*4], xmm0
    inc eax
    cmp eax, 4
    jb  SHORT $LL4@mul1

    ; return out;
    mov eax, ecx
    ret 0
?mul1@@YA?AUWMatrix@@U1@M@Z ENDP            ; mul1
_TEXT   ENDS

It is immediately obvious how much shorter the code is. Apparently the optimizer got a lot smarter between VS 2010 and VS 2015. Unfortunately, sometimes the source of the optimizer's "smarts" is the exploitation of bugs in your code.

Looking at the code that matches up with the loops, you can see that VS 2010 is unrolling the loops. All of the computations are done inline so that there are no branches. This is kind of what you'd expect for loops with upper and lower bounds that are known at compile time and, as in this case, reasonably small.

What happened in VS 2015? Well, it didn't unroll anything. There are 5 lines of code, and then a conditional jump JB back to the top of the loop sequence. That alone doesn't tell you much. What does look highly suspicious is that it only loops 4 times (see the cmp eax, 4 statement that sets flags right before doing the jb, effectively continuing the loop as long as the counter is less than 4). Well, that might be okay if it had merged the two loops into one. Let's see what it's doing inside of the loop:

$LL4@mul1:
  movups  xmm0, XMMWORD PTR [ecx+eax*4]   ; load a packed unaligned value into XMM0
  mulps   xmm0, xmm1                      ; do a packed multiplication of XMM0 by XMM1,
                                          ;  storing the result in XMM0
  movups  XMMWORD PTR [ecx+eax*4], xmm0   ; store the result of the previous multiplication
                                          ;  back into the memory location that we
                                          ;  initially loaded from

  inc      eax                            ; one iteration done, increment loop counter
  cmp      eax, 4                         ; see how many loops we've done
  jb       $LL4@mul1                      ; keep looping if < 4 iterations

The code reads a value from memory (an XMM-sized value from the location determined by ecx + eax * 4) into XMM0, multiplies it by a value in XMM1 (which was set outside the loop, based on the f parameter), and then stores the result back into the original memory location.

Compare that to the code for the corresponding loop in mul2:

$LL4@mul2:
  lea     eax, DWORD PTR [eax+16]
  movups  xmm0, XMMWORD PTR [eax-24]
  mulps   xmm0, xmm2
  movups  XMMWORD PTR [eax-24], xmm0
  sub     ecx, 1
  jne     $LL4@mul2

Aside from a different loop control sequence (this sets ECX to 4 outside of the loop, subtracts 1 each time through, and keeps looping as long as ECX != 0), the big difference here is the actual XMM values that it manipulates in memory. Instead of loading from [ecx+eax*4], it loads from [eax-24] (after having previously added 16 to EAX).

What's different about mul2? You had added code to track a separate index in idx2, incrementing it each time through the loop. Now, this alone would not be enough. If you comment out the assignment to the bool variable b, mul1 and mul2 result in identical object code. Clearly without the comparison of idx to idx2, the compiler is able to deduce that idx2 is completely unused, and therefore eliminate it, turning mul2 into mul1. But with that comparison, the compiler apparently becomes unable to eliminate idx2, and its presence ever so slightly changes what optimizations are deemed possible for the function, resulting in the output discrepancy.

Now the question turns to why is this happening. Is it an optimizer bug, as you first suspected? Well, no—and as some of the commenters have mentioned, it should never be your first instinct to blame the compiler/optimizer. Always assume that there are bugs in your code unless you can prove otherwise. That proof would always involve looking at the disassembly, and preferably referencing the relevant portions of the language standard if you really want to be taken seriously.

In this case, Mystical has already nailed the problem. Your code exhibits undefined behavior when it does *(&out._11 + idx). This makes certain assumptions about the layout of the WMatrix struct in memory, which you cannot legally make, even after explicitly setting the packing.

This is why undefined behavior is evil—it results in code that seems to work sometimes, but other times it doesn't. It is very sensitive to compiler flags, especially optimizations, but also target platforms (as we saw at the top of this answer). mul2 only works by accident. Both mul1 and mul2 are wrong. Unfortunately, the bug is in your code. Worse, the compiler didn't issue a warning that might have alerted you to your use of undefined behavior.

What assumptions can one not make about a struct? It seems like common practice to assume how a struct is laid out. For example, in graphics, you pass "vertex buffers" to the GPU, telling it how the struct looks like by specifying to the graphics API the offset of each member into the struct. This way, the GPU can interpret what the passed in buffer contains and pass the correct values to the pipeline. How come its undefined? — Hasan Al-Jawahiri, Aug 29 '16 at 18:37
*"What assumptions can one not make about a struct?"* In general, none. You cannot assume anything about the layout of the struct. And there is never a need to. Just write the code out explicitly, let the optimizer deal with making it optimal. Specifically, the problem here is assuming that the struct is laid out in memory like an array. I don't know what "vertex buffers" have to do with any of this, or what type of graphics programming you're talking about. When you pass data to the GPU, the GPU is responsible for the interpretation of it, not the C++ compiler. @hasan — Cody Gray - on strike, Aug 30 '16 at 06:42
@hasan: you can rely on the order but not the offset of members, and that only as long as there are no intermediate access specifiers. — MSalters, Aug 30 '16 at 07:41

score 2 · Answer 2 · answered Aug 29 '16 at 16:34

If we look at the generated code, the problem is fairly clear. Ignoring a few bits and pieces that aren't related to the problem at hand, mul1 produces code like this:

movss   xmm1, DWORD PTR _f$[esp-4] ; load xmm1 from _11 of source
; ...

shufps  xmm1, xmm1, 0               ; duplicate _11 across floats of xmm1
; ...

for ecx = 0 to 3 {
    movups  xmm0, XMMWORD PTR [dest+ecx*4] ; load 4 floats from dest
    mulps   xmm0, xmm1                     ; multiply each by _11
    movups  XMMWORD PTR [dest+ecx*4], xmm0 ; store result back to dest
}

So, instead of multiplying each element of one matrix by the corresponding element of the other matrix, it's multiplying each element of one matrix by _11 of the other matrix.

Although it's impossible to confirm exactly how it happened (without looking through the compiler's source code), this certainly fits with @Mysticial's guess about how the problem arose.

Bug in VC++ 14.0 (2015) compiler?

2 Answers2

Linked