
Are there any options (other than /O2) to improve the Visual C++ code output? The MSDN documentation is quite bad in this regard. Note that I'm not asking about project-wide settings (link-time optimization, etc.). I'm only interested in this particular example.

The fairly simple C++11 code looks like this:

#include <vector>
int main() {
    std::vector<int> v = {1, 2, 3, 4};
    int sum = 0;
    for(int i = 0; i < v.size(); i++) {
        sum += v[i];
    }
    return sum;
}

Clang's output with libc++ is quite compact:

main: # @main
  mov eax, 10
  ret

Visual C++ output, on the other hand, is a multi-page mess. Am I missing something here or is VS really this bad?

Compiler explorer link: https://godbolt.org/g/GJYHjE

Alexander
  • Are you compiling with Visual Studio in Release mode? – Thomas Matthews Feb 02 '18 at 15:30
  • What version of Visual Studio? – Thomas Matthews Feb 02 '18 at 15:30
  • Interesting question. I didn't know that VS generates so much code even from some constants. Still, I think a programmer must first think about the code they write, not about how the compiler will optimize it. – Nikita Smirnov Feb 02 '18 at 15:37
  • Which version of Visual Studio are you using? – Klaas van Gend Feb 02 '18 at 15:39
  • I just ran a quick test of this, Release compiled in VS 2017 (v141 toolset) and Intel 18.0. Both produced a reasonable amount of code. It looks like Clang has done a static analysis of the function, determined that it has no runtime-dependent operations, calculated the only possible result, and optimised it down to just returning that, and nothing more. I'd like to see what Clang produces if you have an indeterminate in there. – Rags Feb 02 '18 at 15:44
  • As seen from compiler explorer link, it's VS 2017. – Alexander Feb 02 '18 at 16:41
  • GCC also produces very little code (although not as good as clang/libc++). Interestingly, clang/libstdc++ (gcc's standard library) produces more code than clang/libc++. – Alexander Feb 02 '18 at 16:44
  • @Rags If you replace the definition with: "std::vector v = {1, argc, 3, 4};" clang output is: " lea eax, [rdi + 8] ret" – Alexander Feb 02 '18 at 16:45
  • @ThomasMatthews As you can see from compiler explorer, I'm not using VS, it's cl /O2. So no "Release" mode or any IDE-specific stuff like that. – Alexander Feb 02 '18 at 17:53
  • @Alexander That's nice. How about an actual indeterminate? Like, a rand(), or a user input? Still impressive that the optimiser can reduce it that much though. – Rags Feb 05 '18 at 08:46
  • @Rags Not exactly rand() or user input, just indeterminate function arguments, overall looking like a somewhat real-world example: [compiler explorer link](https://godbolt.org/g/f5pcx2) Very impressive results, I'd say. – Alexander Feb 05 '18 at 12:44

1 Answer


Unfortunately, it's difficult to improve the Visual C++ output much in this case, even with more aggressive optimization flags. Several factors contribute to the inefficiency, including the lack of certain compiler optimizations and the structure of Microsoft's implementation of <vector>.

Inspecting the generated assembly, Clang does an outstanding job optimizing this code. Specifically, compared to VS, Clang performs very effective constant propagation, function inlining (and, consequently, dead code elimination), and new/delete optimization.

Constant Propagation

In the example, the vector is initialized with four compile-time constants:

std::vector<int> v = {1, 2, 3, 4};

Normally, the compiler stores the constants 1, 2, 3, 4 in data memory, and in the for loop it loads one value at a time, starting from the lowest address (where 1 is stored), and adds each value to the sum.

Here's the abbreviated VS code for doing this:

movdqa   xmm0, XMMWORD PTR __xmm@00000004000000030000000200000001
...
movdqu   XMMWORD PTR $T1[rsp], xmm0 ; Store integers 1, 2, 3, 4 in memory
...
$LL4@main:
    add      ebx, DWORD PTR [rdx]   ; loop and sum the values
    lea      rdx, QWORD PTR [rdx+4]
    inc      r8d
    movsxd   rax, r8d
    cmp      rax, r9
    jb       SHORT $LL4@main

Clang, however, is clever enough to realize that the sum can be calculated in advance. My best guess is that it replaces the loads of the constants from memory with constant mov operations into registers (propagates the constants), and then combines them into the result of 10. This has the useful side effect of breaking dependencies, and since the addresses are no longer loaded from, the compiler is free to remove everything else as dead code.
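
As a rough illustration (my own sketch, not clang's actual intermediate representation), after constant propagation and dead code elimination the program behaves as if it had been written as:

// Conceptual sketch only: what main effectively reduces to once the constants
// are propagated, the sum is folded, and the vector machinery is dead code.
int main() {
    return 1 + 2 + 3 + 4; // folded to 10 at compile time
}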

Clang seems to be unique in doing this - neither VS nor GCC was able to precalculate the vector accumulation result in advance.

New/Delete Optimization

Compilers conforming to C++14 are allowed to omit calls to new and delete under certain conditions, specifically when the number of allocation calls is not part of the observable behavior of the program (the N3664 standard paper). This has already generated much discussion on SO.
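
For illustration, here is a minimal sketch of my own (not taken from the question) of the kind of allocation that N3664 permits a compiler to elide, since the number of allocation calls is not observable here:

int elidable() {
    int* p = new int(42);   // a conforming compiler may remove this allocation...
    int result = *p;
    delete p;               // ...together with the matching deallocation,
    return result;          // and simply return 42
}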

Clang invoked with -std=c++14 -stdlib=libc++ indeed performs this optimization and eliminates the calls to new and delete, which do carry side effects, but supposedly do not affect the observable behaviour of the program. With -stdlib=libstdc++, Clang is stricter and keeps the calls to new and delete - although, by looking at the assembly, it's clear they are not really needed.

Now, when inspecting the main code generated by VS, we can find two function calls (the rest of the vector construction and iteration code is inlined into main):

call std::vector<int,std::allocator<int> >::_Range_construct_or_tidy<int const * __ptr64>

and

call void __cdecl operator delete(void * __ptr64)

The first is used for allocating the vector, the second for deallocating it, and practically all the other functions in the VS output are pulled in by these two calls. This hints that Visual C++ does not optimize away calls to allocation functions (for C++14 conformance we should add the /std:c++14 flag, but the results are the same).

This blog post (May 10, 2017) from the Visual C++ team confirms that this optimization is indeed not implemented. Searching the page for N3664 shows that "Avoiding/fusing allocations" has status N/A, and the linked comment says:

[E] Avoiding/fusing allocations is permitted but not required. For the time being, we’ve chosen not to implement this.

The combined impact of new/delete optimization and constant propagation is easy to see in this Compiler Explorer 3-way comparison of Clang with -stdlib=libc++, Clang with -stdlib=libstdc++, and GCC.

STL Implementation

VS has its own STL implementation, structured very differently from libc++ and libstdc++, and that seems to contribute heavily to VS's inferior code generation. While the VS STL has some very useful features, such as checked iterators and iterator debugging hooks (_ITERATOR_DEBUG_LEVEL), it gives the general impression of being heavier and less efficient than libstdc++.
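
As a side note, here is a small sketch of my own showing what those debugging hooks buy you: with _ITERATOR_DEBUG_LEVEL at its debug-build default of 2, MSVC's checked iterators diagnose the out-of-range access below at runtime, instead of silently hitting undefined behavior:

#include <vector>

int bad_access() {
    std::vector<int> v = {1, 2, 3, 4};
    auto it = v.end();
    return *it; // undefined behavior; MSVC's checked iterators assert here in debug builds
}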

To isolate the impact of the vector STL implementation, an interesting experiment is to compile with Clang combined with the VS header files. Indeed, using Clang 5.0.0 with the Visual Studio 2015 headers results in the following code generation - clearly, the STL implementation has a huge impact!

main:                                   # @main
.Lfunc_begin0:
.Lcfi0:
.seh_proc main
    .seh_handler __CxxFrameHandler3, @unwind, @except
# BB#0:                                 # %.lr.ph
    pushq   %rbp
.Lcfi1:
    .seh_pushreg 5
    pushq   %rsi
.Lcfi2:
    .seh_pushreg 6
    pushq   %rdi
.Lcfi3:
    .seh_pushreg 7
    pushq   %rbx
.Lcfi4:
    .seh_pushreg 3
    subq    $72, %rsp
.Lcfi5:
    .seh_stackalloc 72
    leaq    64(%rsp), %rbp
.Lcfi6:
    .seh_setframe 5, 64
.Lcfi7:
    .seh_endprologue
    movq    $-2, (%rbp)
    movl    $16, %ecx
    callq   "??2@YAPEAX_K@Z"
    movq    %rax, -24(%rbp)
    leaq    16(%rax), %rcx
    movq    %rcx, -8(%rbp)
    movups  .L.ref.tmp(%rip), %xmm0
    movups  %xmm0, (%rax)
    movq    %rcx, -16(%rbp)
    movl    4(%rax), %ebx
    movl    8(%rax), %esi
    movl    12(%rax), %edi
.Ltmp0:
    leaq    -24(%rbp), %rcx
    callq   "?_Tidy@?$vector@HV?$allocator@H@std@@@std@@IEAAXXZ"
.Ltmp1:
# BB#1:                                 # %"\01??1?$vector@HV?$allocator@H@std@@@std@@QEAA@XZ.exit"
    addl    %ebx, %esi
    leal    1(%rdi,%rsi), %eax
    addq    $72, %rsp
    popq    %rbx
    popq    %rdi
    popq    %rsi
    popq    %rbp
    retq
    .seh_handlerdata
    .long   ($cppxdata$main)@IMGREL
    .text

Update - Visual Studio 2017

In Visual Studio 2017, <vector> has seen a major overhaul, as announced on this blog post from the Visual C++ team. Specifically, it mentions the following optimizations:

  • Eliminated unnecessary EH logic. For example, vector’s copy assignment operator had an unnecessary try-catch block. It just has to provide the basic guarantee, which we can achieve through proper action sequencing.

  • Improved performance by avoiding unnecessary rotate() calls. For example, emplace(where, val) was calling emplace_back() followed by rotate(). Now, vector calls rotate() in only one scenario (range insertion with input-only iterators, as previously described).

  • Improved performance with stateful allocators. For example, move construction with non-equal allocators now attempts to activate our memmove() optimization. (Previously, we used make_move_iterator(), which had the side effect of inhibiting the memmove() optimization.) Note that a further improvement is coming in VS 2017 Update 1, where move assignment will attempt to reuse the buffer in the non-POCMA non-equal case.

Curious, I went back to test this. When building the example in Visual Studio 2017, the result is still a multi-page assembly listing with many function calls, so even if code generation has improved, it is difficult to notice.

However, when building with clang 5.0.0 and Visual Studio 2017 headers, we get the following assembly:

main:                                   # @main
.Lcfi0:
.seh_proc main
# BB#0:
    subq    $40, %rsp
.Lcfi1:
    .seh_stackalloc 40
.Lcfi2:
    .seh_endprologue
    movl    $16, %ecx
    callq   "??2@YAPEAX_K@Z" ; void * __ptr64 __cdecl operator new(unsigned __int64)
    movq    %rax, %rcx
    callq   "??3@YAXPEAX@Z" ; void __cdecl operator delete(void * __ptr64)
    movl    $10, %eax
    addq    $40, %rsp
    retq
    .seh_handlerdata
    .text

Note the movl $10, %eax instruction - that is, with VS 2017's <vector>, clang was able to collapse everything, precalculate the result of 10, and keep only the calls to new and delete.

I'd say that is pretty amazing!
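
In other words, the assembly above corresponds roughly to the following (a conceptual sketch of the compiler's end result, not code anyone wrote):

#include <new>

int main() {
    void* p = ::operator new(16); // the vector's single 16-byte allocation is kept,
    ::operator delete(p);         // and immediately freed,
    return 10;                    // while the sum itself is precalculated at compile time
}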

Function Inlining

Function inlining is probably the single most vital optimization in this example. By collapsing the code of called functions into their call sites, the compiler can perform further optimizations on the merged code; in addition, removing the function calls reduces call overhead and removes optimization barriers.

When inspecting the generated assembly for VS and comparing the code before and after inlining (Compiler Explorer), we can see that most vector functions were indeed inlined, except for the allocation and deallocation functions. In particular, there are calls to memmove, which result from the inlining of some higher-level functions, such as _Uninitialized_copy_al_unchecked.

memmove is a library function and therefore cannot be inlined. However, clang has a clever way around this - it replaces the call to memmove with a call to __builtin_memmove. __builtin_memmove is a builtin/intrinsic function with the same functionality as memmove, but, as opposed to a plain function call, the compiler generates code for it and embeds it into the calling function. Consequently, the code can be further optimized inside the calling function and eventually removed as dead code.
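
To see why this matters, here is a standalone sketch of my own: when the size and source bytes of a memmove are known at compile time, the intrinsic form can be expanded inline and then folded away together with the surrounding arithmetic (with clang at -O2, the whole function below collapses to returning 10):

#include <cstring>

int sum_copied() {
    const int src[4] = {1, 2, 3, 4};
    int dst[4];
    std::memmove(dst, src, sizeof(src));       // lowered to the builtin and expanded inline
    return dst[0] + dst[1] + dst[2] + dst[3];  // folds to a compile-time constant
}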

Summary

To conclude, Clang is clearly superior to VS in this example, thanks both to higher-quality optimizations and to a more efficient vector STL implementation. When the same header files are used for Visual C++ and clang (the Visual Studio 2017 headers), Clang beats Visual C++ hands down.

While writing this answer, I couldn't help thinking: what would we do without Compiler Explorer? Thanks, Matt Godbolt, for this amazing tool!

valiano
  • Thanks! What about gcc? It does have new/delete in its output, but the code is still dramatically shorter than VS's. [compiler explorer link](https://godbolt.org/g/gKkEkH) – Alexander Feb 03 '18 at 20:05
  • @Alexander Right - it seems that both gcc and clang are also able to optimize out the std::vector pattern very early and fold everything into the expected result, and VS doesn't do that - this is probably also key to VS's bloated code. – valiano Feb 04 '18 at 07:34
  • @Alexander The more I thought about this, the better I understood which optimizations clang and VS are doing, so I decided to expand my answer accordingly - please see above. – valiano Feb 05 '18 at 22:58
  • I wish I could upvote you twice! Thanks for such a detailed answer! Even with a somewhat real-world-looking example the difference is astounding: [compiler explorer link](https://godbolt.org/g/f5pcx2). – Alexander Feb 07 '18 at 12:32
  • The fact that clang is not able to optimize away new/delete with libstdc++, makes me think that either there's some pattern matching or special annotation involved, or there is some special libc++ operator new/delete detection, or maybe something to do with the default allocator. I guess clang/libstdc++ optimization is not particularly important to them, as it's mainly used for development only. – Alexander Feb 07 '18 at 12:34
  • @Alexander sure, with pleasure! – valiano Feb 07 '18 at 17:00
  • @Alexander - if I'm to guess, I'd say that under the `-stdlib=libc++` flag, clang is "comfortable" performing the new/delete optimization - probably a conscious decision, relying on the clang developers' knowledge of libc++ internals (the two are tightly coupled projects), whereas libstdc++ is outside the clang/libc++ developers' control. I'm not sure why the new/delete optimization is allowed with libc++, though - it would be interesting to try to locate this decision in the clang code. – valiano Feb 08 '18 at 07:25
  • @Alexander I updated my answer with some findings about Visual Studio 2017, you may find them interesting. You really have to give it to clang! – valiano Feb 08 '18 at 07:32
  • I think I found the exact reason why clang is able to remove new/delete with libc++ only. See the __builtin_operator_new and __builtin_operator_delete descriptions here: [clang manual](https://clang.llvm.org/docs/LanguageExtensions.html) – Alexander Feb 08 '18 at 13:16
  • Thanks for the VS2017 library test. As you may know, an experimental version of clang is bundled with VS2017 now. Your test shows me that once it's stable, I will be able to move my Windows builds from cl to clang and have a pretty good justification for it. – Alexander Feb 08 '18 at 13:20
  • @Alexander __builtin_operator_new - that's it! Great to know – valiano Feb 08 '18 at 13:39