12

Let's mess around with very basic dynamically allocated memory. We take a vector of 3, set its elements and return the sum of the vector.

In the first test case I used a raw pointer with new[]/delete[]. In the second I used std::vector:

#include <vector>   

int main()
{
  //int *v = new int[3];        // (1)
  auto v = std::vector<int>(3); // (2)


  for (int i = 0; i < 3; ++i)
    v[i] = i + 1;

  int s = 0;
  for (int i = 0; i < 3; ++i)
    s += v[i];

  //delete[] v;                 // (1)
  return s;
}

Assembly of (1) (new[]/delete[])

main:                                   # @main
        mov     eax, 6
        ret

Assembly of (2) (std::vector)

main:                                   # @main
        push    rax
        mov     edi, 12
        call    operator new(unsigned long)
        mov     qword ptr [rax], 0
        movabs  rcx, 8589934593
        mov     qword ptr [rax], rcx
        mov     dword ptr [rax + 8], 3
        test    rax, rax
        je      .LBB0_2
        mov     rdi, rax
        call    operator delete(void*)
.LBB0_2:                                # %std::vector<int, std::allocator<int> >::~vector() [clone .exit]
        mov     eax, 6
        pop     rdx
        ret

Both outputs taken from https://gcc.godbolt.org/ with -std=c++14 -O3

In both versions the returned value is computed at compile time so we see just mov eax, 6; ret.

With the raw new[]/delete[] the dynamic allocation was completely removed. With std::vector however, the memory is allocated, set and freed.

This happens even with an unused variable auto v = std::vector<int>(3): call to new, memory is set and then call to delete.

I realize this is most likely a near impossible answer to give, but maybe someone has some insights and some interesting answers might pop out.

What are the contributing factors that don't allow compiler optimizations to remove the memory allocation in the std::vector case, like in the raw memory allocation case?

Adrian McCarthy
  • 45,555
  • 16
  • 123
  • 175
bolov
  • 72,283
  • 15
  • 145
  • 224
  • 1
    Your `std::vector` could conceivably grow beyond its initial size, requiring another memory allocation cycle (in a different member function of `std::vector`). The compiler is apparantly not able to tell this doesn't happen in the scope of your program. In a non-trivial example, it could be very much non-trivial to properly *check* for the condition, and chances that it is possible are low (or an experienced programmer wouldn't have used `std::vector` in the first place) so the compiler does not bother. Your array could not grow, so optimizing its memory handling away is a *much* easier task. – DevSolar Jan 04 '16 at 12:17
  • @DevSolar even if it grew, that would call a `new[]`, a copy and a `delete`. Thinks which can be done with the pointer in the (1) case. All it sees in both cases, I think, are the calls to `new`, `mov` and `delete` (`std::vector` has internally a pointer). I don't think it takes into account, what operation could be done in the future. – bolov Jan 04 '16 at 12:23
  • @bolov - I think DevSolar is trying to tell you that the use case of optimizing out a vector is so unusual that the compiler writers haven't bothered. There are much more low hanging fruit to pick! – Bo Persson Jan 04 '16 at 12:29
  • @bolov: Consider: 1) The compiler is looking at a syntax abstraction, not assembly, when doing the optimization. 2) Optimization is a tradeoff between a) work done on developing the optimizer, b) time and memory spent optimizing, and c) resulting speed-up. While the resulting assembly *looks* like you should be able to optimize that memory allocation away, this specific case does not happen that often in "real" code; and it would require quite some resources both by the developers and the compiler to check if a given code allows for that kind of optimization. It's not worth it. (What Bo said.) – DevSolar Jan 04 '16 at 12:31
  • @DevSolar I see your point – bolov Jan 04 '16 at 12:42
  • 1
    @DevSolar *Only* at syntax abstraction? Isn't there an extra stage involved? P.S. It's worth noting that it doesn't only not optimize away the allocation, it also makes sure to insert the values into the vector's memory, which could be a lead here. – Yam Marcovic Jan 04 '16 at 12:42
  • @YamMarcovic: Last time I looked at the gory innards of GCC code generation, optimization was done on something called "GIMPLE". I am quite sure clang and MSVC use different terms for different things. I am also sure that at least one of those reserved the term "syntax abstraction" for something specific running completely contrary to the point I wanted to make -- which was, code optimization is not done at Assembly level. I'll leave more specific statements to the gurus, which is why these are comments, not answers. – DevSolar Jan 04 '16 at 12:52
  • I'm not convinced that Clang is allowed to optimize this away even if it wanted to. `operator new` and `operator delete` are replaceable, and there might be another TU that provides a replacement version with side effects. [N3664](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3664.html)'s special leeway applies only to `new`/`delete` expressions, not calls to `operator new` directly (which is what `std::allocator` does). – T.C. Jan 04 '16 at 13:14
  • I am no expert but maybe you can try with std::array instead of std::vector? std::array more closely resembles a plain array since its size cannot be changed during runtime. – AndersK Jan 04 '16 at 13:18
  • 2
    I didn't think this was an interesting question until I read Hadi Brais's answer. – Adrian McCarthy Jan 04 '16 at 18:45

1 Answers1

14

When using a pointer to a dynamically allocated array (directly using new[] and delete[]), the compiler optimized away the calls to operator new and operator delete even though they have observable side effects. This optimization is allowed by the C++ standard section 5.3.4 paragraph 10:

An implementation is allowed to omit a call to a replaceable global allocation function (18.6.1.1, 18.6.1.2). When it does so, the storage is instead provided by the implementation or...

I'll show the rest of the sentence, which is crucial, at the end.

This optimization is relatively new because it was first allowed in C++14 (proposal N3664). Clang supported it since 3.4. The latest version of gcc, namely 5.3.0, doesn't take advantage of this relaxation of the as-if rule. It produces the following code:

main:
        sub     rsp, 8
        mov     edi, 12
        call    operator new[](unsigned long)
        mov     DWORD PTR [rax], 1
        mov     DWORD PTR [rax+4], 2
        mov     rdi, rax
        mov     DWORD PTR [rax+8], 3
        call    operator delete[](void*)
        mov     eax, 6
        add     rsp, 8
        ret

MSVC 2013 also doesn't support this optimization. It produces the following code:

main:
  sub         rsp,28h  
  mov         ecx,0Ch  
  call        operator new[] ()  
  mov         rcx,rax  
  mov         dword ptr [rax],1  
  mov         dword ptr [rax+4],2  
  mov         dword ptr [rax+8],3  
  call        operator delete[] ()  
  mov         eax,6  
  add         rsp,28h  
  ret 

I currently don't have access to MSVC 2015 Update 1 and therefore I don't know whether it supports this optimization or not.

Finally, here is the assembly code generated by icc 13.0.1:

main:
        push      rbp                                          
        mov       rbp, rsp                                   
        and       rsp, -128                                    
        sub       rsp, 128                                     
        mov       edi, 3                                       
        call      __intel_new_proc_init                         
        stmxcsr   DWORD PTR [rsp]                               
        mov       edi, 12                                 
        or        DWORD PTR [rsp], 32832                       
        ldmxcsr   DWORD PTR [rsp]                               
        call      operator new[](unsigned long)
        mov       rdi, rax                                      
        mov       DWORD PTR [rax], 1                            
        mov       DWORD PTR [4+rax], 2                          
        mov       DWORD PTR [8+rax], 3                         
        call      operator delete[](void*)
        mov       eax, 6    
        mov       rsp, rbp                           
        pop       rbp                                   
        ret                                          

Clearly, it doesn't support this optimization. I don't have access to the latest version of icc, namely 16.0.

All of these code snippets have been produced with optimizations enabled.

When using std::vector, all of these compilers didn't optimize away the allocation. When a compiler doesn't perform an optimization, it's either because it cannot for some reason or it's just not yet supported.

What are the contributing factors that don't allow compiler optimizations to remove the memory allocation in the std::vector case, like in the raw memory allocation case?

The compiler didn't perform the optimization because it's not allowed to. To see this, let's see the rest of the sentence of paragraph 10 from 5.3.4:

An implementation is allowed to omit a call to a replaceable global allocation function (18.6.1.1, 18.6.1.2). When it does so, the storage is instead provided by the implementation or provided by extending the allocation of another new-expression.

What this is saying is that you can omit a call to a replaceable global allocation function only if it originated from a new-expression. A new-expression is defined in paragraph 1 of the same section.

The following expression

new int[3]

is a new-expression and therefore the compiler is allowed to optimize away the associated allocation function call.

On the other hand, the following expression:

::operator new(12)

is NOT a new-expression (see 5.3.4 paragraph 1). This is just a function call expression. In other words, this is treated as a typical function call. This function cannot be optimized away because its imported from another shared library (even if you linked the runtime statically, the function itself calls another imported function).

The default allocator used by std::vector allocates memory using ::operator new and therefore the compiler is not allowed to optimize it away.

Let's test this. Here's the code:

int main()
{
  int *v =  (int*)::operator new(12);

  for (int i = 0; i < 3; ++i)
    v[i] = i + 1;

  int s = 0;
  for (int i = 0; i < 3; ++i)
    s += v[i];

  delete v;
  return s;
}

By compiling using Clang 3.7, we get the following assembly code:

main:                                   # @main
        push    rax
        mov     edi, 12
        call    operator new(unsigned long)
        movabs  rcx, 8589934593
        mov     qword ptr [rax], rcx
        mov     dword ptr [rax + 8], 3
        test    rax, rax
        je      .LBB0_2
        mov     rdi, rax
        call    operator delete(void*)
.LBB0_2:
        mov     eax, 6
        pop     rdx
        ret

This is exactly the same as assembly code generated when using std::vector except for mov qword ptr [rax], 0 which comes from the constructor of std::vector (the compiler should have removed it but failed to do so because of a flaw in its optimization algorithms).

Hadi Brais
  • 22,259
  • 3
  • 54
  • 95
  • 1
    And conversely, if you replace (in the instantiation of `std::vector`) `std::allocator` with your own allocator that calls `new[]` and `delete[]` in place of `::operator new` and `::operator delete`, the allocator calls will be optimized away. – ecatmur Jan 04 '16 at 18:37
  • 1
    @ecatmur Except that `new[]` will default-construct everything, and `delete[]` will likely result in double destruction. – T.C. Jan 04 '16 at 21:15
  • @T.C. Use `new char[n * sizeof(T)]`, or `new std::aligned_union_t<0, T>[n]` if you want to support over-aligned types (not required). – ecatmur Jan 04 '16 at 22:37
  • Is your interpretation that "you can omit a call to a replaceable global allocation function only if it originated from a new-expression" based solely on the word "another" in "another new-expression"? As in, the standard appears to say you can omit global allocation functions, but because they mention one way to do it is extending _another_ new expression, it prohibits replacing any other type of global allocation function? If that's the case I don't agree. For one thing, I don't think it makes sense. – BeeOnRope Mar 02 '19 at 17:22
  • For another thing, even if we want to language lawyer the standard, I think you can argue that if the implied restriction really exists, it would only apply to the second option, so the last sentence in that rule would be interpreted as "When it does so, the storage is instead provided by the implementation or, _if the call originated from a new expression_, the storage may be provided by extending the allocation of another new-expression.". – BeeOnRope Mar 02 '19 at 17:24
  • 1
    @BeeOnRope That part of the standard discusses the rules for omitting calls to allocation functions from a new expression. If a memory allocation function is being called like a normal function (not from a new expression) then the call to that function is not subject to these rules and the as-if rules would apply to it. For example, if you define your own allocation function, then sure the compiler can optimize a call to it away if it can prove that the observable behavior will not change... – Hadi Brais Mar 02 '19 at 18:00
  • ...But this is not usually possible with the allocation calls from new expressions because they are to imported functions that the compiler cannot analyze. That's why these special rules for optimizing new expressions exist. – Hadi Brais Mar 02 '19 at 18:00
  • @HadiBrais - on reading the surrounding section, I agree with you - what I say above is wrong. Interestingly, an [earlier version of the proposal](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3537.html), N3537, did propose working directly on the global allocation functions, and didn't mention new expressions. In N3664 they changed strategies and targeted new expressions instead, and that's what got adopted into the standard. Performance wise it is unfortunate, because `std::allocator` uses decoupled allocation and placement via the global operators, so this optimization fails. – BeeOnRope Mar 02 '19 at 18:14