Can creation of composite objects from temporaries be optimised away?

Question

I've asked a few questions which have touched around this issue, but I've been getting differing responses, so I thought best to ask it directly.

Lets say we have the following code:

// Silly examples of A and B, don't take so seriously, 
// just keep in mind they're big and not dynamically allocated.
struct A { int x[1000]; A() { for (int i = 0; i != 1000; ++i) { x[i] = i * 2; } };
struct B { int y[1000]; B() { for (int i = 0; i != 1000; ++i) { y[i] = i * 3; } };

struct C
{
  A a;
  B b;
};

A create_a() { return A(); }
B create_b() { return B(); }

C create_c(A&& a, B&& b)
{
  C c;
  c.a = std::move(a);
  c.b = std::move(b);
  return C; 
};

int main()
{
  C x = create_c(create_a(), create_b());
}

Now ideally create_c(A&&, B&&) should be a no-op. Instead of the calling convention being for A and B to be created and references to them passed on stack, A and B should created and passed in by value in the place of the return value, c. With NRVO, this will mean creating and passing them directly into x, with no further work for the function create_c to do.

This would avoid the need to create copies of A and B.

Is there any way to allow/encourage/force this behavior from a compiler, or do optimizing compilers generally do this anyway? And will this only work when the compiler inline the functions, or will it work across compilation units.

(How I think this could work across compilation units...)

If create_a() and create_b() took a hidden parameter of where to place the return value, they could place the results into x directly, which is then passed by reference to create_c() which needs to do nothing and immediately returns.

David Rodríguez - dribeas · Accepted Answer · 2011-05-03T07:39:06.543

There are different ways of optimizing the code that you have, but rvalue references are not one. The problem is that neither A nor B can be moved at no cost, since you cannot steal the contents of the object. Consider the following example:

template <typename T>
class simple_vector {
   typedef T element_type;
   typedef element_type* pointer_type;
   pointer_type first, last, end_storage;
public:
   simple_vector() : first(), last(), end_storage() {}
   simple_vector( simple_vector const & rhs )              // not production ready, memory can leak from here!
      : first( new element_type[ rhs.last - rhs.first ] ),
        last( first + rhs.last-rhs.first ),
        end_storage( last )
   {
       std::copy( rhs.first, rhs.last, first );
   }
   simple_vector( simple_vector && rhs ) // we can move!
      : first( rhs.first ), last( rhs.last ), end_storage( rhs.end_storage )
   {
      rhs.first = rhs.last = rhs.end_storage = 0;
   }
   ~simple_vector() {
      delete [] rhs.first;
   }
   // rest of operations
};

In this example, as the resources are held through pointers, there is a simple way of moving the object (i.e. stealing the contents of the old object into the new one and leaving the old object in a destroyable but useless state. Simply copy the pointers and reset them in the old object to null so that the original object destructor will not free the memory.

The problem with both A and B is that the actual memory is held in the object through an array, and that array cannot be moved to a different memory location for the new C object.

Of course, since you are using stack allocated objects in the code, the old (N)RVO can be used by the compiler, and when you do: C c = { create_a(), create_b() }; the compiler can perform that optimization (basically set the attribute c.a on the address of the returned object from create_a, while when compiling create_a, create the returned temporary directly over that same address, so effectively, c.a, the returned object from create_a and the temporary constructed inside create_a (implicit this to the constructor) are the same object, avoiding two copies. The same can be done with c.b, avoiding the copying cost. If the compiler does inline your code, it will remove create_c and replace it with a construct similar to: C c = {create_a(), create_b()}; so it can potentially optimize all copies away.

Note on the other hand, that this optimization cannot be completely used in the case of a C object allocated dynamically as in C* p = new C; p->a = create_a();, since the destination is not in the stack, the compiler can only optimize the temporary inside create_a and its return value, but it cannot make that coincide with p->a, so a copy will need to be done. This is the advantage of rvalue-references over (N)RVO, but as mentioned before you cannot do use effectively rvalue-references in your code example directly.

Billy ONeal · Answer 2 · 2011-05-03T06:52:16.347

There are two kinds of optimization which can apply in your case:

Function Inlining (In the case of A, B, and C (and the A and B it contains))
Copy elision (C (and the A and B it contains) only, because you returned C by value)

For a function this small, it's probably going to be inlined. Most any compiler will do it if it exists in the same translation unit, and good compilers like MSVC++ and G++ (and I think LLVM but I'm not sure on that one) have whole-program-optimization settings which will do it even across translation units. If the function is inlined, then yes, the function call (And the copy that comes with it) aren't going to occur at all.

If for some reason the function doesn't get inlined (i.e. you used __declspec(noinline) on MSVC++), then you're still going to be eligible for the Named Return Value Optimization (NRO), which good C++ compilers (again, MSVC++, G++, and I think LLVM) all implement. Basically, the standard says that the compilers are allowed to not perform the copy on return if they can avoid doing so, and they will usually emit code that avoids it. There are some things you can do to deactivate NRVO, but for the most part it's a pretty safe optimization to rely on.

Finally, profile. If you see a performance problem, then figure out something else. Otherwise I'd write things in the ideomatic way and replace them with more performant constructs if and only if you need to.

score 0 · Answer 3 · answered May 03 '11 at 06:37

0

Isn't the obvious thing to do to give C a constructor and then say:

C create_c(const A & a, const B & b)
{
  return C( a, b );
}

which has lots of possibilities for optimisation. Or indeed get rid of the create function. I don't think this is a very good motivating example.

answered May 03 '11 at 06:37

@unapersson: How does this eliminate the copies of A and B. Could you elaborate? – Clinton May 03 '11 at 06:39
@Clinton You can't eliminate ALL the copies of A and B, as C has got to end up containing copies. – May 03 '11 at 06:41
@unapersson : No, in the code Clinton has shown, there are 0 copies. RVO eliminates the copies returned from `create_a` and `create_b`, and `create_c` takes rvalue references and *moves* them into its `C` instance. The only needless inefficiency in this example is that `C` default-constructs instances of `A` and `B` then moves over them, rather than using move semantics to initialize them in the first place. – ildjarn May 03 '11 at 07:09
@ildjam: so does it really *move* `a` and `b` into `c`? C++0x is still new to me, but my guess is that there is no sensible cheap *move* operation that can be performed, since the array itself cannot be *moved* from one location to another. This example does not elide the copies from `create_a` and `create_b` into the new `C` object but the original code doesn't either (unless the code is also inlined and then there is greater potential for optimizations). – David Rodríguez - dribeas May 03 '11 at 07:35
@David Rodríguez : My point is that technically only the type's move constructor and move assignment operator are invoked. Even if -- as in this case -- moving rather than copying isn't actually a savings, because the semantics of the way the class is being used are correct, if the class implementation were changed to use heap allocation (e.g. `std::vector`) rather than C-arrays, everything would fall into place perfectly. Point being, even if this particular class is a poor example, the semantics of the way the class is used are correct, which is what I understood the OP to be asking about. – ildjarn May 03 '11 at 07:44
@ildjam: I don't quite agree, by just removing `create_c` and doing `C c = { create_a(), create_b() };` you get to basically the same result *now*, without needing to wait for a later refactor of the types `A` and `B`. That might even be the case already if the compiler is inlining. The semantics *are* already clear enough without needing *rvalue-references*, and a good-old-compiler (implementing the current standard) can already optimize all copies away *now* (by means of inlining). – David Rodríguez - dribeas May 03 '11 at 07:45
The point I am trying to make is not that you should not use move semantics, but rather that there is no point if the objects are not movable, *move* semantics were not initially designed for everyday use, but rather thinking in library implementors, which at the end turns into everyday use as *you* need to provide the *move* constructor if you want to take advantage of it... And in that part, in user code, it is important to know what is *actually movable* and what is not, and neither `A` nor `B` *are* movable. `std::vector` is a move aware container, and using `A` in it will *copy*. – David Rodríguez - dribeas May 03 '11 at 07:46
@David Rodríguez : Aggregate initialization is great for classes that warrant no particular invariant enforcement, but I consider the fact that `class C` is an in-appearance-POD-type to be a side-effect of the fact that the question was slimmed down to the barest amount of code needed to get the point across, moreso than the idea that the OP was specifically asking *only* about POD-types (or in-appearanc-POD-types). In the *general* case where a class cannot use aggregate initialization due to being otherwise non-trivial, the approach in the OP's code is exactly ideal. – ildjarn May 03 '11 at 07:51
@David Rodríguez : And my point is that one *should* use move semantics on every type, even if that type doesn't currently make the best use of them -- that type's implementation could improve at any point in the future, so you should already be using it properly. Going any further than that in terms of class design should be an optimization effort conceded to only after profiling deems it necessary. – ildjarn May 03 '11 at 07:52
@ildjarn - Isn't this just premature optimization (or worse, if you know it doesn't have any effect)? Why try to move something you **know** isn't movable? How do you test the "moving" code that isn't moving? Save that until it is needed! – Bo Persson May 03 '11 at 11:51
@Bo Persson : Because it may become movable some day, and it *isn't hurting anything in the meantime*. An analogy, if you're familiar with C++/CLI: it's idiomatic to use value (stack) semantics whenever possible (possibly using `msclr::auto_handle<>`), for every type possible, even ones that don't implement `IDisposable`. The reason: they may implement `IDisposable` at some point in the future and if you're not using value semantics then your code becomes less than ideal; however, if you use value semantics from the beginning, it's future-proof and *isn't hurting anything in the meantime*. – ildjarn May 03 '11 at 11:55
1

@ildjam - OK, I am not future proof then, I'm lean. :-) – Bo Persson May 03 '11 at 12:36

Can creation of composite objects from temporaries be optimised away?

3 Answers3

Linked