Avoid multiple copy of data when composing objects at construction without dynamic allocation

Question

We have hierarchical messages that are represented by classes. They are used to send messages between threads and components by serializing/deserializing. In our use-case, we use std::variant<InnerA, InnherB, ...>, but to simplify, our code is similar to this:

class Inner {
  public:
    Inner(uint8_t* array, uint16_t arrayLength) {
        m_payloadLength = arrayLength; // Let's assume arrayLength is always < 256
        memcpy(m_payload.data(), array, arrayLength));
    }
    std::array<uint8_t, 256> m_payload;
    uint16_t m_payloadLength;
}

class Outer {
  public:
    Outer(const Inner& inner): m_inner(inner){};
    Inner m_inner;
}

class OuterOuter {
  public:
    OuterOuter(const Outer& outer): m_outer(outer){};
    Outer m_outer;
}

Thus to create an OuterOuter object we need to do

int main(int argc, char** argv){
   uint8_t buffer[4]  = {1,2,3,4};
   Inner inner(buffer, 4);
   Outer outer(inner);
   OuterOuter outerOuter(outer);
   addToThreadQueue(outerOuter);
}

Now the thing is, we use an embedded device so we cannot use dynamic memory with malloc and new. As of now, will the payload content will be copied thrice? Once for the creation of inner, once when the copy constructor of Inner is called in Outer, and once when the copy constructor of Outer is called in OuterOuter? If so, is there a way to avoid all this copying without the use of dynamic memory? If there is a way to pass my intent to the compiler, may be an optimization would be possible, if it's not already optimizing it.

Ideally, we would avoid the OuterOuter class taking all the subclasses construction argument since our tree is pretty deep and we use std::variant. In this example it would be OuterOuter(uint8_t* array, uint16_t arrayLength), Outer(uint8_t* array, uint16_t arrayLength) and then Outer would build Inner.

Yes, the payload is copied thrice. To avoid that (while also avoiding `new` and `malloc`), just have a pointer to the payload content rather than copying the payload into an array member variable. — Eljay, May 28 '21 at 14:19
Thing is, when we append the object to the queue to send to a thread, it's copied since we don't know when it will be processed. The pointer may not be valid when it's processed. — Xavier Groleau, May 28 '21 at 14:21
Consider instead default constructing an `OuterOuter`, then write your data directly to that instance's `Inner`. — François Andrieux, May 28 '21 at 14:22
i kinda confused about your code. in this code why you didn't use move semantic?. — N0ll_Boy, May 28 '21 at 14:25
@N0ll_Boy • there's no resources to move, so move semantic won't be any different than the copy. — Eljay, May 28 '21 at 14:25
@FrançoisAndrieux I think this may be the sane option. Though this may be tricky since we have some std::variant. — Xavier Groleau, May 28 '21 at 14:27
~@N0ll_Boy Doesn't the move semantic needs a pointer to move the resources?~ Eljay already answered — Xavier Groleau, May 28 '21 at 14:28
@Eljay i mean why he didn't define move constructor since it has std::array definitely has better performance. — N0ll_Boy, May 28 '21 at 14:28
@N0ll_Boy Move construction and copy construction are exactly equivalent for `std::array` if the elements are trivial. Moving an `std::array` will copy the whole array. The only way it can make a difference is if the elements have move semantics, like `std::array`. — François Andrieux, May 28 '21 at 14:29
Semantically there may be copies, but compilers are often pretty good at eliding those away. Your example, for instance gets optimized rather nicely: https://gcc.godbolt.org/z/qjnTKfGa1. So there might be nothing to "fix" from a *practical* standpoint. — , May 28 '21 at 14:40
@Frank Thanks, I was wondering if it was easily optimizable or I needed to make the intent more explicit for the compiler. This pretty much answers my question. Though I am open to more answers if there is a way to make it more explicit, or do it cleanly "by hand" to not be compiler dependent. — Xavier Groleau, May 28 '21 at 14:46
@XavierGroleau, The compiler doing a good job on the example does not necessarily mean that it will do so as well for your production code. You need to check for yourself. Even then, if the code is slower than you'd like, it does not necessarily mean that the copies are the culprit, and this just might not be worth the effort. — , May 28 '21 at 14:49
@Frank I definitly will. Also optimized with std::variant for those wondering with gcc on godbolt https://gcc.godbolt.org/z/GTbb7crh8 — Xavier Groleau, May 28 '21 at 14:49

score 2 · Answer 1 · 2021-05-28T15:25:09.623

Generally speaking, modern compilers do a good job at optimizing the construction of class hierarchies that have no side-effect to their construction beyond filling a continuous memory layout.

For example, gcc compiles your sample down to essentially a single class:

main:
  sub rsp, 280
  mov eax, 4
  mov rdi, rsp
  mov WORD PTR [rsp+256], ax
  mov DWORD PTR [rsp], 67305985
  call addToThreadQueue(OuterOuter const&)
  xor eax, eax
  add rsp, 280
  ret

see on godbolt

Even beyond that, compilers are allowed to skip some side-effects in certain scenarios. For example, in the following sample, gcc completely gets rid of a heap allocation via a process called "heap elision".

#include <memory>

extern int foo(int);
extern void bar(int);

struct MyStruct {
    int data;

    MyStruct() {
        auto val = std::make_unique<int>(12); 
        data = foo(*val);
    }
};

int main(int argc, char** argv){
   MyStruct x;
   bar(x.data);
}

becomes:

main:
  sub rsp, 8
  mov edi, 12
  call foo(int)
  mov edi, eax
  call bar(int)
  xor eax, eax
  add rsp, 8
  ret

see on godbolt

Obviously, you need to double check in your own code base, but the usual refrain remains: "Write easy to read and maintain code first, and only if the compiler does a bad job with it should you bother jumping through hoops to optimize it."

[Here is](https://godbolt.org/z/1jqdPYxze) how asm looks like when compiler can't look into Inner's ctor (or when it has side effect that can't be optimized away). — C.M., May 28 '21 at 15:57
Specifically, the compiler has no choice but to treat a function defined in another TU as having every possible side-effect. — , May 28 '21 at 16:19
yes, and you end up with a bunch of memory copies as a result. And yet, language has a solution that allows you to construct objects in-place (right where you need them to be constructed). — C.M., May 28 '21 at 16:20

C.M. · Accepted Answer · 2021-05-28T17:12:40.350

1

You can use inplacer (see this post). Your code will look like this:

#include <type_traits>
#include <array>
#include <cstdint>
#include <cstring>


using namespace std;


template<class F>
struct inplacer
{
    F f_;
    operator std::invoke_result_t<F&>() { return f_(); }
};

template<class F> inplacer(F) -> inplacer<F>;


struct Inner
{
    Inner(uint8_t* data, size_t len)
        : len_(len) // Let's assume arrayLength is always < 256
    {
        memcpy(payload_.data(), data, len*sizeof(*data));
    }

    std::array<uint8_t, 256>    payload_;
    size_t                      len_;
};

struct Outer
{
    template<class T>
    Outer(T&& inner): m_inner(std::forward<T>(inner)) {}

    Inner m_inner;
};

struct OuterOuter
{
    template<class T>
    OuterOuter(T&& outer): m_outer(std::forward<T>(outer)) {}

    Outer m_outer;
};


void addToThreadQueue(OuterOuter const&);

int main()
{
    uint8_t buffer[4]  = {1,2,3,4};
    OuterOuter outerOuter{ inplacer{[&]{ return Inner{buffer, size(buffer)}; }} };
    addToThreadQueue(outerOuter);
    return 0;
}

This approach will make you less dependent on compiler optimizations. It will also work if your ctors have side effects (or are unavailable for compiler to analyze in this translation unit).

main:
        sub     rsp, 280
        mov     rdi, rsp
        mov     DWORD PTR [rsp], 67305985
        mov     QWORD PTR [rsp+256], 4
        call    addToThreadQueue(OuterOuter const&)
        xor     eax, eax
        add     rsp, 280
        ret

Edit: here is similar solution (but without inplacer) -- it won't work with aggregates, but I bet in your case it is not a problem.

edited May 28 '21 at 17:12

answered May 28 '21 at 15:26

C.M.

3,071
1
14
33

While this is a great and clean solution to the problem and I would like to use it, it is not usable in my context since I cannot use heap allocation because our code runs on an embedded controller. The lambda `[&]{ return Inner{buffer, size(buffer)};` captures the current context which may or may not allocate heap depending on the compiler and the size of the context. We don't use lambda with context since we just want to be sure to never use the heap. – Xavier Groleau May 28 '21 at 16:05
Maybe passing context/arguments via the constructor of the placer with variadic templates? I'm not sure if it's feasible though, I'm just throwing an idea. – Xavier Groleau May 28 '21 at 16:08
@XavierGroleau Are you sure about potential heap allocation for context capture? I've never heard about this. If you want you can replace lambda with your own trivial closure object (that returns an instance of `Inner`) -- this will guarantee "no heap". – C.M. May 28 '21 at 16:10
@XavierGroleau [Here is](https://godbolt.org/z/61MWrjPdM) solution with arguments forwarding, but it is more limiting than inplacer. I remain highly skeptical of your lambda-related potential heap allocations claim... – C.M. May 28 '21 at 16:18
We had this discussion in the team and this was my [source](https://stackoverflow.com/questions/46163607/avoid-memory-allocation-with-stdfunction-and-member-function). Though after rereading, this was mostly for std::function and I don't know if this is applicable to lambda. Maybe you are right and lambda don't necessarily allocate heap even if they have context. Though yea, with this I could probably create a factory with my own closure object just to be sure. A bit like java factory, only using the stack though. This solution is starting to be interesting – Xavier Groleau May 28 '21 at 16:20
1

@XavierGroleau Yes, std::function uses heap (unless context is small enough). Lambdas never do that – C.M. May 28 '21 at 16:21
This is really neat, but you need to keep in mind that those perfect forwarding single-parameter constructors have a nasty tendency to interfere with copy and move construction. – May 28 '21 at 16:22
@Frank nothing "Outer(Outer const&) = default;" can't handle – C.M. May 28 '21 at 16:23
@C.M. Unfortunately, that causes "ambiguity" (not officialy, nut essentially), and you need to SFINAE those away from the forwarding constructor as well: https://godbolt.org/z/fWKT63re9 – May 28 '21 at 16:27
@Frank Yes, you are correct. I didn't take my time to think over it. Maybe it make sense to add `explicit` or additional `tag` argument in these ctors. Tbh, I used this technique a lot and never run into this problem. – C.M. May 28 '21 at 16:35
@C.M. fwiw, the correct constructor should look like this: https://godbolt.org/z/dcT6513fh – May 28 '21 at 16:43
2

@Frank Thanks! I improved it [a bit further](https://godbolt.org/z/e4Kfva95T) :) – C.M. May 28 '21 at 16:46
1

@Frank There is another way to "fix" this [without SFINAE](https://godbolt.org/z/fqvaeM5z7), but it makes MSVC unhappy... – C.M. May 28 '21 at 16:53
1

@XavierGroleau I've added another solution to my answer (see edit at the bottom) -- you probably will find it more appropriate/concise for your case. – C.M. May 28 '21 at 17:16
Thanks, this was very insightful and complete! I've set your answer as the solution to the post. – Xavier Groleau May 28 '21 at 17:36
@Frank you'll find [this](https://stackoverflow.com/questions/67744445/multiple-versions-of-a-defaulted-special-member-functions-are-not-allowed-msvc) interesting. Looks like problem can be solved without SFINAE, but MSVC is in disagreement with standard. – C.M. Jun 01 '21 at 18:57

Avoid multiple copy of data when composing objects at construction without dynamic allocation

2 Answers2