
So I'm currently refactoring a giant function:

int giant_function(size_t n, size_t m, /*... other parameters */) {
  int x[n]{};
  float y[n]{};
  int z[m]{};
  /* ... more array definitions */

And when I find a group of related definitions with discrete functionality, I group them into a class definition:

class V0 {
  std::unique_ptr<int[]> x;
  std::unique_ptr<float[]> y;
  std::unique_ptr<int[]> z;
public:
  V0(size_t n, size_t m)
    : x{new int[n]{}}
    , y{new float[n]{}}
    , z{new int[m]{}}
  {}
  // methods...
};

The refactored version is inevitably more readable, but one thing I find less-than-satisfying is the increase in the number of allocations.

Allocating all those (potentially very large) arrays on the stack is arguably a problem waiting to happen in the unrefactored version, but there's no reason that we couldn't get by with just one larger allocation:

class V1 {
  int* x;
  float* y;
  int* z;
public:
  V1(size_t n, size_t m) {
    char *buf = new char[n*sizeof(int)+n*sizeof(float)+m*sizeof(int)];
    x = (int*) buf;
    buf += n*sizeof(int);
    y = (float*) buf;
    buf += n*sizeof(float);
    z = (int*) buf;
  }
  // methods...
  ~V1() { delete[] ((char *) x); }
};

Not only does this approach involve a lot of manual (read: error-prone) bookkeeping, but its greater sin is that it's not composable.

If I want to have a V1 value and a W1 value on the stack, then that's one allocation each for their behind-the-scenes resources. Even simpler, I'd want the ability to allocate a V1 and the resources it points to in a single allocation, and I can't do that with this approach.
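
For example (W1 is hypothetical here, just a second class built the same way as V1), putting two such values on the stack still costs two separate heap allocations:

void f(size_t n, size_t m, size_t p) {
  V1 v{n, m};   // heap allocation #1, behind the scenes
  W1 w{p};      // heap allocation #2 (W1: a hypothetical second V1-style class)
  // ... no way to fold these, or v and w themselves, into a single allocation
}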

What this initially led me to was a two-pass approach: one pass to calculate how much space was needed, then one giant allocation, then another pass to parcel out the allocation and initialize the data structures.

class V2 {
  int* x;
  float* y;
  int* z;
public:
  static size_t size(size_t n, size_t m) {
    return sizeof(V2) + n*sizeof(int) + n*sizeof(float) + m*sizeof(int);
  }
  V2(size_t n, size_t m, char** buf) {
    x = (int*) *buf;
    *buf += n*sizeof(int);
    y = (float*) *buf;
    *buf += n*sizeof(float);
    z = (int*) *buf;
    *buf += m*sizeof(int);
  }
};
// ...
size_t total = ... + V2::size(n,m) + ...
char* buf = new char[total];
// ...
void*  here = buf;
buf += sizeof(V2);
V2* v2 = new (here) V2{n, m, &buf};

However, that approach had a lot of repetition at a distance, which is asking for trouble in the long run. Returning a factory got rid of that:

class V3 {
  int* const x;
  float* const y;
  int* const z;
  V3(int* x, float* y, int* z) : x{x}, y{y}, z{z} {}
public:
  class Factory {
    size_t const n;
    size_t const m;
  public:
    Factory(size_t n, size_t m) : n{n}, m{m} {}
    size_t size() {
      return sizeof(V3) + sizeof(int)*n + sizeof(float)*n + sizeof(int)*m;
    }
    V3* build(char** buf) {
      void * here = *buf;
      *buf += sizeof(V3);
      int* x = (int*) *buf;
      *buf += n*sizeof(int);
      float* y = (float*) *buf;
      *buf += n*sizeof(float);
      int* z = (int*) *buf;
      *buf += m*sizeof(int);
      return new (here) V3{x,y,z};
    }
  };
};
// ...
V3::Factory v3factory{n,m};
// ...
size_t total = ... + v3factory.size() + ...
char* buf = new char[total];
// ..
V3* v3 = v3factory.build(&buf);

Still some repetition, but at least the parameters only get entered once. And there's still a lot of manual bookkeeping. It'd be nice if I could build this factory out of smaller factories...

And then my Haskell brain hit me: I was implementing an Applicative Functor. This could totally be nicer!

All I needed to do was write some tooling to automatically sum sizes and run build functions side-by-side:

#include <cstddef>  // for size_t
#include <new>      // for placement new

namespace plan {

template <typename A, typename B>
struct Apply {
  A const a;
  B const b;
  Apply(A const a, B const b) : a{a}, b{b} {};

  template<typename ... Args>
  auto build(char* buf, Args ... args) const {
    return a.build(buf, b.build(buf + a.size()), args...);
  }

  size_t size() const {
    return a.size() + b.size();
  }

  Apply(Apply<A,B> const & plan) : a{plan.a}, b{plan.b} {}
  Apply(Apply<A,B> const && plan) : a{plan.a}, b{plan.b} {}

  template<typename U, typename ... Vs>
  auto operator()(U const u, Vs const ... vs) const {
    return Apply<decltype(*this),U>{*this,u}(vs...);
  }

  auto operator()() const {
    return *this;
  }
};
template<typename T>
struct Lift {
  template<typename ... Args>
  T* build(char* buf, Args ... args) const {
    return new (buf) T{args...};
  }
  size_t size() const {
    return sizeof(T);
  }
  Lift() {}
  Lift(Lift<T> const &) {}
  Lift(Lift<T> const &&) {}

  template<typename U, typename ... Vs>
  auto operator()(U const u, Vs const ... vs) const {
    return Apply<decltype(*this),U>{*this,u}(vs...);
  }

  auto operator()() const {
    return *this;
  }
}; 

template<typename T>
struct Array {
  size_t const length;
  Array(size_t length) : length{length} {}
  T* build(char* buf) const {
    return new (buf) T[length]{};
  }
  size_t size() const {
    return sizeof(T) * length;  // sizeof(T[length]) would be a non-standard VLA type
  }
};

template <typename P>
auto heap_allocate(P plan) {
  return plan.build(new char[plan.size()]);
}

}

Now I could state my class quite simply:

class V4 {
  int* const x;
  float* const y;
  int* const z;

public:
  V4(int* x, float* y, int* z) : x{x}, y{y}, z{z} {}

  static auto plan(size_t n, size_t m) {
    return plan::Lift<V4>{}(
      plan::Array<int>{n},
      plan::Array<float>{n},
      plan::Array<int>{m}
    );
  }
};

And use it in a single pass:

V4* v4;
W4* w4;
std::tie( ..., v4, w4, ... ) = *plan::heap_allocate(
  plan::Lift<std::tie>{}(
    // ...
    V4::plan(n,m),
    W4::plan(m,p,2*m+1),
    // ...
  )
);

It's not perfect (among other issues, I need to add code to track destructors, and have heap_allocate return a std::unique_ptr that calls all of them), but before I went further down the rabbit hole, I thought I should check for pre-existing art.
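
(For reference, the least ambitious version of that unique_ptr-returning heap_allocate I have in mind would look something like the sketch below. heap_allocate_unique is just a placeholder name, and it ignores the destructor-tracking part, so it would only be valid for trivially destructible contents.)

// Sketch only: the deleter just frees the raw buffer without running any
// destructors, so this is only safe for trivially destructible contents.
// Assumes <memory> and <type_traits> are included.
template <typename P>
auto heap_allocate_unique(P plan) {
  std::unique_ptr<char[]> buf{new char[plan.size()]};
  auto* built = plan.build(buf.get());
  using T = std::remove_pointer_t<decltype(built)>;
  // the deleter owns the raw buffer; the T* it receives is just a view into it
  auto deleter = [raw = buf.release()](T*) { delete[] raw; };
  return std::unique_ptr<T, decltype(deleter)>{built, std::move(deleter)};
}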

For all I know, modern compilers may be smart enough to recognize that the memory in V0 always gets allocated/deallocated together and batch the allocation for me.

If not, is there a preexisting implementation of this idea (or a variation thereof) for batching allocations with an applicative functor?

rampion
  • I have a concern about the types and alignments of the arrays you are using. `new char[some_size]` does not align the storage for the type you want. If they are all the same alignment then it's fairly trivial to fix, but if they are not, I'm not sure they can be placed in the same buffer, at least not without some extra math to get the right starting address and size for the buffer. – NathanOliver Apr 02 '18 at 21:05
  • Operator `new` does not allocate on the stack. But why aren't you using `std::vector` which takes care of all that, no muss, no fuss? Almost all of that code disappears in a puff of C++. – Jive Dadson Apr 02 '18 at 21:09
  • NathanOliver: That's a good point! And something I'd hope a good implementation would have already considered, which makes me wish even harder that one exists. – rampion Apr 02 '18 at 21:10
  • @JiveDadson: I didn't mean to imply that operator `new` did - would you help me out by telling me where I did so? I don't want to be unclear. Using `std::vector` would be analogous to my `V0` example - std::vector still has allocations behind the scenes. – rampion Apr 02 '18 at 21:12
  • Eschew early optimization. – Jive Dadson Apr 02 '18 at 21:14
  • @JiveDadson: A fantastic principle. And yet, as there are many cases when reducing the number of allocations *is* an appropriate optimization, I'm still curious as to whether such a library already exists. – rampion Apr 02 '18 at 21:16
  • By the way, the original function is ill-formed in standard C++, since the size of an automatic array must be a compile-time constant. – eerorika Apr 02 '18 at 21:16
  • Why think that `std::vector` is extravagant in its allocations, and that a different library could do better? Library writers obsess over such things. If you know in advance how much needs to be allocated, use `std::vector::reserve()`. Just use `std::vector`. Don't be so hard on yourself. – Jive Dadson Apr 02 '18 at 21:21
  • Pay attention to `decltype(*this)`: it's an lvalue ref. Your code is continuously storing references to temporary objects. – llllllllll Apr 02 '18 at 21:41

1 Answer

First, I'd like to provide feedback on problems with your solutions:

  1. You ignore alignment. You're relying on the assumption that int and float share the same alignment on your system, so your particular use case might be "fine". But try to add some double into the mix and there will be UB; you might find your program crashing on ARM chips due to unaligned access. (A minimal sketch of this failure mode appears after this list.)

  2. new (buf) T[length]{}; is unfortunately bad and non-portable. In short: the standard allows the compiler to reserve an initial y bytes of the given storage for internal use. Your program fails to allocate these y bytes on systems where y > 0 (and yes, such systems apparently exist; VC++ reportedly does this).

    Having to allocate extra space for y would merely be annoying; what makes array placement new unusable here is that there's no way to find out how big y is until placement new is actually called. There's really no way to use it for this case.

  3. You're already aware of this, but for completeness: You don't destroy the sub-buffers, so if you ever use a non-trivially-destructible type, then there will be UB.
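
To make problem 1 concrete, here's a minimal sketch of the failure mode (the Bad class is mine, purely for illustration, and it assumes a typical platform with 4-byte int and 8-byte-aligned double):

#include <cstddef>

struct Bad {
    int*    x;
    double* d;
    Bad(std::size_t n, std::size_t m) {
        char* buf = new char[n * sizeof(int) + m * sizeof(double)];
        x = (int*) buf;
        buf += n * sizeof(int);  // with an odd n, buf is now only 4-byte aligned,
        d = (double*) buf;       // so using *d is UB, and a real crash on
                                 // hardware that traps on unaligned access
    }
    ~Bad() { delete[] (char*) x; }
};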


Solutions:

  1. Allocate an extra alignof(T) - 1 bytes for each buffer. Align the start of each buffer with std::align. (A minimal sketch of this, together with solution 2, follows this list.)

  2. You need to loop and use non-array placement new. Technically, doing non-array placement new means that using pointer arithmetic on these objects has UB, but the standard is just silly in this regard and I choose to ignore it. There's language-lawyerish discussion about that. As I understand it, the p0593r2 proposal includes a resolution to this technicality.

  3. Add destructor calls that correspond to the placement-new calls (or static_assert that only trivially destructible types may be used). Note that support for non-trivial destruction raises the need for exception safety: if construction of one buffer throws, then the sub-buffers that were constructed earlier need to be destroyed. The same care needs to be taken when the constructor of a single element throws after some elements have already been constructed.
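
Before the full template below, here's a minimal sketch of solutions 1 and 2 applied to a single sub-buffer (construct_subbuffer is just an illustrative name; exception safety, i.e. solution 3, is omitted here and handled in the real code below):

#include <cstddef>
#include <memory>
#include <new>

// Sketch: carve one aligned array of T out of raw storage, default-constructing
// the elements with non-array placement new. buf/space track the remaining storage.
template <class T>
T* construct_subbuffer(char*& buf, std::size_t& space, std::size_t length)
{
    void* aligned = buf;
    // solution 1: std::align bumps `aligned` up to an alignof(T) boundary
    // (shrinking `space` by the adjustment), or returns nullptr if it won't fit
    if (!std::align(alignof(T), sizeof(T) * length, aligned, space))
        throw std::bad_alloc{};
    T* first = static_cast<T*>(aligned);
    // solution 2: loop with non-array placement new instead of new (buf) T[length]
    for (std::size_t i = 0; i < length; ++i)
        new (first + i) T;
    buf = reinterpret_cast<char*>(first + length);  // advance past this sub-buffer
    space -= sizeof(T) * length;
    return first;
}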


I don't know of prior art, but how about some subsequent art? I decided to take a stab at this from a slightly different angle. Be warned, though: this lacks testing and may contain bugs.

A buffer_clump template for constructing/destructing objects in external raw storage, and for calculating the aligned boundaries of each sub-buffer:

#include <cstddef>
#include <memory>
#include <vector>
#include <tuple>
#include <cassert>
#include <type_traits>
#include <utility>

// recursion base
template <class... Args>
class buffer_clump {
protected:
    constexpr std::size_t buffer_size() const noexcept { return 0; }
    constexpr std::tuple<> buffers(char*) const noexcept { return {}; }
    constexpr void construct(char*) const noexcept { }
    constexpr void destroy(const char*) const noexcept {}
};

template<class Head, class... Tail>
class buffer_clump<Head, Tail...> : buffer_clump<Tail...> {
    using tail = buffer_clump<Tail...>;
    const std::size_t length;
    
    constexpr std::size_t size() const noexcept
    {
        return sizeof(Head) * length + alignof(Head) - 1;
    }
    
    constexpr Head* align(char* buf) const noexcept
    {
        void* aligned = buf;
        std::size_t space = size();
        assert(std::align(
            alignof(Head),
            sizeof(Head) * length,
            aligned,
            space
        ));
        return (Head*)aligned;
    }
    
    constexpr char* next(char* buf) const noexcept
    {
        return buf + size();
    }
    
    static constexpr void
    destroy_head(Head* head_ptr, std::size_t last)
    noexcept(std::is_nothrow_destructible<Head>::value)
    {
        if constexpr (!std::is_trivially_destructible<Head>::value)
            while (last--)
                head_ptr[last].~Head();
    }
    
public:
    template<class... Size_t>
    constexpr buffer_clump(std::size_t length, Size_t... tail_lengths) noexcept
    : tail(tail_lengths...), length(length) {}
    
    constexpr std::size_t
    buffer_size() const noexcept
    {
        return size() + tail::buffer_size();
    }
    
    constexpr auto
    buffers(char* buf) const noexcept
    {
        return std::tuple_cat(
            std::make_tuple(align(buf)), 
            tail::buffers(next(buf))
        );
    }
    
    void
    construct(char* buf) const
    noexcept((std::is_nothrow_default_constructible<Head>::value && ... && std::is_nothrow_default_constructible<Tail>::value))
    {
        Head* aligned = align(buf);
        std::size_t i;
        try {
            for (i = 0; i < length; i++)
                new (&aligned[i]) Head;
            tail::construct(next(buf));
        } catch (...) {
            destroy_head(aligned, i);
            throw;
        }
    }
    
    constexpr void
    destroy(char* buf) const
    noexcept((std::is_nothrow_destructible<Head>::value && ... && std::is_nothrow_destructible<Tail>::value))
    {
        tail::destroy(next(buf));
        destroy_head(align(buf), length);
    }
};

A buffer_clump_storage template that leverages buffer_clump to construct sub-buffers into an RAII container:

template <class... Args>
class buffer_clump_storage {
    const buffer_clump<Args...> clump;
    std::vector<char> storage;
    
public:
    constexpr auto buffers() noexcept {
        return clump.buffers(storage.data());
    }
    
    template<class... Size_t>
    buffer_clump_storage(Size_t... lengths)
    : clump(lengths...), storage(clump.buffer_size())
    {
        clump.construct(storage.data());
    }
    
    ~buffer_clump_storage()
    noexcept(noexcept(clump.destroy(nullptr)))
    {
        if (storage.size())
            clump.destroy(storage.data());
    }

    buffer_clump_storage(buffer_clump_storage&& other) noexcept
    : clump(other.clump), storage(std::move(other.storage))
    {
        other.storage.clear();
    }
};

Finally, a class that can be allocated as an automatic variable and provides named pointers to the sub-buffers of a buffer_clump_storage:

class V5 {
    // macro tricks or boost mpl magic could be used to avoid repetitive boilerplate
    buffer_clump_storage<int, float, int> storage;
    
public:
    int* x;
    float* y;
    int* z;
    V5(std::size_t xs, std::size_t ys, std::size_t zs)
    : storage(xs, ys, zs)
    {
        std::tie(x, y, z) = storage.buffers();
    }
};

And usage:

int giant_function(size_t n, size_t m, /*... other parameters */) {
    V5 v(n, n, m);
    for(std::size_t i = 0; i < n; i++)
        v.x[i] = i;
    // ...
}

In case you need only the clumped allocation and not so much the ability to name the group, this direct usage avoids pretty much all boilerplate:

int giant_function(size_t n, size_t m, /*... other parameters */) {
    buffer_clump_storage<int, float, int> v(n, n, m);
    auto [x, y, z] = v.buffers();
    // ...
}

Criticism of my own work:

  • I didn't bother making the V5 members const, which would arguably have been nice, but I found it involved more boilerplate than I would prefer.
  • Compilers will warn that there is a throw in a function declared noexcept when the constructors cannot throw; neither g++ nor clang++ is smart enough to see that the throw can never happen when the function is noexcept. I guess that can be worked around by using partial specialization, or I could just add (non-standard) directives to disable the warning.
  • buffer_clump_storage could be made copyable and assignable. That involves a lot more code, and I wouldn't expect to need it. The move constructor might be superfluous as well, but at least it's efficient and concise to implement.
eerorika