
(Yes, I know there is a question with almost the same title, but the answer was not satisfactory, see below)

EDIT Sorry, the original question didn't use compiler optimization. This is now fixed, but to avoid trivial optimization and to come closer to my actual use case, the test has been split into two compilation units.

The fact that the constructor of std::vector<> has linear complexity is a nuisance when it comes to performance-critical applications. Consider this simple code

// compilation unit 1:
void set_v0(type*x, size_t n)
{
  for(size_t i=0; i<n; ++i)
    x[i] = simple_function(i);
}

// compilation unit 2:
std::vector<type> x(n);                     // default initialisation is wasteful
set_v0(x.data(),n);                         // over-writes initial values

where a significant amount of time is wasted constructing x. The conventional way around this, as explored by this question, seems to be to merely reserve the storage and use push_back() to fill in the data:

// compilation unit 1:
void set_v1(std::vector<type>&x, size_t n)
{
  x.reserve(n);
  for(size_t i=0; i<n; ++i)
    x.push_back(simple_function(i));
}

// compilation unit 2:
std::vector<type> x; x.reserve(n);          // no initialisation
set_v1(x,n);                                // using push_back()

However, as indicated by my comment, push_back() is inherently slow, making this second approach actually slower than the first for objects that are sufficiently cheap to construct, such as size_t. With

simple_function = [](size_t i) { return i; };

I get the following timings (using gcc 4.8 with -O3; clang 3.2 produced ~10% slower code)

timing vector::vector(n) + set_v0();
n=10000 time: 3.9e-05 sec
n=100000 time: 0.00037 sec
n=1000000 time: 0.003678 sec
n=10000000 time: 0.03565 sec
n=100000000 time: 0.373275 sec

timing vector::vector() + vector::reserve(n) + set_v1();
n=10000 time: 1.9e-05 sec
n=100000 time: 0.00018 sec
n=1000000 time: 0.00177 sec
n=10000000 time: 0.020829 sec
n=100000000 time: 0.435393 sec

The speed-up actually possible if one could elide the default construction of elements can be estimated by the following cheating version

// compilation unit 2
std::vector<type> x; x.reserve(n);          // no initialisation
set_v0(x.data(),n);                         // error: write beyond end of vector
                                            // note: vector::size() == 0

where we get

timing vector::vector + vector::reserve(n) + set_v0();          (CHEATING)
n=10000 time: 8e-06 sec
n=100000 time: 7.2e-05 sec
n=1000000 time: 0.000776 sec
n=10000000 time: 0.01119 sec
n=100000000 time: 0.298024 sec

So, my first question: Is there any legal way to use a standard library container which would give these latter timings? Or do I have to resort to managing the memory myself?

Now, what I really want is to use multi-threading to fill in the container. The naive code (using OpenMP in this example for simplicity, which excludes clang for the moment)

// compilation unit 1
void set_v0(type*x, size_t n)
{
#pragma omp for                       // only difference from set_v0() above
  for(size_t i=0; i<n; ++i)
    x[i] = simple_function(i);
}

// compilation unit 2:
std::vector<type> x(n);               // default initialisation not multi-threaded
#pragma omp parallel
set_v0(x.data(),n);                   // over-writes initial values in parallel

now suffers from the fact that the default initialisation of all elements is not multi-threaded, resulting in a potentially serious performance degradation. Here are the timings for this OpenMP version of set_v0() and an equivalent cheating method (on my MacBook's Intel i7 chip with 4 cores, 8 hyperthreads):

timing std::vector::vector(n) + omp parallel set_v0()
n=10000 time: 0.000389 sec
n=100000 time: 0.000226 sec
n=1000000 time: 0.001406 sec
n=10000000 time: 0.019833 sec
n=100000000 time: 0.35531 sec

timing vector::vector + vector::reserve(n) + omp parallel set_v0(); (CHEATING)
n=10000 time: 0.000222 sec
n=100000 time: 0.000243 sec
n=1000000 time: 0.000793 sec
n=10000000 time: 0.008952 sec
n=100000000 time: 0.089619 sec

Note that the parallel cheat version is ~3.3 times faster than the serial cheat version, roughly as expected, but the standard version is not.

So, my second question: Is there any legal way to use a standard library container which would give these latter timings in multi-threaded situations?

PS. I found this question, where std::vector is tricked into avoiding the default initialization by providing it with an uninitialized_allocator. This is no longer standard compliant, but works very well for my test case (see my own answer below and this question for details).

– Walter
  • Are you able to use emplace_back instead of push_back to construct your objects "in-place" thus saving the construction of a default constructed object? – jcoder Apr 11 '13 at 15:22
  • `emplace_back()` makes no difference. The cost here is not that of constructing the element (a trivial `size_t` in my examples), but that of adjusting the vector data members (though this should be merely a `pointer++`). – Walter Apr 11 '13 at 15:28
  • If that is the problem, `set_v0` should be fast, shouldn't it? Also, is your default-ctor marked `noexcept`? – Daniel Frey Apr 11 '13 at 15:30
  • What is not legal in your code? – joy Apr 11 '13 at 15:46
  • Did you remove the `std::fill` from `uninitialized_allocator::allocate`? And did you compile with optimizations cranked up? When I run this experiment on clang++/libc++, the loop containing the call to construct gets completely optimized away. It could also be that your `vector` implementation does not yet completely conform to C++11 in this regard. The C++98/03 `vector` was not required to have this functionality. I'm just now noticing that you specialized allocator_traits in std. That should not be necessary. – Howard Hinnant Apr 11 '13 at 15:49
  • @HowardHinnant yes, I removed the std::fill. However, I didn't use optimization. ... When I do this with the simple examples above, then there is no performance degradation. However, in my actual application, the call to `vector::resize()` and the code that subsequently fills in the vector are in different compilation units, so the optimisation is unlikely to happen there (the latter codes takes a pointer, it does know nothing about the vector). – Walter Apr 11 '13 at 16:27
  • @neagoegab a vector has essentially 3 data members: the begin and end of the allocated memory, and the end of the actually used memory. With `vector::reserve()`, you change the capacity=memory allocated, but not the size=actually used memory. If I had used `vector::at()` instead of `[]` in `set_cheat()`, it would have crashed with a run-time error. – Walter Apr 11 '13 at 16:30
  • @Walter *"However, I didn't use optimization"* - Nrgh, wrong answer! Really, you don't speak about performance when measuring without optimizations. My grandma could fill a 1 MB vector faster than a non-optimizing compiler. If your simple examples get optimized away, you have to complicate them. – Christian Rau Apr 11 '13 at 16:39
  • @ChristianRau sorry (agree). I will update my question in due time with optimisation switched on ... – Walter Apr 11 '13 at 17:06
  • You should use such allocator only with [`is_trivially_default_constructible`](http://en.cppreference.com/w/cpp/types/is_default_constructible) types. – Evgeny Panasyuk Apr 11 '13 at 21:07
  • You can enforce Evgeny Panasyuk's excellent advice by putting `static_assert(std::is_trivially_default_constructible::value, "This allocator can only be used with trivally default constructible types");` in your no-op `construct` function. Also, this function should really be a member template (say on `U`) that constructs a `U` and similarly for the multi-parameter overload of `construct`. And if you make this change, change your `static_assert` to test `U` instead of `T`. – Howard Hinnant Apr 11 '13 at 21:18
  • @EvgenyPanasyuk indeed. I was pondering this last night. I will implement it though with SFINAE or by defaulting to ordinary std::vector for other types. – Walter Apr 12 '13 at 06:10

4 Answers


With g++ 4.5 I was able to realize an approximately 20% reduction in runtime relative to v0 (1.0 s to 0.8 s), and slightly less for v1 (0.95 s to 0.8 s), by using a generator to construct the elements directly:

struct Generator : public std::iterator<std::forward_iterator_tag, int>
{
    explicit Generator(int start) : value_(start) { }
    void operator++() { ++value_; }
    int operator*() const { return value_; }

    bool operator!=(Generator other) const { return value_ != other.value_; }

    int value_;
};

int main()
{
    const int n = 100000000;
    std::vector<int> v(Generator(0), Generator(n));

    return 0;
}
– Mark B
  • Good idea, but the generator should rather be a `random_access_iterator` (since we know the number of elements, that's what the question is all about), as otherwise the vector constructor will have to do an additional loop over the generator to get its size (or use vector growth internally), so there wouldn't be much of a win. With a `random_access_iterator_tag`ed (and implemented) generator, this would be quite an acceptable solution, though (maybe wrapped in a general `generating_iterator` taking a functor). – Christian Rau Apr 11 '13 at 16:24
  • @Christian Rau I actually thought about that and tried it with the random access iterator tag and provided an `operator-` implementation and it had no measurable effect on the performance whatsoever (much to my surprise). Thus I left it with the simpler implementation. – Mark B Apr 11 '13 at 18:12
  • @MarkB Well, in the end everything this question is about might not have a *practical* performance impact, but it has a clear *theoretical* one and for such questions asking about the last bit of performance, why not get it right in the first place. Of course one might call it *"premature optimization"*, but what else is this whole question about? It's not some localized multiply-vs-shift problem, but an actual difference in algorithmic complexity between two different iterator categories (O(2*n) is always worse than O(n), no matter if measurable or not). – Christian Rau Apr 11 '13 at 19:01

Okay, here is what I've learned since asking this question.

Q1 (Is there any legal way to use a standard library container which would give these latter timings?) Yes to some degree, as shown in the answers by Mark and Evgeny. The method of providing a generator to the constructor of std::vector elides the default construction.

Q2 (Is there any legal way to use a standard library container which would give these latter timings in multi-threaded situations?) No, I don't think so. The reason is that, on construction, any standard-compliant container must initialise its elements, if only to ensure that the later calls to the element destructors (upon destruction or resizing of the container) are well-defined. As the standard library containers do not support multi-threaded construction of their elements, the trick from Q1 cannot be replicated here, so we cannot elide the default construction.

Thus, if we want to use C++ for high-performance computing, our options are somewhat limited when it comes to managing large amounts of data. We can

1. declare a container object and, in the same compilation unit, immediately fill it (concurrently), so that the compiler can hopefully optimise away the initialisation at construction;

2. resort to new[] and delete[] or even malloc() and free(), in which case all the memory management and, in the latter case, the construction of elements is our responsibility, and our potential usage of the C++ standard library is very limited;

3. trick std::vector into not initialising its elements using a custom uninitialised_allocator that elides the default construction. Following the ideas of Jared Hoberock, such an allocator could look like this (see also here):

// based on a design by Jared Hoberock
// edited (Walter) 10-May-2013, 23-Apr-2014
template<typename T, typename base_allocator = std::allocator<T> >
struct uninitialised_allocator
  : base_allocator
{
  static_assert(std::is_same<T,typename base_allocator::value_type>::value,
                "allocator::value_type mismatch");

  template<typename U>
  using base_t =
    typename std::allocator_traits<base_allocator>::template rebind_alloc<U>;

  // rebind to base_t<U> for all U!=T: we won't leave other types uninitialised!
  template<typename U>
  struct rebind
  {
    typedef typename
    std::conditional<std::is_same<T,U>::value,
                     uninitialised_allocator, base_t<U> >::type other; 
  };

  // elide trivial default construction of objects of type T only
  template<typename U>
  typename std::enable_if<std::is_same<T,U>::value && 
                          std::is_trivially_default_constructible<U>::value>::type
  construct(U*) {}

  // elide trivial default destruction of objects of type T only
  template<typename U>
  typename std::enable_if<std::is_same<T,U>::value && 
                          std::is_trivially_destructible<U>::value>::type
  destroy(U*) {}

  // forward everything else to the base
  using base_allocator::construct;
  using base_allocator::destroy;
};

Then an uninitialised_vector<> template could be defined like this:

template<typename T, typename base_allocator = std::allocator<T>>
using uninitialised_vector = std::vector<T,uninitialised_allocator<T,base_allocator>>;

and we can still use almost all of the standard library's functionality. Though it must be said that the uninitialised_allocator, and by implication the uninitialised_vector, are not standard compliant, because the vector's elements are not default constructed (e.g. a vector<int> will not contain all 0s after construction).

When using this tool for my little test problem, I get excellent results:

timing vector::vector(n) + set_v0();
n=10000 time: 3.7e-05 sec
n=100000 time: 0.000334 sec
n=1000000 time: 0.002926 sec
n=10000000 time: 0.028649 sec
n=100000000 time: 0.293433 sec

timing vector::vector() + vector::reserve() + set_v1();
n=10000 time: 2e-05 sec
n=100000 time: 0.000178 sec
n=1000000 time: 0.001781 sec
n=10000000 time: 0.020922 sec
n=100000000 time: 0.428243 sec

timing vector::vector() + vector::reserve() + set_v0();
n=10000 time: 9e-06 sec
n=100000 time: 7.3e-05 sec
n=1000000 time: 0.000821 sec
n=10000000 time: 0.011685 sec
n=100000000 time: 0.291055 sec

timing vector::vector(n) + omp parallel set_v0();
n=10000 time: 0.00044 sec
n=100000 time: 0.000183 sec
n=1000000 time: 0.000793 sec
n=10000000 time: 0.00892 sec
n=100000000 time: 0.088051 sec

timing vector::vector() + vector::reserve() + omp parallel set_v0();
n=10000 time: 0.000192 sec
n=100000 time: 0.000202 sec
n=1000000 time: 0.00067 sec
n=10000000 time: 0.008596 sec
n=100000000 time: 0.088045 sec

where there is no longer any difference between the cheating and "legal" versions.

– Walter
  • For high performance computing, why do you allocate in the critical path? Any allocation is bad in critical path. You should do all allocation during the initialization of your application – balki Apr 12 '13 at 10:37
  • you might need temporary storage, or you may need to increase your grid size (adaptively) as the simulation evolves... – Walter Apr 12 '13 at 10:50
  • So, what happens when the container is destroyed with unconstructed elements in it? The destructors are called: can you guarantee that this isn't a problem, or it isn't going to happen? – Yakk - Adam Nevraumont Apr 12 '13 at 14:36
  • @Yakk The whole thing only works for trivially default constructible and trivially destructible types. – Walter Apr 26 '13 at 14:57
  • See also this [safe to use default-initializing allocator](https://stackoverflow.com/a/21028912/427158). – maxschlepzig Nov 03 '18 at 14:33

boost::transformed

For the single-threaded version, you may use boost::transformed. Its documentation states:

Returned Range Category: The range category of rng.

This means that if you give a Random Access Range to boost::transformed, it returns a Random Access Range too, which allows the vector's constructor to pre-allocate the required amount of memory.

You may use it as follows:

const auto &gen = irange(0,1<<10) | transformed([](int x)
{
    return exp(Value{x});
});
vector<Value> v(begin(gen),end(gen));

LIVE DEMO

#define BOOST_RESULT_OF_USE_DECLTYPE 
#include <boost/range/adaptor/transformed.hpp>
#include <boost/container/vector.hpp>
#include <boost/range/irange.hpp>
#include <boost/progress.hpp>
#include <boost/range.hpp>
#include <iterator>
#include <iostream>
#include <ostream>
#include <string>
#include <vector>
#include <array>


using namespace std;
using namespace boost;
using namespace adaptors;

#define let const auto&

template<typename T>
void dazzle_optimizer(T &t)
{
    auto volatile dummy = &t; (void)dummy;
}

// _______________________________________ //

using Value = array<int,1 << 16>;
using Vector = container::vector<Value>;

let transformer = [](int x)
{
    return Value{{x}};
};
let indicies = irange(0,1<<10);

// _______________________________________ //

void random_access()
{
    let gen = indicies | transformed(transformer);
    Vector v(boost::begin(gen), boost::end(gen));
    dazzle_optimizer(v);
}

template<bool reserve>
void single_pass()
{
    Vector v;
    if(reserve)
        v.reserve(size(indicies));
    for(let i : indicies)
        v.push_back(transformer(i));
    dazzle_optimizer(v);
}

void cheating()
{
    Vector v;
    v.reserve(size(indicies));
    for(let i : indicies)
        v[i]=transformer(i);
    dazzle_optimizer(v);
}

// _______________________________________ //

int main()
{
    struct
    {
        const char *name;
        void (*fun)();
    } const tests [] =
    {
        {"single_pass, no reserve",&single_pass<false>},
        {"single_pass, reserve",&single_pass<true>},
        {"cheating reserve",&cheating},
        {"random_access",&random_access}
    };
    for(let i : irange(0,3))
        for(let test : tests)
            progress_timer(), // LWS does not support auto_cpu_timer
                (void)i,
                test.fun(),
                cout << test.name << endl;

}
– Evgeny Panasyuk

I'm actually going to suggest, in this case, rolling your own container or looking for alternatives, since, the way I see it, your inherent problem is not that standard containers default-construct elements; it's that you are using a variable-capacity container for data whose capacity can be determined upon construction.

There is no instance where the standard library needlessly default constructs elements. vector only does so for its fill constructor and resize, both of which are conceptually required for a general-purpose container since the point of those is to resize the container to contain valid elements. Meanwhile it's simple enough to do this:

T* mem = static_cast<T*>(malloc(num * sizeof(T)));
for (int j=0; j < num; ++j)
     new (mem + j) T(...); // meaningfully construct T
...
for (int j=0; j < num; ++j)
     mem[j].~T();         // destroy T
free(mem);

... and then build an exception-safe, RAII-conforming container out of the code above. And that's what I suggest in your case: if default construction is wasteful enough to be non-negligible in a fill-constructor context, to the point where the alternative of reserve plus push_back or emplace_back is equally inadequate, then chances are that even a container treating its capacity and size as run-time variables carries non-negligible overhead, at which point you are more than justified in seeking out something else, including rolling your own container from the concept above.

The standard library is pretty damned efficient for what it does in ways where it's incredibly difficult to match in apples to apples comparisons, but in this case your needs call for oranges rather than apples. And in such cases, it often starts to become easier to just reach for an orange directly rather than trying to transform an apple to become an orange.