
EDIT: I added two more benchmarks, comparing the use of realloc with the C array and of reserve() with the std::vector. From the latest analysis it seems that realloc has a large influence even though it is called only 30 times. Checking the documentation, I guess this is because realloc can return a completely new pointer, copying the old data. To complete the picture I also added the code and graph for fully allocating the array during initialisation. The difference from reserve() is tangible.

Compile flags: only the optimisation level shown in the graph, compiling with g++ and nothing more.

Original question:

I made a benchmark of std::vector vs a new/delete array, adding 1 billion integers, and the array code is dramatically faster than the one using the vector, especially with optimisation turned on.

I suspect that this is caused by the vector internally calling realloc too many times. This would be the case if vector did not grow by doubling its size every time it fills up (here the number 2 has nothing special; what matters is that the size grows geometrically). In that case the calls to realloc would be only O(log n) instead of O(n).

If this is what causes the slowness of the first code, how can I tell std::vector to grow geometrically?

Note that calling reserve once would work in this case, but not in the more general case in which the number of push_backs is not known in advance.

[benchmark graph]

black line

#include <vector>

int main(int argc, char * argv[]) {
    const unsigned long long size = 1000000000;

    std::vector<int> b(size);
    for (unsigned long long i = 0; i < size; i++) {
        b[i] = i;
    }
    return 0;
}

blue line

#include <vector>

int main(int argc, char * argv[]) {
    const int size = 1000000000;
    std::vector<int> b;
    for (int i = 0; i < size; i++) {
        b.push_back(i);
    }
    return 0;
}

green line

#include <vector>

int main(int argc, char * argv[]) {
    const int size = 1000000000;
    std::vector<int> b;
    b.reserve(size);
    for (int i = 0; i < size; i++) {
        b.push_back(i);
    }
    return 0;
}

red line

int main(int argc, char * argv[]) {
    const int size = 1000000000;
    int * a = new int [size];
    for(int i = 0; i < size; i++) {
        a[i] = i;
    }
    delete [] a;   
    return 0;
}

orange line

#include <cstdlib>

int main(int argc, char * argv[]) {
    const unsigned long long size = 1000000000;
    // Start from a single element so that realloc actually grows the
    // buffer geometrically.
    int * a = (int *)malloc(sizeof(int));
    unsigned long long next_power = 1;
    for (unsigned long long i = 0; i < size; i++) {
        a[i] = i;
        if (i == next_power - 1) {
            next_power *= 2;
            a = (int *)realloc(a, next_power * sizeof(int));
        }
    }
    free(a);
    return 0;
}

[benchmark graph]

EDIT: checking .capacity(), as suggested, we saw that the growth is indeed exponential. So why is the vector so slow?

Nisba
    you can check how the vector allocates memory by checking the `.capacity()` – kmdreko Apr 02 '18 at 16:40
    How much a vector allocates after size outgrows capacity is not specified by the standard AFAIK and thus implementation defined – Hatted Rooster Apr 02 '18 at 16:43
    It does grow geometrically according to my testing. Printing out `b.capacity()`, I get powers of 2. Using GCC. – eesiraed Apr 02 '18 at 16:45
    It's always geometrical. But copying is expensive in this case, even it's geometrical. – llllllllll Apr 02 '18 at 16:47
    Yes, re-allocating memory as size grows is slower than allocating a single block up front. This is not surprising, and this "benchmark" simply confirms what should be obvious. – Pete Becker Apr 02 '18 at 16:47
    While the factor they use (e.g. 1.5x vs. 2x) can vary, a conforming implementation of `vector` must always use a geometric growth rate. `push_back` has amortized constant complexity, which can't be accomplished with an arithmetic growth rate. – Jerry Coffin Apr 02 '18 at 16:47
  • Yes it grows geometrically with a power of two in my machine too, so the slowness is not due to malloc!!! – Nisba Apr 02 '18 at 16:48
    even if it doubles every time, the last step is going from ~2GB to ~4GB has to copy ~2GB of integers to the new space. which does take time – kmdreko Apr 02 '18 at 16:50
    If you don't know how much to `reserve()`, how do you know how much to `new[]`? – Bo Persson Apr 02 '18 at 16:56
    You are comparing apples and oranges. If you want the code to have the same behavior you need `b.reserve(size)` which is basically the same as `int * a = new int [size];`. – NathanOliver Apr 02 '18 at 16:57
  • @BoPersson the test (exponential growth) suggests that the slowness is not due to calling realloc too many times, so one would probably benefit from writing an ad-hoc pure C array, like I did once as an exercise – Nisba Apr 02 '18 at 16:58
  • @NathanOliver I also plotted the reserve() case – Nisba Apr 02 '18 at 17:07
    @Nisba If you are using clang, your C-style example compiles into nothing - compiler removes **all** code. –  Apr 02 '18 at 17:14
    The red test case is [completely optimized out by gcc](https://godbolt.org/g/b8b1HW). Doing anything is slower than doing nothing. – François Andrieux Apr 02 '18 at 17:14
  • It's easier to benchmark things using benchmarking libraries like Google Benchmark. You can do that online at [quick-bench.com](http://quick-bench.com/) – Justin Apr 02 '18 at 17:15
  • @Nisba Your benchmark is very wrong. Optimizer can just optimize away **all** code. You should use `volatile int * a = new int [size];` in your C version. – llllllllll Apr 02 '18 at 17:17
  • @liliscent - Using `volatile` only makes the volatile- qualified thingy ordered with respect to observable side effects. It is more direct simply to create a visible side effect, like for example, returning the vector's final size from main(). For anything other than reading/writing memory-mapped hardware and the like, `volatile` is intrinsically a _dirty word_. It does not mean what 99.9% of programmers think it does. – Jive Dadson Apr 02 '18 at 17:23
  • @Nisba You'll have to double check the assembly but I think what you are seeing here is the compiler optimizing the loop away with the array since it doesn't actually do anything. – NathanOliver Apr 02 '18 at 17:24
  • Since a vector stores its contents in contiguous memory, no different than an array, the vector code can be written the same way as the array code by just using a pointer and incrementing the pointer. So what's being proven here? If you want "pointer speed", then it can be achieved even if you use a vector. – PaulMcKenzie Apr 02 '18 at 17:25
  • @NathanOliver I added `b[i]=i` for this purpose, so it is not enough? – Nisba Apr 02 '18 at 17:25
  • @JiveDadson But here is the correct use of `volatile`, it forces those assignment being actually performed. Many people misusing it doesn't mean it's evil in every case. – llllllllll Apr 02 '18 at 17:25
  • @Nisba Nope. Doing a `std::cout << whatever[size - 1];` should do it or making the variable volatile will do it as well. – NathanOliver Apr 02 '18 at 17:26
  • @PaulMcKenzie I wanted to know if the worst performance of C style array was mainly due to realloc – Nisba Apr 02 '18 at 17:26
  • @liliscent - I just prefer not to speak the name in mixed company. Think of _the children_. :-) Other observable side-effects will do just as well. – Jive Dadson Apr 02 '18 at 17:31
    One thing: AFAIK, vector is not allowed to use `realloc`, due to requirements on allocator usage. I guess in theory, there could be a specialization for the default allocator of Trivally-Copyable types, but I wouldn't think any standard library implementation does that. It would be more fair to compare vs a `malloc`, `memcpy`, `free` – Justin Apr 02 '18 at 17:35
  • Always post your compiler flags when doing something like this. – Jesper Juhl Apr 02 '18 at 17:50
  • @Justin: In fact, `operator new` is user-replaceable, which means you need whole-program optimization and cooperation between compiler and library for `std::vector` to be able to use `realloc` instead of stupidly always copying, even if it could have just left 1GiB of data in place and `mmap`ed new memory at the end. But that's hard so AFAIK no real implementations do this optimization. Similarly, they don't optimize a zero-initializer to `calloc` to take advantage of already-zeroed memory, instead always dirtying it. Makes it more expensive to alloc extra then shrink to fit. – Peter Cordes Apr 05 '18 at 07:22
  • @PeterCordes Indeed. I did think of user-defined `operator new`, but didn't think it was necessary to go into such details. I do wonder if it's undefined behavior to write a user-defined `operator new` for primitive types. If so, it would become doable to have the optimization for `vector` and the like. – Justin Apr 05 '18 at 07:30
  • @Justin: It's not UB, so even primitive types can't have efficient zero-init or realloc for `std::vector` without whole-program optimization. BTW, [Is it guaranteed that C++ standard library containers call the replaceable new functions?](https://stackoverflow.com/questions/46823224/is-it-guaranteed-that-c-standard-library-containers-call-the-replaceable-new-f) was asked specifically to find out if this optimization would be possible for a C++ library without checking if `new` was replaced. – Peter Cordes Apr 05 '18 at 07:53

5 Answers


The optimized C style array is optimized to nothing.

On godbolt:

xorl %eax, %eax
retq

that is the program.

Whenever you have a program that optimizes down to nearly 0 s of runtime, you should consider this possibility.

The optimizer sees you are doing nothing with the memory allocated, notes that allocating memory that is never used may have zero side effects, and eliminates the allocation.

Writing to memory and then never reading it also has zero side effects.

In comparison, the compiler has difficulty proving that the vector's allocation is useless. The compiler developers could probably teach it to recognize unused std::vectors the way it recognizes unused raw C arrays, but that optimization really is a corner case, and in my experience it causes lots of problems when profiling.

Note that the vector-with-reserve at any optimization level is basically the same speed as the unoptimized C style version.

In the C style code, the only thing to optimize is "don't do anything". In the vector code, the unoptimized version is full of extra stack frames and debug checks to ensure you don't go out of bounds (and crash cleanly if you do).

Note that on a Linux system, allocating huge chunks of memory doesn't do anything except fiddle with the virtual memory table. Only when the memory is touched does the kernel actually find some zeroed physical memory for you.

Without reserve, the std::vector has to guess an initial small size, then resize and copy, and repeat. This causes a 50% performance loss, which seems reasonable to me.

With reserve, it actually does the work. The work takes just under 5s.

Adding to a vector via push_back does cause it to grow geometrically. Geometric growth results in an asymptotic average of 2-3 copies of each piece of data being made.


As for realloc, std::vector does not realloc. It allocates a new buffer, and copies the old data, then discards the old one.

Realloc attempts to grow the buffer in place, and if it cannot, it bitwise-copies the buffer.

That is more efficient than std::vector can manage for bitwise-copyable types. I'd bet the realloc version actually never copies; there is always memory space to grow the vector into (in a real program this may not be the case).

The lack of realloc in std library allocators is a minor flaw. You'd have to invent a new API for it, because you'd want it to work for non-bitwise-copyable types too (something like "try to grow the allocated memory", which, if it fails, leaves it up to you to move the elements into a new allocation).

Justin
Yakk - Adam Nevraumont
  • LOL, missed that detail :-) – Barry Apr 02 '18 at 17:33
  • The simplest of questions often attract the greatest answers, which expose the subtle complexities of things that appear on the surface, trivial. +1 – Richard Hodges Apr 02 '18 at 17:45
  • If the OP added something outside the loop to stop the optimizer from from fully discarding the work (e.g. an asm statement, or a call to a non-inline function in another file, with link-time optimization disabled), then you'd see optimizations short of doing nothing: i.e. using SIMD to store 16 or 32 bytes at a time, instead of just one `int` at a time. It can't optimize the loop into call to `memset`, but SIMD `v={0,1,2,3}` ; `v+={4,4,4,4}` is a win. (The copies inside `std::vector` cost most of the time, and should be using `memcpy`, though, or an equivalent optimized implementation.) – Peter Cordes Apr 29 '18 at 04:26
  • Or with whole-program optimization, using `std::vector` could in theory use `realloc` for bitwise-copyable types when `operator new` isn't overloaded. (AFAIK, no real compiler+libc++ implementations do this, unfortunately.) Not being able to take advantage of just mapping new pages onto the end of an existing large allocation is dumb, and on the OP's hardware is a factor of 2 missed optimization. (In real programs, not polluting cache with a giant memcpy is nice, too.) – Peter Cordes Apr 29 '18 at 04:30
  • IIRC, clang + [libc++](https://libcxx.llvm.org/) can sometimes optimize away allocation / deallocation with C++ containers. I didn't check for this loop, though, and I haven't checked g++ with libc++ (only the usual libstdc++) – Peter Cordes Apr 29 '18 at 04:36

when I add 1 billion integers and the second code is dramatically faster than the one using the vector

That's... completely unsurprising. One of your cases involves a dynamically sized container that has to readjust for its load, and the other involves a fixed-size container that doesn't. The latter simply has to do far less work: no branching, no additional allocations. A fair comparison would be:

std::vector<int> b(size);
for(int i = 0; i < size; i++) {
    b[i] = i;
}

This now does the same thing as your array example (well, almost - new int[size] default-initializes all the ints, leaving them with indeterminate values, whereas std::vector<int>(size) zero-initializes them, so the vector is still doing more work).

It doesn't really make sense to compare these two to each other. If the fixed-size int array fits your use case, then use it. If it doesn't, then don't. You either need a dynamically sized container or not. If you do, performing slower than a fixed-size solution is something you're implicitly giving up.


If this is what causes the slowness of the first code, how can I tell std::vector to grow geometrically?

std::vector is already mandated to grow geometrically; it's the only way to maintain amortized O(1) push_back complexity.

Barry

Is the poor performance of std::vector due to not calling realloc a logarithmic number of times?

Your test neither supports that conclusion, nor does it prove the opposite. However, I would assume that reallocation is called a linear number of times unless there is contrary evidence.

Update: Your new test is apparently evidence against your non-logarithmic reallocation hypothesis.

I suspect that this is caused by the vector internally calling realloc too many times.

Update: Your new test shows that some of the difference is due to reallocations... but not all. I suspect that the remainder is due to the fact that the optimizer was able to prove (but only in the case of the non-growing array) that the array values are unused, and chose not to loop and write them at all. If you were to make sure that the written values are actually used, then I would expect the non-growing array to have similar optimized performance to the reserving vector.

The difference (between the reserving code and the non-reserving vector) in the optimized build is most likely due to the extra reallocations (compared to none for the reserving version). Whether the number of reallocations is too high is situational and subjective. The downside of doing fewer reallocations is more wasted space due to overallocation.

Note that the cost of reallocation of large arrays comes primarily from copying of elements, rather than memory allocation itself.

In an unoptimized build, there is probably additional linear overhead due to function calls that weren't expanded inline.

how can I tell std::vector to grow geometrically?

Geometric growth is required by the standard. There is no way and no need to tell std::vector to use geometric growth.

Note that calling reserve once would work in this case but not in the more general case in which the number of push_back is not known in advance.

However, a general case in which the number of push_back is not known in advance is a case where the non-growing array isn't even an option and so its performance is irrelevant for that general case.

eerorika
  • My test about the geometric growth of the size supports that realloc is not the cause of the slowness, since it is called only 30 times – Nisba Apr 02 '18 at 17:00
  • @Nisba Are you implying that 30 reallocation cannot take a lot of time? Technically speaking `realloc` is not used here, as vector is stupid and just allocates a new memory block, copying all data to it. Actually using `realloc` could yield better results as it can avoid copying data in some cases and can even "move" data on some OSes. –  Apr 02 '18 at 17:05
  • @Nisba 30 reallocations doesn't prove that it's not cause of the slowness. Remember that reallocating is a linear operation. A single reallocation of a vector of 100 million elements requires copying of 100 million elements. Memory access is not cheap. – eerorika Apr 02 '18 at 17:06
  • @Nisba I removed my `realloc` update, since I feel less confident about that since I now notice that it's even faster than `reserve`. I'm not sure, but could be still related to something being optimized away due to the array not being used. – eerorika Apr 02 '18 at 18:02
  • @user2079303 why does your answer still say there is evidence against the non-logarithmic reallocation hypothesis? malloc is called a logarithmic number of times (and yes, it is not realloc: someone pointed out that vector internally calls malloc, not realloc). – Nisba Apr 02 '18 at 18:03
  • @Nisba Well, if malloc is called a logarithmic number of times, then surely that is strong evidence against the hypothesis that reallocation happens more often than logarithmically. (It's reallocation (i.e. allocating again), even if the function `realloc` isn't used) – eerorika Apr 02 '18 at 18:05

This isn't comparing geometric growth to arithmetic (or any other) growth. It's comparing pre-allocating all the space necessary to growing the space as needed. So let's start by comparing std::vector to some code that actually does use geometric growth, and use both in ways that put the geometric growth to use.1 Here's a simple class that does geometric growth:

#include <cstddef>   // for std::size_t
#include <stdexcept> // for std::out_of_range
#include <utility>   // for std::swap

class my_vect {
    int *data;
    size_t current_used;
    size_t current_alloc;
public:

    my_vect()
        : data(nullptr)
        , current_used(0)
        , current_alloc(0)
    {}

    void push_back(int val) { 
        if (nullptr == data) {
            data = new int[1];
            current_alloc = 1;
        }
        else if (current_used == current_alloc)  {
            int *temp = new int[current_alloc * 2];
            for (size_t i=0; i<current_used; i++)
                temp[i] = data[i];
            std::swap(temp, data);
            delete [] temp;
            current_alloc *= 2;
        }
        data[current_used++] = val;
    }

    int &at(size_t index) { 
        if (index >= current_used)
            throw std::out_of_range("my_vect::at");
        return data[index];
    }

    int &operator[](size_t index) { 
        return data[index];
    }

    ~my_vect() { delete [] data; }
};

...and here's some code to exercise it (and do the same with std::vector):

#include <chrono>
#include <iomanip>
#include <iostream>
#include <locale>
#include <vector>

int main() { 
    std::locale out("");
    std::cout.imbue(out);
    using namespace std::chrono;
    std::cout << "my_vect\n";
    for (int size = 100; size <= 1000000000; size *= 10) {
        auto start = high_resolution_clock::now();

        my_vect b;
        for(int i = 0; i < size; i++) {
            b.push_back(i);
        }    

        auto stop = high_resolution_clock::now();

        std::cout << "Size: " << std::setw(15) << size << ", Time: " << std::setw(15) << duration_cast<microseconds>(stop-start).count() << " us\n";
    }

    std::cout << "\nstd::vector\n";

    for (int size = 100; size <= 1000000000; size *= 10) {
        auto start = high_resolution_clock::now();

        std::vector<int> b;
        for (int i = 0; i < size; i++) {
            b.push_back(i);
        }

        auto stop = high_resolution_clock::now();

        std::cout << "Size: " << std::setw(15) << size << ", Time: " << std::setw(15) << duration_cast<microseconds>(stop - start).count() << " us\n";
    }
}

I compiled this with g++ -std=c++14 -O3 my_vect.cpp. When I execute that, I get this result:

my_vect
Size:             100, Time:               8 us
Size:           1,000, Time:              23 us
Size:          10,000, Time:             141 us
Size:         100,000, Time:             950 us
Size:       1,000,000, Time:           8,040 us
Size:      10,000,000, Time:          51,404 us
Size:     100,000,000, Time:         442,713 us
Size:   1,000,000,000, Time:       7,936,013 us

std::vector
Size:             100, Time:              40 us
Size:           1,000, Time:               4 us
Size:          10,000, Time:              29 us
Size:         100,000, Time:             426 us
Size:       1,000,000, Time:           3,730 us
Size:      10,000,000, Time:          41,294 us
Size:     100,000,000, Time:         363,942 us
Size:   1,000,000,000, Time:       5,200,545 us

I undoubtedly could optimize my_vect to keep up with std::vector (e.g., initially allocating space for, say, 256 items would probably help a great deal). I haven't done enough runs (and statistical analysis) to be at all sure that std::vector is dependably faster than my_vect either. Nonetheless, this seems to indicate that when we compare apples to apples, we get results that are at least roughly comparable (i.e., within a fairly small, constant factor of each other).


1. As a side note, I feel obliged to point out that this still doesn't really compare apples to apples--but at least as long as we're only instantiating std::vector over int, many of the obvious differences are basically covered up.

Jerry Coffin

This post includes:

  1. Wrapper classes over realloc and mremap to provide reallocation functionality.
  2. A custom vector class.
  3. A performance test.

// C++17
#include <benchmark/benchmark.h> // Google Benchmark library.

#include <new>     // For std::bad_alloc.
#include <memory>  // For std::allocator_traits, std::uninitialized_move.
#include <cstdlib> // For C heap management API.
#include <cstddef> // For std::size_t, std::max_align_t.
#include <cassert> // For assert.
#include <utility> // For std::forward, std::declval,

namespace linux {
#include <sys/mman.h> // For mmap, mremap, munmap.
#include <errno.h>    // For errno.
auto get_errno() noexcept {
    return errno;
}
}

/*
 * Allocators.
 * These allocators will have non-standard compliant behavior if the type T's cp ctor has side effect.
 */

// class mrealloc is useful for allocating small spaces for
// std::vector.
//
// It can prevent copying of data and memory fragmentation if there is
// enough contiguous memory at the original place.
template <class T>
struct mrealloc {
    using pointer = T*;
    using value_type = T;

    auto allocate(std::size_t len) {
        if (auto ret = std::malloc(len))
            return static_cast<pointer>(ret);
        else
            throw std::bad_alloc();
    }
    auto reallocate(pointer old_ptr, std::size_t old_len, std::size_t len) {
        if (auto ret = std::realloc(old_ptr, len))
            return static_cast<pointer>(ret);
        else
            throw std::bad_alloc();
    }
    void deallocate(void *ptr, std::size_t len) noexcept {
        std::free(ptr);
    }
};

// class mmaprealloc is suitable for large memory use.
//
// It is useful for situations where std::vector can grow to a huge
// size.
//
// The user can call reserve without worrying about wasting a lot of memory.
//
// It can prevent data copying and memory fragmentation at any time.
template <class T>
struct mmaprealloc {
    using pointer = T*;
    using value_type = T;

    auto allocate(std::size_t len) const
    {
        return allocate_impl(len, MAP_PRIVATE | MAP_ANONYMOUS);
    }
    auto reallocate(pointer old_ptr, std::size_t old_len, std::size_t len) const
    {
        return reallocate_impl(old_ptr, old_len, len, MREMAP_MAYMOVE);
    }
    void deallocate(pointer ptr, std::size_t len) const noexcept
    {
        assert(linux::munmap(ptr, len) == 0);
    }
  protected:
    auto allocate_impl(std::size_t _len, int flags) const
    {
        // mmap reports failure with MAP_FAILED, not a null pointer.
        auto ret = linux::mmap(nullptr, get_proper_size(_len), PROT_READ | PROT_WRITE, flags, -1, 0);
        if (ret != MAP_FAILED)
            return static_cast<pointer>(ret);
        else
            fail(EAGAIN | ENOMEM);
    }
    auto reallocate_impl(pointer old_ptr, std::size_t old_len, std::size_t _len, int flags) const
    {
        // mremap likewise reports failure with MAP_FAILED.
        auto ret = linux::mremap(old_ptr, old_len, get_proper_size(_len), flags);
        if (ret != MAP_FAILED)
            return static_cast<pointer>(ret);
        else
            fail(EAGAIN | ENOMEM);
    }

    static inline constexpr const std::size_t magic_num = 4096 - 1;
    static inline auto get_proper_size(std::size_t len) noexcept -> std::size_t {
        return round_to_pagesize(len);
    }
    static inline auto round_to_pagesize(std::size_t len) noexcept -> std::size_t {
        return (len + magic_num) & ~magic_num;
    }

    static inline void fail(int assert_val)
    {
        auto _errno = linux::get_errno();
        // errno values are not bit flags; check each plausible cause.
        assert(_errno == EAGAIN || _errno == ENOMEM);
        (void)_errno; (void)assert_val;
        throw std::bad_alloc();
    }
};

template <class T>
struct mmaprealloc_populate_ver: mmaprealloc<T> {
    auto allocate(size_t len) const
    {
        return mmaprealloc<T>::allocate_impl(len, MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE);
    }
};

namespace impl {
struct disambiguation_t2 {};
struct disambiguation_t1 {
    constexpr operator disambiguation_t2() const noexcept { return {}; }
};
template <class Alloc>
static constexpr auto has_reallocate(disambiguation_t1) noexcept -> decltype(&Alloc::reallocate, bool{}) { return true; }
template <class Alloc>
static constexpr bool has_reallocate(disambiguation_t2) noexcept { return false; }
template <class Alloc>
static inline constexpr const bool has_reallocate_v = has_reallocate<Alloc>(disambiguation_t1{});
} /* impl */

template <class Alloc>
struct allocator_traits: public std::allocator_traits<Alloc> {
    using Base = std::allocator_traits<Alloc>;
    using value_type = typename Base::value_type;
    using pointer = typename Base::pointer;
    using size_t = typename Base::size_type;

    static auto reallocate(Alloc &alloc, pointer prev_ptr, size_t prev_len, size_t new_len) {
        if constexpr(impl::has_reallocate_v<Alloc>)
            return alloc.reallocate(prev_ptr, prev_len, new_len);
        else {
            auto new_ptr = Base::allocate(alloc, new_len);

            // Move existing array
            for(auto _prev_ptr = prev_ptr, _new_ptr = new_ptr; _prev_ptr != prev_ptr + prev_len; ++_prev_ptr, ++_new_ptr) {
                new (_new_ptr) value_type(std::move(*_prev_ptr));
                _new_ptr->~value_type();
            }
            Base::deallocate(alloc, prev_ptr, prev_len);

            return new_ptr;
        }
    }
};

template <class T, class Alloc = std::allocator<T>>
struct vector: protected Alloc {
    using alloc_traits = allocator_traits<Alloc>;
    using pointer = typename alloc_traits::pointer;
    using size_t = typename alloc_traits::size_type;
    pointer ptr = nullptr;
    size_t last = 0;
    size_t avail = 0;

    ~vector() noexcept {
        alloc_traits::deallocate(*this, ptr, avail);
    }

    template <class ...Args>
    void emplace_back(Args &&...args) {
        if (last == avail)
            double_the_size();
        alloc_traits::construct(*this, &ptr[last++], std::forward<Args>(args)...);
    }
    void double_the_size() {
        if (__builtin_expect(!!(avail), true)) {
            avail <<= 1;
            ptr = alloc_traits::reallocate(*this, ptr, last, avail);
        } else {
            avail = 1 << 4;
            ptr = alloc_traits::allocate(*this, avail);
        }
    }
};

template <class T>
static void BM_vector(benchmark::State &state) {
    for(auto _: state) {
        T c;
        for(auto i = state.range(0); --i >= 0; )
            c.emplace_back((char)i);
    }
}

static constexpr const auto one_GB = 1 << 30;

BENCHMARK_TEMPLATE(BM_vector, vector<char>)                                ->Range(1 << 3, one_GB);
BENCHMARK_TEMPLATE(BM_vector, vector<char, mrealloc<char>>)                ->Range(1 << 3, one_GB);
BENCHMARK_TEMPLATE(BM_vector, vector<char, mmaprealloc<char>>)             ->Range(1 << 3, one_GB);
BENCHMARK_TEMPLATE(BM_vector, vector<char, mmaprealloc_populate_ver<char>>)->Range(1 << 3, one_GB);
BENCHMARK_MAIN();
Performance test

All the performance tests were done on:

Debian 9.4, Linux version 4.9.0-6-amd64 (debian-kernel@lists.debian.org)(gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1) ) #1 SMP Debian 4.9.82-1+deb9u3 (2018-03-02)

Compiled using clang++ -std=c++17 -lbenchmark -lpthread -Ofast main.cc

The command I used to run this test:

sudo cpupower frequency-set --governor performance
./a.out

Here's the output of google benchmark test:

Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Data 32K (x4)
  L1 Instruction 32K (x4)
  L2 Unified 256K (x4)
  L3 Unified 6144K (x1)
----------------------------------------------------------------------------------------------------------
Benchmark                                                                   Time           CPU Iterations
----------------------------------------------------------------------------------------------------------
BM_vector<vector<char>>/8                                                  58 ns         58 ns   11476934
BM_vector<vector<char>>/64                                                324 ns        324 ns    2225396
BM_vector<vector<char>>/512                                              1527 ns       1527 ns     453629
BM_vector<vector<char>>/4096                                             7196 ns       7196 ns      96695
BM_vector<vector<char>>/32768                                           50145 ns      50140 ns      13655
BM_vector<vector<char>>/262144                                         549821 ns     549825 ns       1245
BM_vector<vector<char>>/2097152                                       5007342 ns    5006393 ns        146
BM_vector<vector<char>>/16777216                                     42873349 ns   42873462 ns         15
BM_vector<vector<char>>/134217728                                   336225619 ns  336097218 ns          2
BM_vector<vector<char>>/1073741824                                 2642934606 ns 2642803281 ns          1
BM_vector<vector<char, mrealloc<char>>>/8                                  55 ns         55 ns   12914365
BM_vector<vector<char, mrealloc<char>>>/64                                266 ns        266 ns    2591225
BM_vector<vector<char, mrealloc<char>>>/512                              1229 ns       1229 ns     567505
BM_vector<vector<char, mrealloc<char>>>/4096                             6903 ns       6903 ns     102752
BM_vector<vector<char, mrealloc<char>>>/32768                           48522 ns      48523 ns      14409
BM_vector<vector<char, mrealloc<char>>>/262144                         399470 ns     399368 ns       1783
BM_vector<vector<char, mrealloc<char>>>/2097152                       3048578 ns    3048619 ns        229
BM_vector<vector<char, mrealloc<char>>>/16777216                     24426934 ns   24421774 ns         29
BM_vector<vector<char, mrealloc<char>>>/134217728                   262355961 ns  262357084 ns          3
BM_vector<vector<char, mrealloc<char>>>/1073741824                 2092577020 ns 2092317044 ns          1
BM_vector<vector<char, mmaprealloc<char>>>/8                             4285 ns       4285 ns     161498
BM_vector<vector<char, mmaprealloc<char>>>/64                            5485 ns       5485 ns     125375
BM_vector<vector<char, mmaprealloc<char>>>/512                           8571 ns       8569 ns      80345
BM_vector<vector<char, mmaprealloc<char>>>/4096                         24248 ns      24248 ns      28655
BM_vector<vector<char, mmaprealloc<char>>>/32768                       165021 ns     165011 ns       4421
BM_vector<vector<char, mmaprealloc<char>>>/262144                     1177041 ns    1177048 ns        557
BM_vector<vector<char, mmaprealloc<char>>>/2097152                    9229860 ns    9230023 ns         74
BM_vector<vector<char, mmaprealloc<char>>>/16777216                  75425704 ns   75426431 ns          9
BM_vector<vector<char, mmaprealloc<char>>>/134217728                607661012 ns  607662273 ns          1
BM_vector<vector<char, mmaprealloc<char>>>/1073741824              4871003928 ns 4870588050 ns          1
BM_vector<vector<char, mmaprealloc_populate_ver<char>>>/8                3956 ns       3956 ns     175037
BM_vector<vector<char, mmaprealloc_populate_ver<char>>>/64               5087 ns       5086 ns     133944
BM_vector<vector<char, mmaprealloc_populate_ver<char>>>/512              8662 ns       8662 ns      80579
BM_vector<vector<char, mmaprealloc_populate_ver<char>>>/4096            23883 ns      23883 ns      29265
BM_vector<vector<char, mmaprealloc_populate_ver<char>>>/32768          158374 ns     158376 ns       4444
BM_vector<vector<char, mmaprealloc_populate_ver<char>>>/262144        1171514 ns    1171522 ns        593
BM_vector<vector<char, mmaprealloc_populate_ver<char>>>/2097152       9297357 ns    9293770 ns         74
BM_vector<vector<char, mmaprealloc_populate_ver<char>>>/16777216     75140789 ns   75141057 ns          9
BM_vector<vector<char, mmaprealloc_populate_ver<char>>>/134217728   636359403 ns  636368640 ns          1
BM_vector<vector<char, mmaprealloc_populate_ver<char>>>/1073741824 4865103542 ns 4864582150 ns          1
JiaHao Xu
  • sys/mman.h is POSIX standard, not glibc-specific. You're using `posix::mremap`, but you put the header in `namespace glibc {}`. So that's doubly backwards. Besides not compiling, `mremap` (unlike mmap / munmap) is Linux-only, so `glibc::mremap` would be semi-appropriate, but glibc is portable to other kernels (at least HURD I think, maybe FreeBSD) so really you should use `linux::mremap` unless other kernels have it. Also, you said "may" need a custom vector class to take advantage of this. Did you check if any existing `std::vector` implementations will take advantage of this directly? – Peter Cordes Apr 29 '18 at 13:46
  • Your `mmaprealloc` returns pointers that are only 8-byte aligned (or 4-byte on 32-bit), violating the ABI. (x86-64 System V's `max_align_t` has 16-byte alignment for `long double` and something else. The i386 SysV ABI has 8-byte `max_align_t` for something, at least for `atomic`). Use `ret += alignof(max_align_t) / sizeof(size_t)` or something, and `static_assert` that it divides evenly. – Peter Cordes Apr 29 '18 at 13:51
  • I'd like to upvote this if you fix those bugs and include a little more description of if / how you could *actually* get the benefit of this for a `std::vector` with glibc on Linux, for example. Presumably you'd have to hack the `<vector>` header yourself? – Peter Cordes Apr 29 '18 at 13:56
  • I think the alignment is not a problem for mmaprealloc, because memory addresses returned by mmap and mremap are always page-aligned. – JiaHao Xu Apr 29 '18 at 23:00
  • Right, but then you offset them by *one* `sizeof(size_t)`, which is less than `alignof(max_align_t)`. You use the first 4 or 8 bytes of the page to store the length of the allocation, meaning the pointer you actually return is only 4 or 8 byte aligned. – Peter Cordes Apr 30 '18 at 04:51
  • Just noticed you're using `MAP_UNINITIALIZED`. That reminds me, `std::vector` + compilers don't know how to take advantage of `calloc` to construct zero-initialized objects. They always dirty pages with a memset equivalent, even if they happen to come straight from the OS with `mmap` (without MAP_UNINITIALIZED, which most systems ignore anyway for security). So if you're providing new interfaces for containers to use, a `calloc` would be good. – Peter Cordes Apr 30 '18 at 05:00
  • @Peter Cordes Both fixed. – JiaHao Xu Apr 30 '18 at 06:30