7

I have profiled the performance between c++ vector and c-style array. The result is a little bit unexpected since the literature says the performance of vector should come very close to raw array, but it does not. Did I do anything wrong in my profiling?

void getVector1(int n)
{
    if (n < 0)
    {
        throw std::invalid_argument(std::string("negative argument n:") + std::to_string(n));
    }

    auto tp1 = std::chrono::steady_clock::now();
    std::vector<int> ivec(n);
    int i = 0;
    for (auto& x : ivec)
    {
        x = ++i;
    }

    auto tp2 = std::chrono::steady_clock::now();
    std::chrono::duration<double, std::micro> dd = tp2 - tp1;

    printf("spend %6.2f us time to create: %d elements vector inside %s() at %s:%d \n", dd.count(), n, __func__, __FILE__, __LINE__);
}


void getVector2(int n)
{
    if (n < 0)
    {
        throw std::invalid_argument(std::string("negative argument n:") + std::to_string(n));
    }

    auto tp1 = std::chrono::steady_clock::now();
    auto pvec = new int[n];

    for (int i = 0; i < n; ++i)
    {
        pvec[i] = i;
    }

    auto tp2 = std::chrono::steady_clock::now();
    std::chrono::duration<double, std::micro> dd = tp2 - tp1;

    delete[] pvec;
    printf("spend %6.2f us time to create: %d elements vector inside %s() at %s:%d \n", dd.count(), n, __func__, __FILE__, __LINE__);
}



int main()
{
    int n = 10000000;
    getVector1(n);
    getVector2(n);

    return 0;
}

The code was compiled using g++ with -O3 option.

spend 11946.38 us time to create: 10000000 elements vector inside getVector1() at testVectorSpeed.cpp

spend 7298.66 us time to create: 10000000 elements vector inside getVector2() at testVectorSpeed.cpp

pchrys
  • 189
  • 3
  • It might be helpful to time the creation separately from the loop that fills it in. – Barmar May 21 '18 at 17:52
  • Also, check the generated code, it might have optimized the second loop out. – Barmar May 21 '18 at 17:53
  • 1
    I would expect a good compiler to realize that you dont use the vector or the array, try use them (outside of the timing). Also you should run the measurement several times and also change the order of the calls – 463035818_is_not_an_ai May 21 '18 at 17:54
  • 1
    @user463035818 heap elision is not in gcc yet AFAIK, so the side effects should still kick in. –  May 21 '18 at 17:57
  • 2
    When you create the vector, there is a call to `memset()`, because the ints are default-constructed. If you used `reserve()` and `emplace_back()` instead, you would probably get much better performance out of it. –  May 21 '18 at 18:00
  • Related: https://stackoverflow.com/q/3664272/1896169 – Justin May 21 '18 at 18:05
  • 3
    One real difference is that elements in vector are default-initialized, but no much in the dynamic array. Try `vec = new int[n]();` – SergeyA May 21 '18 at 18:23
  • If my answer answered your question, you can [accept it](https://meta.stackexchange.com/a/5235/218012) by clicking on the check mark. If it didn't answer your question, can you comment to elaborate on what you are missing? – Justin Jul 06 '18 at 18:21

1 Answers1

8

This cost comes down to vector zeroing out the memory via its allocator.


First, it's always a good idea to use a benchmarking library like google benchmark rather than rolling your own benchmarking. We can use quick-bench.com to quickly use the library. Rewriting your code to use this:

// Just the benchmark code:
void getVector1(benchmark::State& state)
{
    int n = state.range(0);

    for (auto _ : state) {
      std::vector<int> ivec(n);

      // This is the same operation that you are doing
      std::iota(ivec.begin(), ivec.end(), 1);

      // We don't want the compiler to see that we aren't
      // using `ivec` and thus optimize away the entire
      // loop body
      benchmark::DoNotOptimize(ivec);
    }
}

void getArray1(benchmark::State& state)
{
    int n = state.range(0);

    for (auto _ : state) {
      auto pvec = new int[n];

      std::iota(pvec, pvec + n, 1);

      benchmark::DoNotOptimize(pvec);

      delete[] pvec;
    }
}

// Smaller number still reproduces it
BENCHMARK(getVector1)->Arg(10000);
BENCHMARK(getArray1)->Arg(10000);

benchmark OP's code

Click on image for quick-bench link

Through a little playing around, we can find that the cost difference is just the cost of zeroing out the memory with std::uninitialized_fill (on quick-bench).

Indeed, if we instead use an allocator that leaves the memory uninitialized, there is no measurable difference between the two:

// Allocator from https://stackoverflow.com/a/41049640
template <typename T, typename A = std::allocator<T>>
class default_init_allocator : public A {
    typedef std::allocator_traits<A> a_t;
public:
    // http://en.cppreference.com/w/cpp/language/using_declaration
    using A::A; // Inherit constructors from A

    template <typename U> struct rebind {
        using other =
            default_init_allocator
            <  U, typename a_t::template rebind_alloc<U>  >;
    };

    template <typename U>
    void construct(U* ptr)
        noexcept(std::is_nothrow_default_constructible<U>::value) {
        ::new(static_cast<void*>(ptr)) U;
    }

    template <typename U, typename...Args>
    void construct(U* ptr, Args&&... args) {
        a_t::construct(static_cast<A&>(*this),
            ptr, std::forward<Args>(args)...);
    }
};

void getVector1(benchmark::State& state)
{
    int n = state.range(0);

    for (auto _ : state) {
      std::vector<int, default_init_allocator<int>> ivec(n);

      std::iota(ivec.begin(), ivec.end(), 1);

      benchmark::DoNotOptimize(ivec);
    }
}

void getArray1(benchmark::State& state)
{
    int n = state.range(0);

    for (auto _ : state) {
      auto pvec = new int[n];

      std::iota(pvec, pvec + n, 1);

      benchmark::DoNotOptimize(pvec);

      delete[] pvec;
    }
}

BENCHMARK(getVector1)->Arg(10000);
BENCHMARK(getArray1)->Arg(10000);

benchmark uninitialized allocator

Click on image for quick-bench link

Justin
  • 24,288
  • 12
  • 92
  • 142
  • 1
    Honestly I suspect you could get this same result just using `reserve` and `emplace_back` instead. – Mgetz May 21 '18 at 18:51
  • 1
    @Mgetz Using `reserve` and `emplace_back` [is actually really slow](http://quick-bench.com/3y54BE2_w92f64vc9ZtMEv7FFNw) (even if I use `std::back_inserter` with `std::iota`, which is slightly different, but still...). I thought so too, but it's even worse than the original vector code – Justin May 21 '18 at 18:54
  • 2
    Something smells, I'd want to profile that directly to get the hot line – Mgetz May 21 '18 at 19:10
  • @Mgetz Please do! I really want to know why it's worse. I've been using `reserve` + `emplace_back` all over the place in my own code. I'd love to see what you can find – Justin May 21 '18 at 19:11
  • @Mgetz Looking at [the disassembly on godbolt](https://godbolt.org/g/PqdX9f), for gcc it seems that the code keeps checking whether the capacity needs to be increased, even though we know it doesn't. For clang, it seems that clang couldn't produce vectorized instructions with the `reserve` + `emplace_back` version – Justin May 21 '18 at 19:20
  • You'd think that could be optimized out if you make `n` `const` – Mgetz May 21 '18 at 19:26
  • I'd be curious to see how it fairs with something more complicated than `int` to be honest... something that required more complicated construction. – Mgetz May 21 '18 at 19:29
  • @Mgetz Yes, for more complex types, it [performs just fine](http://quick-bench.com/9p7fgdqnB-FsgbGcjvRNs_fFF_8), and if the default constructor does more work, it could out-perform – Justin May 21 '18 at 19:34
  • @Justin I do not see how vector could out perform. AFAIK it always does >= amount of work as the array(in some cases value construction is more expansive than default construction, in some it is the same) – NoSenseEtAl May 21 '18 at 22:33
  • @NoSenseEtAl In that comment, I was referring to using vector by `vector(n)` and `vector[i] = ...` vs `vector.reserve(n)` and `vector.emplace_back(...)`. I was saying that the latter can outperform the former when the default constructor of the type does notable work. For example, suppose `MyType`'s default constructor allocated memory. All those allocations would be wasted, since I'm going to overwrite it anyway. That's why `reserve` + `emplace_back` can outperform `vector(n)` + `vector[i] = ...` – Justin May 21 '18 at 22:38
  • @NoSenseEtAl It's also worth noting that with `new MyType[n]`, the default constructor will also be run, so vector `reserve` + `emplace_back` can outperform in that case – Justin May 21 '18 at 22:42
  • Ah ok. Also There is related proposal (it is for string, not vector) http://open-std.org/JTC1/SC22/WG21/docs/papers/2018/p1072r0.html – NoSenseEtAl May 21 '18 at 22:44