
I'm aware that vectors with resize and/or push_back are slower than plain arrays due to the additional construction and copying of objects. Most questions asked before about the slowness of vectors concern resize/push_back. The usual advice is to use reserve+emplace_back because it is faster than push_back.

I've done some benchmarks according to which reserve+emplace_back is much slower than a plain array. Why? Am I missing something that makes the comparison unfair?

Code:

#include <iostream>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <vector>
#include <chrono>
#include <string>
#include <algorithm>
#include <climits>
using tp = std::chrono::steady_clock::time_point;
class Time {
public:
    static void show(tp t1, tp t2) { // time elapsed between t1 and t2
        std::cout << std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count() << '\t';
        printf("nanoseconds \n");
    }
    tp add() {
        tp p = std::chrono::steady_clock::now();
        return p;
    }
};

int main()
{
    Time time;
    const int VSIZE = 1000000;
    auto t1 = time.add();
    double vsizearr[VSIZE];
    for (auto i = 0; i < VSIZE; i++) {
        vsizearr[i]=i;
        asm volatile("" : : : "memory"); // prevent the compiler from eliding the loop
    }
    auto t2 = time.add();
    std::vector<double> second;
    second.reserve(VSIZE);
    for (auto i = 0; i < VSIZE; i++) {
        second.emplace_back(i);
        asm volatile("" : : : "memory"); // prevent the compiler from eliding the loop
    }
    auto t3 = time.add();
    time.show(t1, t2);
    time.show(t2, t3);
    return 0;
}

Benchmark results:

Windows 10, C++17, LLVM-clang, -O1

306300  nanoseconds
1824700 nanoseconds

Ubuntu 20.04, C++17, g++ and clang

root@vlad-VirtualBox:/home/vlad/Documents clang++ -O1 -std=c++17 -o test test.cpp
root@vlad-VirtualBox:/home/vlad/Documents ./test
284340 nanoseconds 
3115997 nanoseconds 
root@vlad-VirtualBox:/home/vlad/Documents g++ -O1 -std=c++17 -o test test.cpp
root@vlad-VirtualBox:/home/vlad/Documents ./test
284340 nanoseconds 
3145702 nanoseconds 

-O3 results are about the same as -O1.


1 Answer


The benchmark is biased: it does not measure what you think it measures. Indeed, both GCC 12.2 and Clang 15.0 optimize the first loop so that nothing is actually written. See this on GodBolt.

Here is the code for the first loop with GCC and Clang:

; [GCC]
        mov     eax, 1000000
.L31:
        sub     eax, 1
        jne     .L31

------------------------------

; [Clang]
        mov     ebx, 1000000
.LBB0_1:
        dec     ebx
        jne     .LBB0_1

As you can see, the loop is empty and useless, so it is not surprising that it is fast. The compilers optimize the stores away since the array is never read.
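
If you want the array stores to be kept, the array's address has to be made visible to the inline assembly so the compiler must assume the memory can be read. Here is a minimal sketch of that approach; the escape helper (a well-known trick) is my own addition, not part of your original code:

#include <cstddef>

// Passing the pointer as an input operand forces the compiler to assume
// the asm statement may read the pointed-to memory, so the preceding
// stores into the array cannot be removed.
inline void escape(void* p) {
    asm volatile("" : : "g"(p) : "memory");
}

int main() {
    const int VSIZE = 1000000;
    static double vsizearr[VSIZE]; // static avoids an ~8 MB stack frame
    for (int i = 0; i < VSIZE; i++) {
        vsizearr[i] = i;
        escape(vsizearr); // the array is now "observed" on every iteration
    }
    return 0;
}
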

For the second loop, here is the generated assembly:

; [GCC]
.L43:
        pxor    xmm0, xmm0
        cvtsi2sd        xmm0, eax
        movsd   QWORD PTR [rsi], xmm0
        add     rsi, 8
        mov     QWORD PTR [rsp+24], rsi
.L33:
        mov     eax, DWORD PTR [rsp+12]
        add     eax, 1
        mov     DWORD PTR [rsp+12], eax
        cmp     eax, 999999
        jg      .L42
.L34:
        mov     rsi, QWORD PTR [rsp+24]
        cmp     rsi, QWORD PTR [rsp+32]
        jne     .L43
        lea     rdx, [rsp+12]
        lea     rdi, [rsp+16]
        call    void std::vector<double, std::allocator<double> >::_M_realloc_insert<int&>(__gnu_cxx::__normal_iterator<double*, std::vector<double, std::allocator<double> > >, int&)
        jmp     .L33

------------------------------

; [Clang]
.LBB0_5:                                #   in Loop: Header=BB0_4 Depth=1
        xorps   xmm0, xmm0
        cvtsi2sd        xmm0, r15d
        movsd   qword ptr [rbx], xmm0
.LBB0_32:                               #   in Loop: Header=BB0_4 Depth=1
        add     rbx, 8
        inc     r15d
        cmp     r15d, 1000000
        je      .LBB0_6
.LBB0_4:                                # =>This Inner Loop Header: Depth=1
        cmp     rbx, rcx
        jne     .LBB0_5
[...]

As you can see, GCC does not optimize away the call to _M_realloc_insert. Both compilers keep the array writes, not to mention the expensive int-to-double conversion. There is also non-trivial logic on every iteration to check whether the vector needs to grow. This is kept because the compiler can hardly prove that this part is not actually needed.
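
A fairer comparison keeps both loops observable, for example by heap-allocating the array (like the vector's storage) and reading both results after the timed sections. Below is a rough sketch of what I mean, not a rigorous benchmark; the names arr, vec and sum are mine:

#include <chrono>
#include <cstdio>
#include <memory>
#include <vector>

int main() {
    const int VSIZE = 1000000;
    using clk = std::chrono::steady_clock;

    auto arr = std::make_unique<double[]>(VSIZE); // heap storage, like the vector

    auto t1 = clk::now();
    for (int i = 0; i < VSIZE; i++)
        arr[i] = i;
    auto t2 = clk::now();

    std::vector<double> vec;
    vec.reserve(VSIZE);
    for (int i = 0; i < VSIZE; i++)
        vec.emplace_back(i);
    auto t3 = clk::now();

    // Read both results so neither loop can be optimized out.
    double sum = 0.0;
    for (int i = 0; i < VSIZE; i++)
        sum += arr[i] + vec[i];

    std::printf("array:  %lld ns\nvector: %lld ns\n(sum = %f)\n",
        (long long)std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count(),
        (long long)std::chrono::duration_cast<std::chrono::nanoseconds>(t3 - t2).count(),
        sum);
    return 0;
}

Even then, the vector side still performs the capacity check inside emplace_back on every iteration, so some difference may remain; but at least it no longer compares real work against an empty loop.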
