
There is an 'erase-remove' idiom in C++ when it comes to removing several elements from containers, and there are discussions about an alternative 'resize-remove' way, e.g., here. People say that 'erase-remove' is better than 'resize-remove', but according to my tests, the latter is (slightly) faster on vectors. So, should I use 'resize-remove' when it comes to vectors?
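For reference, the two idioms being compared, sketched on a plain `std::vector<int>` (the function names are mine, not standard terminology):

```cpp
#include <algorithm>
#include <vector>

// 'erase-remove': remove_if left-packs the kept elements, erase trims the tail.
std::vector<int> via_erase(std::vector<int> v, int thresh) {
    v.erase(std::remove_if(v.begin(), v.end(),
                           [thresh](int x) { return x < thresh; }),
            v.end());
    return v;
}

// 'resize-remove': the same remove_if, but the tail is dropped with resize().
std::vector<int> via_resize(std::vector<int> v, int thresh) {
    v.resize(std::distance(
        v.begin(), std::remove_if(v.begin(), v.end(),
                                  [thresh](int x) { return x < thresh; })));
    return v;
}
```

Both produce identical contents; the question is only about speed.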

Here's my benchmarking code:

#include <benchmark/benchmark.h>

#include <algorithm>
#include <functional>
#include <iostream>
#include <random>
#include <vector>

using namespace std;

constexpr size_t N_ELEMS = 1000000;
constexpr int MAX_VAL = N_ELEMS / 10;
constexpr int THRESH = MAX_VAL / 5 * 3;

static vector<int> generate_input() {
  vector<int> nums(N_ELEMS);

  std::random_device rd;
  std::mt19937 gen(rd());
  std::uniform_int_distribution<> dist(0, N_ELEMS);

  std::generate(nums.begin(), nums.end(), std::bind(dist, std::ref(gen)));

  return nums;  // no std::move: it would inhibit NRVO / copy elision
}

static void bm_erase_remove(benchmark::State &state) {
  for (auto _ : state) {
    state.PauseTiming();
    auto nums = generate_input();
    state.ResumeTiming();
    nums.erase(std::remove_if(nums.begin(), nums.end(),
                              [](int x) { return x < THRESH; }),
               nums.end());
    benchmark::DoNotOptimize(nums);
  }
}
BENCHMARK(bm_erase_remove);

static void bm_resize_remove(benchmark::State &state) {
  for (auto _ : state) {
    state.PauseTiming();
    auto nums = generate_input();
    state.ResumeTiming();

    nums.resize(std::distance(
        nums.begin(), std::remove_if(nums.begin(), nums.end(),
                                     [](int x) { return x < THRESH; })));
    benchmark::DoNotOptimize(nums);
  }
}
BENCHMARK(bm_resize_remove);

BENCHMARK_MAIN();

Output:

$ g++ main.cpp -lbenchmark -O3 -pthread
$ ./a.out 
2023-05-24T20:07:22+08:00
Running ./a.out
Run on (16 X 3193.91 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 512 KiB (x8)
  L3 Unified 16384 KiB (x1)
Load Average: 0.16, 0.14, 0.16
-----------------------------------------------------------
Benchmark                 Time             CPU   Iterations
-----------------------------------------------------------
bm_erase_remove      822789 ns       759162 ns          838
bm_resize_remove     818217 ns       754749 ns          935

The difference is larger when using clang++:

$ clang++ main.cpp -lbenchmark -O3 -pthread
$ ./a.out
Load Average: 0.25, 0.18, 0.17
-----------------------------------------------------------
Benchmark                 Time             CPU   Iterations
-----------------------------------------------------------
bm_erase_remove     1165085 ns      1074667 ns          611
bm_resize_remove     958856 ns       884584 ns          782

Extra information:

  • g++'s version is 13.1.1, clang++'s version is 15.0.7
  • I'm using Arch Linux on WSL, kernel version is 5.15.90.1-microsoft-standard-WSL2
  • CPU Model is AMD Ryzen 7 6800H with Radeon Graphics

UPDATE: What is interesting is that, when I run the benchmarks individually (using the benchmark_filter option), the results are equivalent. Is this because of caching? If so, how does the caching mechanism work?

UPDATE (2023/5/25): If the two BENCHMARK statements get swapped, the opposite result is shown.

rhanqtl
  • Interesting observation. You might want to try with elements whose destructor actually does something to see how that affects the timings. – 463035818_is_not_an_ai May 24 '23 at 13:36
  • On my Skylake i7-6700k @ 3.9GHz desktop (Arch Linux on bare metal), clang timings for both functions are also very close if I apply [How can I mitigate the impact of the Intel jcc erratum on gcc?](https://stackoverflow.com/q/61256646) - `clang++ -O3 -mbranches-within-32B-boundaries -lbenchmark benchmark-erase.cpp -o bench-clang-skylake` - 921508 ns vs. 917859 ns. Or another run with 921106 ns vs. 921149 ns CPU time. Without that workaround for CPU performance potholes introduced by microcode updates, I got times more like yours, presumably caused by bad luck of code alignment. – Peter Cordes May 24 '23 at 13:45
  • It seems on my system, the measurement noise comes out in favour of `erase` as often as it does in favour of `remove`, with clang compiled to avoid SKL performance problems. With GCC, `resize` does come out consistently faster, like 933 us vs. 885 us, or 922 us vs. 905 us. – Peter Cordes May 24 '23 at 13:50
  • It is nearly the same. Please try to change the order of function calls. It should show the opposite result. CPU is probably already on "turbo mode" when you leave the first call but in the first call the CPU needs to leave the "energy saving mode" first. – no one special May 24 '23 at 13:51
  • *Is this because of caching?* - Unlikely. Even if the memory for the `std::vector` was fresh from the kernel (and would page-fault) vs. from the free list and hits in L3 cache, `generate_input` writes every element with timing paused. So assuming that pausing the timing works without introducing much variation itself, the situation is about the same regardless. Perhaps there's some branch aliasing that lets it "learn" that e.g. the loop branch is almost always taken? I'd be a little suspicious of pausing timing to run a slow PRNG like MT. – Peter Cordes May 24 '23 at 13:56
  • Erasing elements from the end is a special case for `std::vector::erase`, but if an erasure happens in `std::vector::resize`, it always happens at the end. It's likely that the additional call to [`std::move(last, end(), first);`](https://en.cppreference.com/w/cpp/algorithm/move) (or equivalent logic) results in the decreased performance even though no elements are actually moved. – fabian May 24 '23 at 13:56
  • What CPU model do you have? Is it Skylake-derived and thus affected by the JCC erratum? If so, does enabling compiler workarounds for it make a difference? Since `benchmark_filter` doesn't involve recompiling, a benchmark loop that's slowed down by that should be slow whether or not earlier or later code ran, though. – Peter Cordes May 24 '23 at 14:01
  • @PeterCordes It's AMD R7 6800H. – rhanqtl May 25 '23 at 06:48
  • @fabian Could you please explain where the call to `std::move` happens? – rhanqtl May 25 '23 at 06:52
  • Ok, so that's a Zen 3 microarchitecture. AMD CPUs constantly adjust their max boost clock speed according to thermal and power conditions, so microbenchmarking is hard, especially in a laptop where thermal constraints are more likely to be a limiting factor. If you can disable boost clocks, that might give more stable results. Your array size (3.8 MiB) is smaller than your L3 cache size (16MiB) by a comfortable margin, and plenty larger than the L2 cache of a single core, so you're not right on the edge of a cache size. – Peter Cordes May 25 '23 at 06:54
  • *If the two BENCHMARK statements get swapped, a totally opposite result is showed.* - Then the difference you're seeing isn't due to anything important in the C++ source, as expected since these two tests should be doing the same amount of work. Erasing from the tail should be equivalent to resize. [Idiomatic way of performance evaluation?](https://stackoverflow.com/q/60291987) discusses various factors like warm-up effects that can lead to variations, and mentions running tests in the other order as a way to see if there's a real effect or not. – Peter Cordes May 25 '23 at 06:59
  • @rhanqtl Let's say you erase elements at indices 1 to 3 from a vector `v` of 10, then basically what happens is `std::move(v.begin() + 4, v.end(), v.begin() + 1); v.resize(7);`, i.e. every element after the erased range gets moved to the places where the erased elements were. The range not having any elements after it is a special case where this `std::move` happens to not move any elements. – fabian May 25 '23 at 07:04

1 Answer


They have about the same performance. std::remove_if is the only function that modifies the array, so any difference comes from the other calls: std::vector::resize makes a realloc to fit the new size (std::distance just computes the size, so it's negligible), while std::vector::erase just adopts the container, making it slightly faster.

As pointed out by @Peter Cordes, "It's actually guaranteed not to reallocate": std::vector::resize does not reallocate when the new size is smaller, so the difference should come from an extra move(vs, clang) that erase(vs, clang) does and resize(vs, clang) doesn't.
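The "guaranteed not to reallocate" behaviour can be checked directly: a shrinking resize leaves both the data pointer and the capacity untouched (a minimal check, separate from the benchmark):

```cpp
#include <cstddef>
#include <vector>

// Returns true if a shrinking resize() moved the buffer or changed capacity.
// Per the standard, it never does: only the size (and the destroyed tail)
// change, which is why iterators to the kept range stay valid.
bool shrinking_resize_reallocates(std::size_t n) {
    std::vector<int> v(n, 42);
    const int* data_before = v.data();
    const std::size_t cap_before = v.capacity();
    v.resize(n / 10);
    return v.data() != data_before || v.capacity() != cap_before;
}
```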

To be able to measure differences, generate_input() needs to return the same vector for all of the tests. In your implementation every call returns a new random vector, which makes it impossible to tell run-to-run variance apart from the input vectors changing.
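A minimal sketch of deterministic input generation, as used in the benchmark below (the seed and sizes here are arbitrary placeholders): seeding the PRNG with a fixed constant instead of std::random_device makes every call produce the same sequence.

```cpp
#include <algorithm>
#include <random>
#include <vector>

// Fixed seed -> identical input vector on every call, so all benchmarked
// functions see exactly the same data.
std::vector<int> generate_input_fixed() {
    std::mt19937 gen(8888);  // fixed seed instead of std::random_device
    std::uniform_int_distribution<int> dist(0, 100000);
    std::vector<int> nums(1000);
    std::generate(nums.begin(), nums.end(), [&] { return dist(gen); });
    return nums;
}
```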

With that, @463035818_is_not_a_number makes an interesting point, but for the same reason as before there is no difference between those functions. That doesn't mean it's always the case: for a struct, the penalty of a mispredicted branch is much greater, making std::remove_if a prime target for optimization; see the Windows figures below.

All of the figures are from 50 runs each; green is the range between the best run and the average, and red is the range from the average to the worst result, to show the variance from run to run.

i5-1155G7 + 3200MHz RAM on manjaro with clang-13 and -O3

1600X + 1067MHz RAM on windows 10 (22H2) with vs2022 and /O2

1600X + 1067MHz RAM on windows 10 (22H2) with clang-16 and -O3 -pthread

1600X + 1067MHz RAM on ubuntu 22 with clang-13 and -O3

With VS, it seems that there is an optimization bug (?) that makes simple_remove underperform with u32 and u64.

#include <cstdint>
#include <random>
#include <cstring>
#include <vector>
#include <tuple>

// #include <typeinfo> // For clang

#include <benchmark/benchmark.h>

#pragma warning(disable:4267)  // Conversion errors
#pragma warning(disable:4244)  // Conversion errors

constexpr int N_ELEMS   = 500;
constexpr int RAND_SEED = 8888;

template <typename T>
struct Filler {

    T * ptr = nullptr;

    int64_t padd [sizeof(T)];

    Filler() {
        ptr = new T(0);
        memset(padd, 0, sizeof(T) * 8);
    }

    Filler(const T num){
        ptr = new T(num);
        for (int64_t& e : padd){
            e = num;
        }
    }

    Filler(const Filler& other){
        ptr = new T(*other.ptr);
        memcpy(padd, other.padd, sizeof(T) * 8);
    }

    ~Filler() {
        delete ptr;
    }

    Filler& operator=(Filler const& other){
        memcpy(padd, other.padd, sizeof(T) * 8);
        *ptr = *other.ptr;
        return *this;
    }

    inline bool operator <  (const Filler& other) { return *ptr <  *other.ptr; }
    inline bool operator <= (const Filler& other) { return *ptr <= *other.ptr; }
    inline bool operator >  (const Filler& other) { return *ptr >  *other.ptr; }
    inline bool operator >= (const Filler& other) { return *ptr >= *other.ptr; }
    inline bool operator == (const Filler& other) { return *ptr == *other.ptr; }
    inline bool operator != (const Filler& other) { return *ptr != *other.ptr; }

    inline bool operator <  (const T other) { return *ptr <  other; }
    inline bool operator <= (const T other) { return *ptr <= other; }
    inline bool operator >  (const T other) { return *ptr >  other; }
    inline bool operator >= (const T other) { return *ptr >= other; }
    inline bool operator == (const T other) { return *ptr == other; }
    inline bool operator != (const T other) { return *ptr != other; }

};


static size_t THRESH;   

template <typename T>
struct Foo {


    static std::vector<T> generate_input(size_t max = 0) {
        static size_t dist_max = 0;
        static std::vector<T> nums;

        if (nums.empty() || max){

            if (max) {
                THRESH = max / 2;
                dist_max = max;
            }

            std::mt19937 gen(RAND_SEED);
            std::uniform_int_distribution<uint64_t> dist(0, dist_max);

            for (auto& n : nums = std::vector<T>(N_ELEMS)){
                n = T(dist(gen));
            }
        }
        return nums;
    }

    static void just_remove(benchmark::State &state) {
      for (auto _ : state) {

        state.PauseTiming(); 
        std::vector<T> nums = generate_input();
        state.ResumeTiming();

        std::ignore = std::remove_if(
            nums.begin(), nums.end(),
            [](T x) { return x < THRESH; }
        );

        benchmark::DoNotOptimize(nums);
      }
    }

    static void erase_remove(benchmark::State &state) {
      for (auto _ : state) {

        state.PauseTiming();
        std::vector<T> nums = generate_input();
        state.ResumeTiming();

        nums.erase(
            std::remove_if(
                nums.begin(), nums.end(),
                [](T x) { return x < THRESH; }
            ),
            nums.end()
        );

        benchmark::DoNotOptimize(nums);
      }
    }

    static void resize_remove(benchmark::State &state) {
      for (auto _ : state) {

        state.PauseTiming(); 
        std::vector<T> nums = generate_input();
        state.ResumeTiming();

        nums.resize(
          std::distance(
            nums.begin(), 
            std::remove_if(
                nums.begin(), nums.end(),
                [](T x) { return x < THRESH; }
            )
          )
        );

        benchmark::DoNotOptimize(nums);
      }
    }


    static void simple_remove(benchmark::State &state) {
      for (auto _ : state) {

        state.PauseTiming();
        std::vector<T> nums = generate_input();
        state.ResumeTiming();

        T * n = &nums.front();
        T * m = &nums.front();

        const T thresh = T(THRESH);
        const T * back = &nums.back();
        do {

            if (*m >= thresh){
                *(n++) = std::move(*m);
            }

        } while (m++ < back);

        nums.resize(n - &nums.front());

        benchmark::DoNotOptimize(nums);
      }
    }

    static void simple_remove_unroll(benchmark::State &state) {
      for (auto _ : state) {

        state.PauseTiming();
        std::vector<T> nums = generate_input();
        state.ResumeTiming();

        T * n = &nums.front();
        T * m = &nums.front();
        const T thresh = T(THRESH);
        const T * back = &nums.back();

        switch (nums.size() % 4) {
            case 3:
                if (*m >= thresh){
                    *(n++) = std::move(*(m++));
                } else {
                    m++;
                }
            case 2:
                if (*m >= thresh){
                    *(n++) = std::move(*(m++));
                } else {
                    m++;
                }
            case 1:
                if (*m >= thresh){
                    *(n++) = std::move(*(m++));
                } else {
                    m++;
                }
        }

        do {

            if (*(m + 0) >= thresh){ *(n++) = std::move(*(m + 0)); }
            if (*(m + 1) >= thresh){ *(n++) = std::move(*(m + 1)); }
            if (*(m + 2) >= thresh){ *(n++) = std::move(*(m + 2)); }
            if (*(m + 3) >= thresh){ *(n++) = std::move(*(m + 3)); }

            m += 4;

        } while (m < back);

        nums.resize(n - &nums.front());

        benchmark::DoNotOptimize(nums);
      }
    }

};

template<typename T>
void benchmark_batch(size_t max_num) {

    std::string type = typeid(T).name();
    Foo<T>::generate_input(max_num);

    benchmark::RegisterBenchmark(
        std::string("just_remove/") + type, 
        Foo<T>::just_remove
    );
    benchmark::RegisterBenchmark(
        std::string("erase_remove/") + type, 
        Foo<T>::erase_remove
    );
    benchmark::RegisterBenchmark(
        std::string("resize_remove/") + type, 
        Foo<T>::resize_remove
    );
    benchmark::RegisterBenchmark(
        std::string("simple_remove/") + type, 
        Foo<T>::simple_remove
    );
    benchmark::RegisterBenchmark(
        std::string("simple_remove_unroll/") + type, 
        Foo<T>::simple_remove_unroll
    );
}

int main(int argc, char** argv) {
 
    benchmark_batch<uint8_t>(INT8_MAX); 
    benchmark_batch<uint32_t>(INT32_MAX);
    benchmark_batch<uint64_t>(INT64_MAX);

    benchmark_batch<Filler<uint8_t>>(INT8_MAX);
    benchmark_batch<Filler<uint32_t>>(INT32_MAX);
    benchmark_batch<Filler<uint64_t>>(INT64_MAX);

    benchmark::Initialize(&argc, argv);
    benchmark::RunSpecifiedBenchmarks();
    benchmark::Shutdown();

    return 0;
}

And the code to recreate the plot.

0..49 | % { .\Release\test_cxx.exe --benchmark_min_warmup_time=0.1 --benchmark_format=csv > "./data/run_$_.csv" }
Get-ChildItem ./data | Select-Object -ExpandProperty FullName | Import-Csv | Export-Csv .\benchmark.csv -NoTypeInformation -Append

And the Python script that draws the figures:
import matplotlib.ticker as tck
import matplotlib.pyplot as plt
import csv

def read_data():
    data = {}
    test_len = 0
    with open('./build/benchmark.csv') as csvfile:
        reader  = csv.DictReader(csvfile)
        for row in reader :
            test_len += 1
            name = row['name'];
            if not name in data:
                data[name] = {
                    'min': {
                        'iterations': float(row['iterations']),
                        'real_time': float(row['real_time']),
                        'cpu_time': float(row['cpu_time']),
                    }, 
                    'max': {
                        'iterations': float(row['iterations']),
                        'real_time': float(row['real_time']),
                        'cpu_time': float(row['cpu_time']),
                    }, 
                    'avg': {
                        'iterations': float(row['iterations']),
                        'real_time': float(row['real_time']),
                        'cpu_time': float(row['cpu_time']),
                    },
                }
            else:
                for k in ['iterations', 'real_time', 'cpu_time']:
                    data[name]['avg'][k] += float(row[k])
                    if float(row[k]) < float(data[name]['min'][k]):
                        data[name]['min'][k] = float(row[k])
                    if float(row[k]) > float(data[name]['max'][k]):
                        data[name]['max'][k] = float(row[k])

        test_len /= len(data.keys())
        for k in data:
            for kk in data[k]['avg']:
                data[k]['avg'][kk] /= test_len
    return data

def plot_data(data, key):
    labels = []
    values = {
        'max': [], 'avg': [], 'min': [],
    }
    labels_struct = []
    values_struct = {
        'max': [], 'avg': [], 'min': [],
    }

    for k in list(data.keys()):
        if 'struct' in k or '6' in k:
            labels_struct.append(k.replace('/', '\n').replace('struct ', ''))
            values_struct['min'].append(data[k]['min'][key])
            values_struct['max'].append(data[k]['max'][key])
            values_struct['avg'].append(data[k]['avg'][key])
        else:
            labels.append(k.replace('/', '\n'))
            values['min'].append(data[k]['min'][key])
            values['max'].append(data[k]['max'][key])
            values['avg'].append(data[k]['avg'][key])

    return labels, values, labels_struct, values_struct

thickness = 0.8
benchmark_value = 'iterations'
colors = ['#1dad2b', '#af1c23', '#e0e0e0']

if __name__ == '__main__':

    data = read_data()
    labels, values, labels_struct, values_struct = plot_data(data, benchmark_value)

    fig = plt.figure(layout="constrained")
    spec = fig.add_gridspec(ncols=2, nrows=1)
    int_formatter = lambda x, p: format(int(x), ',')

    ax0 = fig.add_subplot(spec[0, 0])
    ax0.set_ylabel(benchmark_value)
    ax0.set_title('std::vector<T>')
    ax0.set_xticklabels(labels, rotation=90)
    ax0.get_yaxis().set_major_formatter(tck.FuncFormatter(int_formatter))

    ax1 = fig.add_subplot(spec[0, 1])
    ax1.set_ylabel(benchmark_value)
    ax1.set_title('std::vector<Filler<T>>')
    ax1.set_xticklabels(labels_struct, rotation=90)
    ax1.get_yaxis().set_major_formatter(tck.FuncFormatter(int_formatter))

    for i, (k, v) in enumerate(values.items()):
        ax0.bar(labels, v, thickness, color=colors[i])

    for i, (k, v) in enumerate(values_struct.items()):
        ax1.bar(labels_struct, v, thickness, color=colors[i])

    plt.show()
SrPanda
  • If `std::vector::resize` was actually allocating + copying, we'd expect it to be measurably slower than `erase` on the tail after left-packing. https://en.cppreference.com/w/cpp/container/vector/resize says "Vector capacity is never reduced when resizing to smaller size because that would invalidate all iterators", so it's actually guaranteed *not* to reallocate. If you want that, you have to call `.shrink_to_fit()` – Peter Cordes Jun 02 '23 at 22:40
  • (A C++ implementation with an allocator that supports a non-copying new/delete-compatible version of `realloc` to try to give back the later part of an allocation could do that extra work in a `vector::resize()`, since the actual requirement just comes from not invalidating iterators. But I think existing implementations don't, unfortunately. C++ new/delete is a garbage API that doesn't expose useful operations for large buffers, like trying to extend an existing allocation, or realloc that could wrap Linux `mremap`.) – Peter Cordes Jun 02 '23 at 22:46
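Peter Cordes's point about `.shrink_to_fit()` can be sketched (a minimal illustration; note that shrink_to_fit is a non-binding request, though the major library implementations honor it):

```cpp
#include <cstddef>
#include <vector>

// resize() to a smaller size keeps the old capacity; shrink_to_fit() is the
// explicit (non-binding) request to release the excess memory.
std::size_t capacity_after_shrink(std::size_t n) {
    std::vector<int> v(n, 1);
    v.resize(n / 10);       // capacity is still at least n here
    v.shrink_to_fit();      // may reallocate to a smaller buffer
    return v.capacity();
}
```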
  • You are right, I assumed that [_Destroy_range](https://github.com/microsoft/STL/blob/main/stl/inc/vector#L1561) was using the allocator to change the vector; incidentally I checked the number of constructor and destructor calls and I also didn't realize that `std::remove_if` does not call the destructor on remove; but then what makes that difference, and why is `simple_remove` that much worse? – SrPanda Jun 03 '23 at 11:04
  • Perhaps `std::remove_if` can compile to branchless code, but `simple_remove` doesn't? But that would probably be an even bigger difference. I don't know if Zen 1 has any performance quirks that could explain it. Maybe just luck of code alignment and which branches alias each other in the branch predictors in that loop vs. other loops? – Peter Cordes Jun 03 '23 at 14:01
  • Seeing a `mov` get blamed for a lot of CPU time doesn't tell you much about why there was extra time; with many instructions in flight at once, just one has to get the blame when a perf counter for "cycles" overflows. Often the insn waiting for a slow result gets the blame, at least on Intel but IDK about AMD. And there can be "skids", where the instruction getting the blame is a couple instructions later than the one that was the oldest in the ROB when the event fired. – Peter Cordes Jun 03 '23 at 14:03
  • That would make sense (being a Ryzen 1 thing), but running the u64 test in [quick-bench](https://quick-bench.com/q/8cUnVxlHYqzU7wkAkMurSiVs3d0) shows a similar result, and even then, why is it so much faster with structs? – SrPanda Jun 03 '23 at 14:51
  • What compiler did you use, with what options, on your own machine? On quick-bench with the clang-13 build you linked, I re-ran the benchmark since "record disassembly" wasn't enabled, and `erase_remove` came out slower than `simple_remove`. (But `just_remove` and `resize_remove` were both fast.) Hrm, am I going blind, or did quick-bench remove the button or tab to see the asm? It's not there even after re-running with "record disassembly" and "clear cached result" checked. Oh, maybe that depended on Godbolt, and the godbolt button on quick-bench gives an error about too much load. – Peter Cordes Jun 03 '23 at 14:59
  • I used cmake with vs2022 and just /O2. I did leave the record disassembly on, so I don't know. It did produce the same result for me; is it running the code locally? – SrPanda Jun 03 '23 at 15:09
  • quick-bench compiles with the compiler + options you select in the drop-down (e.g. clang-13 `-O3`), and runs on their AWS instances of unknown CPU, but probably Skylake-X, Ice Lake, or Zen 3 server CPUs. It measures speed as a ratio vs. the speed of an empty loop to try to account for the unknowns in the CPU and in the load it's competing with, so the measurements can be somewhat noisy; there might be some signal as well as noise, though, since `simple_remove` came out a bit slower for both our runs, and your desktop. Check the "about quick-bench" in the "more" dropdown at the top right. – Peter Cordes Jun 03 '23 at 15:13
  • Probably quick-bench's "record disassembly" is temporarily not working. Anyway, your local results from MSVC will be different asm than clang compiling for Linux, but https://godbolt.org/ has MSVC installed so we can probably check on that. Given the noisy results from quick-bench and your own result all showing `simple_remove` a bit slower, maybe there's some reason that compilers can't or don't optimize it as well, or maybe it's just bad luck (of alignment or some microarchitectural effect) striking twice. – Peter Cordes Jun 03 '23 at 15:16
  • It was a VS problem. I ended up borrowing a laptop just to check whether Zen 1 was also causing issues; I used the same executable for both of the Linux tests (compiled on the Intel). But the mystery remains of why `simple_remove` is so much faster with structs on Windows. – SrPanda Jun 05 '23 at 17:03
  • MSVC is well known to have missed-optimizations more often than Clang or GCC. https://www.agner.org/forum/viewtopic.php?f=1&t=88&sid=d5bd9e501064b55a4b378ab3bb757127 - Agner mentions this, along with the fact that better-optimizing closed-source compilers like ICC have also fallen behind Clang and GCC on average, and Intel is moving to an LLVM-based compiler. But I think MSVC has historically made slower code than GCC in many cases, although they have had some improvements over the years to get less far behind. Anyway, unexpectedly slow code for some specific case from MSVC sounds normal. – Peter Cordes Jun 05 '23 at 17:23