92

I'm wondering about std::variant performance. When should I not use it? It seems like virtual functions are still much better than using std::visit which surprised me!

In "A Tour of C++" Bjarne Stroustrup says this about pattern checking after explaining std::holds_alternatives and the overloaded methods:

This is basically equivalent to a virtual function call, but potentially faster. As with all claims of performance, this ‘‘potentially faster’’ should be verified by measurements when performance is critical. For most uses, the difference in performance is insignificant.

I've benchmark some methods that came in my mind and these are the results:

benchmark http://quick-bench.com/N35RRw_IFO74ZihFbtMu4BIKCJg

You'll get a different result if you turn on the optimization:

benchmark with op enabled

http://quick-bench.com/p6KIUtRxZdHJeiFiGI8gjbOumoc

Here's the code I've used for benchmarks; I'm sure there's better way to implement and use variants for using them instead of virtual keywords (inheritance vs. std::variant):

removed the old code; look at the updates

Can anyone explain what is the best way to implement this use case for std::variant that got me to testing and benchmarking:

I'm currently implementing RFC 3986 which is 'URI' and for my use case this class will be used more as a const and probably won't be changed a lot and it's more likely for the user to use this class to find each specific portion of the URI rather than making a URI; so it made sense to make use of std::string_view and not separating each segment of the URI in its own std::string. The problem was I needed to implement two classes for it; one for when I only need a const version; and another one for when the user wants to create the URI rather than providing one and searching through it.

So I used a template to fix that which had its own problems; but then I realized I could use std::variant<std::string, std::string_view> (or maybe std::variant<CustomStructHoldingAllThePieces, std::string_view>); so I started researching to see if it actually helps to use variants or not. From these results, it seems like using inheritance and virtual is my best bet if I don't want to implement two different const_uri and uri classes.

What do you think should I do?


Update (2)

Thanks for @gan_ for mentioning and fixing the hoisting problem in my benchmark code. benchmark http://quick-bench.com/Mcclomh03nu8nDCgT3T302xKnXY

I was surprised with the result of try-catch hell but thanks to this comment that makes sense now.

Update (3)

I removed the try-catch method as it was really bad; and also randomly changed the selected value and by the looks of it, I see more realistic benchmark. It seems like virtual is not the correct answer after all. random access http://quick-bench.com/o92Yrt0tmqTdcvufmIpu_fIfHt0

http://quick-bench.com/FFbe3bsIpdFsmgKfm94xGNFKVKs (without the memory leak lol)

Update (4)

I removed the overhead of generating random numbers (I've already did that in the last update but it seems like I had grabbed the wrong URL for benchmark) and added an EmptyRandom for understanding the baseline of generating random numbers. And also made some small changes in Virtual but I don't think it affected anything. empty random added http://quick-bench.com/EmhM-S-xoA0LABYK6yrMyBb8UeI

http://quick-bench.com/5hBZprSRIRGuDaBZ_wj0cOwnNhw (removed the Virtual so you could compare the rest of them better)


Update (5)

as Jorge Bellon said in the comments, I wasn't thinking about the cost of allocation; so I converted every benchmark to use pointers. This indirection has an impact on performance of course but it's more fair now. So right now there's no allocation in the loops.

Here's the code:

removed the old code; look at the updates

I ran some benchmarks so far. It seems like g++ does a better job of optimizing the code:

-------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
-------------------------------------------------------------------
EmptyRandom                   0.756 ns        0.748 ns    746067433
TradeSpaceForPerformance       2.87 ns         2.86 ns    243756914
Virtual                        12.5 ns         12.4 ns     60757698
Index                          7.85 ns         7.81 ns     99243512
GetIf                          8.20 ns         8.18 ns     92393200
HoldsAlternative               7.08 ns         7.07 ns     96959764
ConstexprVisitor               11.3 ns         11.2 ns     60152725
StructVisitor                  10.7 ns         10.6 ns     60254088
Overload                       10.3 ns         10.3 ns     58591608

And for clang:

-------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
-------------------------------------------------------------------
EmptyRandom                    1.99 ns         1.99 ns    310094223
TradeSpaceForPerformance       8.82 ns         8.79 ns     87695977
Virtual                        12.9 ns         12.8 ns     51913962
Index                          13.9 ns         13.8 ns     52987698
GetIf                          15.1 ns         15.0 ns     48578587
HoldsAlternative               13.1 ns         13.1 ns     51711783
ConstexprVisitor               13.8 ns         13.8 ns     49120024
StructVisitor                  14.5 ns         14.5 ns     52679532
Overload                       17.1 ns         17.1 ns     42553366

Right now, for clang, it's better to use virtual inheritance but for g++ it's better to use holds_alternative or get_if but in overall, std::visit seems to be not a good choice for almost all of my benchmarks so far.

I'm thinking it'll be a good idea if pattern matching (switch statements capable of checking more stuff than just integer literals) would be added to the c++, we would be writing cleaner and more maintainable code.

I'm wondering about the package.index() results. Shouldn't it be faster? what does it do?

Clang version: http://quick-bench.com/cl0HFmUes2GCSE1w04qt4Rqj6aI

The version that uses One one instead of auto one = new One based on Maxim Egorushkin's comment: http://quick-bench.com/KAeT00__i2zbmpmUHDutAfiD6-Q (not changing the outcome much)


Update (6)

I made some changes and the results are very different from compiler to compiler now. But it seems like std::get_if and std::holds_alternatives are the best solutions. virtual seems to work best for unknown reasons with clang now. That really surprises me there because I remember virtual being better in gcc. And also std::visit is totally out of competition; in this last benchmark it's even worse than vtable lookup.

Here's the benchmark (run it with GCC/Clang and also with libstdc++ and libc++):

http://quick-bench.com/LhdP-9y6CqwGxB-WtDlbG27o_5Y

#include <benchmark/benchmark.h>

#include <array>
#include <variant>
#include <random>
#include <functional>
#include <algorithm>

using namespace std;

struct One {
  auto get () const { return 1; }
 };
struct Two {
  auto get() const { return 2; }
 };
struct Three { 
  auto get() const { return 3; }
};
struct Four {
  auto get() const { return 4; }
 };

template<class... Ts> struct overload : Ts... { using Ts::operator()...; };
template<class... Ts> overload(Ts...) -> overload<Ts...>;


std::random_device dev;
std::mt19937 rng(dev());
std::uniform_int_distribution<std::mt19937::result_type> random_pick(0,3); // distribution in range [1, 6]

template <std::size_t N>
std::array<int, N> get_random_array() {
  std::array<int, N> item;
  for (int i = 0 ; i < N; i++)
    item[i] = random_pick(rng);
  return item;
}

template <typename T, std::size_t N>
std::array<T, N> get_random_objects(std::function<T(decltype(random_pick(rng)))> func) {
    std::array<T, N> a;
    std::generate(a.begin(), a.end(), [&] {
        return func(random_pick(rng));
    });
    return a;
}


static void TradeSpaceForPerformance(benchmark::State& state) {
    One one;
    Two two;
    Three three;
    Four four;

  int index = 0;

  auto ran_arr = get_random_array<50>();
  int r = 0;

  auto pick_randomly = [&] () {
    index = ran_arr[r++ % ran_arr.size()];
  };

  pick_randomly();


  for (auto _ : state) {

    int res;
    switch (index) {
      case 0:
        res = one.get();
        break;
      case 1:
        res = two.get();
        break;
      case 2:
        res = three.get();
        break;
      case 3:
        res = four.get();
        break;
    }
    
    benchmark::DoNotOptimize(index);
    benchmark::DoNotOptimize(res);

    pick_randomly();
  }


}
// Register the function as a benchmark
BENCHMARK(TradeSpaceForPerformance);


static void Virtual(benchmark::State& state) {

  struct Base {
    virtual int get() const noexcept = 0;
    virtual ~Base() {}
  };

  struct A final: public Base {
    int get()  const noexcept override { return 1; }
  };

  struct B final : public Base {
    int get() const noexcept override { return 2; }
  };

  struct C final : public Base {
    int get() const noexcept override { return 3; }
  };

  struct D final : public Base {
    int get() const noexcept override { return 4; }
  };

  Base* package = nullptr;
  int r = 0;
  auto packages = get_random_objects<Base*, 50>([&] (auto r) -> Base* {
          switch(r) {
              case 0: return new A;
              case 1: return new B;
              case 3: return new C;
              case 4: return new D;
              default: return new C;
          }
    });

  auto pick_randomly = [&] () {
    package = packages[r++ % packages.size()];
  };

  pick_randomly();

  for (auto _ : state) {

    int res = package->get();

    benchmark::DoNotOptimize(package);
    benchmark::DoNotOptimize(res);

    pick_randomly();
  }


  for (auto &i : packages)
    delete i;

}
BENCHMARK(Virtual);




static void FunctionPointerList(benchmark::State& state) {

    One one;
    Two two;
    Three three;
    Four four;
  using type = std::function<int()>;
  std::size_t index;

  auto packages = get_random_objects<type, 50>([&] (auto r) -> type {
        switch(r) {
        case 0: return std::bind(&One::get, one);
        case 1: return std::bind(&Two::get, two);
        case 2: return std::bind(&Three::get, three);
        case 3: return std::bind(&Four::get, four);
        default: return std::bind(&Three::get, three);
        }
    });
  int r = 0;

  auto pick_randomly = [&] () {
    index = r++ % packages.size();
  };


  pick_randomly();

  for (auto _ : state) {

    int res = packages[index]();

    benchmark::DoNotOptimize(index);
    benchmark::DoNotOptimize(res);

    pick_randomly();
  }

}
BENCHMARK(FunctionPointerList);



static void Index(benchmark::State& state) {

    One one;
    Two two;
    Three three;
    Four four;
  using type = std::variant<One, Two, Three, Four>;
  type* package = nullptr;

  auto packages = get_random_objects<type, 50>([&] (auto r) -> type {
        switch(r) {
            case 0: return one;
            case 1: return two;
            case 2: return three;
            case 3: return four;
            default: return three;
        }
    });
  int r = 0;

  auto pick_randomly = [&] () {
    package = &packages[r++ % packages.size()];
  };


  pick_randomly();

  for (auto _ : state) {

    int res;
    switch (package->index()) {
      case 0: 
        res = std::get<One>(*package).get();
        break;
      case 1:
        res = std::get<Two>(*package).get();
        break;
      case 2:
        res = std::get<Three>(*package).get();
        break;
      case 3:
        res = std::get<Four>(*package).get();
        break;
    }

    benchmark::DoNotOptimize(package);
    benchmark::DoNotOptimize(res);

    pick_randomly();
  }

}
BENCHMARK(Index);



static void GetIf(benchmark::State& state) {
    One one;
    Two two;
    Three three;
    Four four;
  using type = std::variant<One, Two, Three, Four>;
  type* package = nullptr;

  auto packages = get_random_objects<type, 50>([&] (auto r) -> type {
        switch(r) {
            case 0: return one;
            case 1: return two;
            case 2: return three;
            case 3: return four;
            default: return three;
        }
    });
  int r = 0;

  auto pick_randomly = [&] () {
    package = &packages[r++ % packages.size()];
  };

  pick_randomly();

  for (auto _ : state) {

    int res;
    if (auto item = std::get_if<One>(package)) {
      res = item->get();
    } else if (auto item = std::get_if<Two>(package)) {
      res = item->get();
    } else if (auto item = std::get_if<Three>(package)) {
      res = item->get();
    } else if (auto item = std::get_if<Four>(package)) {
      res = item->get();
    }

    benchmark::DoNotOptimize(package);
    benchmark::DoNotOptimize(res);

    pick_randomly();
  }
  

}
BENCHMARK(GetIf);

static void HoldsAlternative(benchmark::State& state) {
    One one;
    Two two;
    Three three;
    Four four;
  using type = std::variant<One, Two, Three, Four>;
  type* package = nullptr;

  auto packages = get_random_objects<type, 50>([&] (auto r) -> type {
        switch(r) {
            case 0: return one;
            case 1: return two;
            case 2: return three;
            case 3: return four;
            default: return three;
        }
    });
  int r = 0;

  auto pick_randomly = [&] () {
    package = &packages[r++ % packages.size()];
  };

  pick_randomly();

  for (auto _ : state) {

    int res;
    if (std::holds_alternative<One>(*package)) {
      res = std::get<One>(*package).get();
    } else if (std::holds_alternative<Two>(*package)) {
      res = std::get<Two>(*package).get();
    } else if (std::holds_alternative<Three>(*package)) {
      res = std::get<Three>(*package).get();
    } else if (std::holds_alternative<Four>(*package)) {
      res = std::get<Four>(*package).get();
    }

    benchmark::DoNotOptimize(package);
    benchmark::DoNotOptimize(res);

    pick_randomly();
  }

}
BENCHMARK(HoldsAlternative);


static void ConstexprVisitor(benchmark::State& state) {

    One one;
    Two two;
    Three three;
    Four four;
  using type = std::variant<One, Two, Three, Four>;
  type* package = nullptr;

  auto packages = get_random_objects<type, 50>([&] (auto r) -> type {
        switch(r) {
            case 0: return one;
            case 1: return two;
            case 2: return three;
            case 3: return four;
            default: return three;
        }
    });
  int r = 0;

  auto pick_randomly = [&] () {
    package = &packages[r++ % packages.size()];
  };

  pick_randomly();

  auto func = [] (auto const& ref) {
        using type = std::decay_t<decltype(ref)>;
        if constexpr (std::is_same<type, One>::value) {
            return ref.get();
        } else if constexpr (std::is_same<type, Two>::value) {
            return ref.get();
        } else if constexpr (std::is_same<type, Three>::value)  {
          return ref.get();
        } else if constexpr (std::is_same<type, Four>::value) {
            return ref.get();
        } else {
          return 0;
        }
    };

  for (auto _ : state) {

    auto res = std::visit(func, *package);

    benchmark::DoNotOptimize(package);
    benchmark::DoNotOptimize(res);

    pick_randomly();
  }

}
BENCHMARK(ConstexprVisitor);

static void StructVisitor(benchmark::State& state) {

  

  struct VisitPackage
  {
      auto operator()(One const& r) { return r.get(); }
      auto operator()(Two const& r) { return r.get(); }
      auto operator()(Three const& r) { return r.get(); }
      auto operator()(Four const& r) { return r.get(); }
  };

    One one;
    Two two;
    Three three;
    Four four;
  using type = std::variant<One, Two, Three, Four>;
  type* package = nullptr;

  auto packages = get_random_objects<type, 50>([&] (auto r) -> type {
        switch(r) {
            case 0: return one;
            case 1: return two;
            case 2: return three;
            case 3: return four;
            default: return three;
        }
    });
  int r = 0;

  auto pick_randomly = [&] () {
    package = &packages[r++ % packages.size()];
  };

  pick_randomly();

  auto vs = VisitPackage();

  for (auto _ : state) {

    auto res = std::visit(vs, *package);

    benchmark::DoNotOptimize(package);
    benchmark::DoNotOptimize(res);

    pick_randomly();
  }

}
BENCHMARK(StructVisitor);


static void Overload(benchmark::State& state) {


    One one;
    Two two;
    Three three;
    Four four;
  using type = std::variant<One, Two, Three, Four>;
  type* package = nullptr;

  auto packages = get_random_objects<type, 50>([&] (auto r) -> type {
        switch(r) {
            case 0: return one;
            case 1: return two;
            case 2: return three;
            case 3: return four;
            default: return three;
        }
    });
  int r = 0;

  auto pick_randomly = [&] () {
    package = &packages[r++ % packages.size()];
  };

  pick_randomly();

  auto ov = overload {
      [] (One const& r) { return r.get(); },
      [] (Two const& r) { return r.get(); },
      [] (Three const& r) { return r.get(); },
      [] (Four const& r) { return r.get(); }
    };

  for (auto _ : state) {

    auto res = std::visit(ov, *package);

  
    benchmark::DoNotOptimize(package);
    benchmark::DoNotOptimize(res);

    pick_randomly();
  }

}
BENCHMARK(Overload);


// BENCHMARK_MAIN();

Results for GCC compiler:

-------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
-------------------------------------------------------------------
TradeSpaceForPerformance       3.71 ns         3.61 ns    170515835
Virtual                       12.20 ns        12.10 ns     55911685
FunctionPointerList           13.00 ns        12.90 ns     50763964
Index                          7.40 ns         7.38 ns    136228156
GetIf                          4.04 ns         4.02 ns    205214632
HoldsAlternative               3.74 ns         3.73 ns    200278724
ConstexprVisitor              12.50 ns        12.40 ns     56373704
StructVisitor                 12.00 ns        12.00 ns     60866510
Overload                      13.20 ns        13.20 ns     56128558

Results for clang compiler (which I'm surprised by it):

-------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
-------------------------------------------------------------------
TradeSpaceForPerformance       8.07 ns         7.99 ns     77530258
Virtual                        7.80 ns         7.77 ns     77301370
FunctionPointerList            12.1 ns         12.1 ns     56363372
Index                          11.1 ns         11.1 ns     69582297
GetIf                          10.4 ns         10.4 ns     80923874
HoldsAlternative               9.98 ns         9.96 ns     71313572
ConstexprVisitor               11.4 ns         11.3 ns     63267967
StructVisitor                  10.8 ns         10.7 ns     65477522
Overload                       11.4 ns         11.4 ns     64880956

Best benchmark so far (will be updated): http://quick-bench.com/LhdP-9y6CqwGxB-WtDlbG27o_5Y (also check out the GCC)

Community
  • 1
  • 1
The Moisrex
  • 1,857
  • 1
  • 14
  • 16
  • 53
    "You'll get a different result if you turn on the optimization": measuring performance in a non-optimized build is pretty much pointless. – bolov Aug 30 '19 at 12:06
  • 9
    I am tempted to vote to close as "opinion based", but I will refrain because of the amount of work put into the question. Hope you get a good answer. – bolov Aug 30 '19 at 12:08
  • @bolov yes, that's why I mentioned it; I couldn't get the compiler not to optimize the part that mattered. it seems like they're using other techniques that are not relative to what I intended to find out. – The Moisrex Aug 30 '19 at 12:09
  • 2
    In 5 of your bench-marked implementations the compiler is able to hoist the visit outside of the loop. [You need to use `DoNotOptimize` on the variant itself](http://quick-bench.com/Mcclomh03nu8nDCgT3T302xKnXY) if you want to bench the unboxing. – gan_ Aug 30 '19 at 12:25
  • @gan_ thanks; I've updated the question to include your version as well. – The Moisrex Aug 30 '19 at 12:32
  • 6
    Your `Virtual` benchmark may not actually use dynamic dispatch as the compiler may be able to trace the call graph and devirtualize the call to `get`. A nice explanation is given in [this answer](https://stackoverflow.com/a/36898879/2788450). With gcc you can provide the flag `-fno-devirtualize` to prevent this specific optimization while still not giving up on optimizations entirely. – Jonas Greitemann Aug 30 '19 at 12:32
  • @Jonas yes. I saw the differences while messing with the settings but I saw that clang actually doesn't do that. I saw the result of gcc doing that and its benchmark was the same as Noop in the graphs. – The Moisrex Aug 30 '19 at 12:35
  • 3
    @moisrex One issue in your benchmark code is that you always use the first alternative (you store a `One`). The try/catch alternative is basically made to be very fast when the first alternative is chosen. If I change your code to hold the third alternative (`Three`), the try/catch becomes thousands time slower than any other versions. http://quick-bench.com/yIDclCuxfzjzBlCjbVCDNrp13aU – Holt Aug 30 '19 at 12:52
  • @Holt I knew there was a problem with that one; thanks for point it out; Updated the question. – The Moisrex Aug 30 '19 at 13:05
  • 1
    @moisrex: "*it seems like using inheritance and virtual is my best bet*" If you like heap allocations. `variant` is *explicitly disallowed* from allocating memory. Your benchmark doesn't test the cost of creating such objects. And of course, no benchmark can test the cost of having to track the ownership of such heap-allocated resources. – Nicol Bolas Aug 30 '19 at 15:54
  • @NicolBolas yes, that's before I reach to the update 4's benchmark. I see that now. I also should benchmark `std::in_place` and related `emplace` for completeness. – The Moisrex Aug 30 '19 at 16:25
  • Individual parts of an URI are short strings. Why don't you rely on short string optimization and be done with it? – Michaël Roy Sep 01 '19 at 23:07
  • @MichaëlRoy SSO is usually about 16 or 20 chars and honestly, if you just look at the URL in this very page, it's longer than that. I hardly can put IPv4 address in it, let alone IPv6 or a full path that we all know how creative developers can be in making long slugs – The Moisrex Sep 01 '19 at 23:35
  • I understand. I personally parse IETF strings using the BNF definitions and using standalone constexpr functions. string_view is used throughout. Storage is provided by keeping a shared_ptr of the source packet in the class that parses my packets. Parsing a isolated URL (from config, usually) doesn't happen often in my phone recording applications, I make sure to use the parsed result immediately in these cases. The use of constexpr BNF subparsers takes care of const inputs, I have to admit haven't had a use case for parsing a const input URL in the last 6 years. – Michaël Roy Sep 02 '19 at 00:15
  • 1
    Your _best benchmark so far_ comparison is unfair. Your results are biased because you put too much pressure on the allocator which is why you see it run slower (you are allocating in the heap plus doing the virtual function call). If you move `new` and `delete` out of the loop, the results are a bit more realistic. See http://quick-bench.com/aHzAIn3xRZfiTSWbuLu6ZhoyRoo – Jorge Bellon Sep 16 '19 at 10:16
  • 1
    You should use pointers without memory allocations. E.g. replace `auto one = new A;` with `A one;` and just take a pointer with `&one`. – Maxim Egorushkin Sep 16 '19 at 14:33
  • @MaximEgorushkin thanks for suggesting, but not seeing much difference in the overall result. – The Moisrex Sep 16 '19 at 14:51
  • It would be nice to specify in your question which version of which compiler you are using for your benchmarks. I think the implementation of `std::visit` in libstdc++ changed not that long ago. And clang could be using libc++ or libstdc++. – Marc Glisse Sep 16 '19 at 15:03
  • @MarcGlisse actually quick-bench has that and you can play around with it. I've tested with clang and gcc and so far gcc does a better job of optimizing it. for example gcc 8.3 is confusing me here: http://quick-bench.com/CjJbM_EjviATU4h8EvsqBnUeXWA – The Moisrex Sep 16 '19 at 15:14
  • It seems like you may be making the problem harder than it needs to be. Have you considered embedding a vector in the URI class? It'll add a few bytes to your base class in the constant form, but would let you do things like: `class URI { std::string_view scheme_; std::vector pieces_; void set_scheme(std::string scheme) { scheme_ = pieces_.emplace_back(std::move(scheme)); scheme_ = pieces_.back(); } };` All of your get functions and reassembly functions would end up being the same. Biggest downside is someone constantly resetting a field and constantly appending strings. – Charlie Sep 17 '19 at 06:54
  • As far as I understood it, the use case for `std::variant` is to provide access to distinct classes which do not share a common interface. For example: `std::variant`. Why are you trying to replace relatively fast `virtual` calls with `std::variant`? Only because it could be faster? – jan.sende Sep 17 '19 at 19:56
  • @Charlie that's what I thought I should do first, but I saw that `std::variant` is designed to solve this very issue. Why not use it when it's the right thing to do? – The Moisrex Sep 17 '19 at 20:12
  • @jan.sende that's the point of this question! is `virtual` faster than using `std::variant`? so far I've gotten different results which don't imply nor denies the performance benefits of using each methods. – The Moisrex Sep 17 '19 at 20:16
  • @moisrex Oh! So it more theoretical, than useful :P By the way, I played around with your code. Your usage of `std::variant` is very weird! I think you are at the boundaries of the implementation there. A `std::variant` of pointers is not how it's supposed to be used... but have fun! – jan.sende Sep 17 '19 at 20:49
  • @jan.sende I'm not using it like pointers in my own code, I just did that here so it would be more fair benchmark. – The Moisrex Sep 17 '19 at 21:01
  • I don't understand your implementation of `Virtual`. Why do you use a switch? The whole point of virtual functions is that you don't need to `switch`. I modified your benchmark, and now `Virtual` is as fast as `EmptyRandom` (http://quick-bench.com/1Ty62gg2lpeeaTmx_sHLC6sLfjs). Furthermore, your benchmark is a little bit flawed, as it has a `%` operation. It distorts the results, as it adds a constant calculation to every benchmark. You should use `&` instead, it is faster. Or use a power-of-2 random array size. – geza Sep 17 '19 at 21:23
  • And this whole thing depends on a lot of parameters. Is the order completely random? You use a 50 long sequence, which current CPU's can learn, so they can utilize perfect branch prediction. So, for short sequences, actual branches win. But, for longer sequences, where CPU cannot learn the sequence, jump tables may win. This thing depends on a lot of factors, so I don't think that this question can be answered. – geza Sep 17 '19 at 21:28
  • TradeSpaceForPerformance likely is optimising out of being relevant. I've added a RawRetrieve for comparison with TradeSpaceForPerformance. The result of Get is certainly determined at compile time and its just measuring the time to save to res and pick_randomly(). http://quick-bench.com/6X9sIgsK_QvEwS95UQ50asECxMk – David Ledger Sep 18 '19 at 12:24
  • @geza What you did in your benchmark is good but not possible for others. You've optimized the pick_randomly function for Virtual but not for the other ones. The reason I wrote the `EmptyRandom` was so I could know how much that function is actually is taking. Your way is just not fair to the others. I might do the same thing for the other ones as well and then the results would be more fair. – The Moisrex Sep 18 '19 at 20:00
  • @moisrex: Maybe I just don't understand your intents. You should have included that `switch` in `EmptyRandom` as well, if you just want to measure the cost of virtual dispatch. For virtual functions to work, a `switch` doesn't needed. That's the whole point of virtual functions. – geza Sep 18 '19 at 20:18
  • @moisrex: btw, a potential fastest possible is not on your list: function pointer table. At least, in theory, it should be faster than virtual functions, because virtual functions need an additional indirection compared to function pointers. – geza Sep 18 '19 at 20:22
  • @geza I updated the code; added the function pointer list but I did it with `std::function` instead of c-style version. I was sure it wouldn't be faster even before I made it. But of course c-style version would beat the other ones but that is not the point of this benchmark. – The Moisrex Sep 18 '19 at 21:53
  • @moisrex: of course, `std::function` is a different thing than a function pointer. I'm glad that you mentioned the point of the benchmark. What is the point? I don't really get it. It seems that you intentionally make versions worse (like adding unneeded `switch` to the virtual version), and you don't measure a possible good option. Btw., you should check out the resulting assembly. Compilers are pretty smart nowadays, maybe they optimize several of your versions to a point, where you don't measure what you wanted to measure. – geza Sep 18 '19 at 22:12
  • @geza my point is to see if it is better to use `std::variant` over virtual inheritance and if it is, then which way of accessing `std::variant` is the best. My initial intention was about making a URI class that holds `std::string_view` and when the user needed, parses the string and makes the struct so we wouldn't need to parse something that user doesn't need and most certainly uses less memory. I've updated the question since you last checked the code. – The Moisrex Sep 18 '19 at 22:23
  • I suppose you meant virtual functions (virtual inheritance is a different thing). Anyways, I cannot answer your question. So if nobody answers it, I recommend you to put a more clear, simpler question, and delete this one. This question became very confusing because of the lot of edits (it is not surprising that nobody has answered yet, even the bounty, and a lot of upvotes... For me, this is definitely "Unclear what you're asking") – geza Sep 18 '19 at 22:40
  • Personally, I would avoid `std::variant` where performance is concerned. Having to wrap code in `try`/`catch` adds a performance penalty - especially when you know that `catch` is likely to be called. Then again, I prefer old fashioned `C`, so I might be biased here (I would return NULL / pointer for URI data). – Myst Sep 20 '19 at 17:04
  • @Myst In the benchmarks above I showed many ways to use `std::variant` and `try-cache-hell` was just one of them that I had to remove from benchmarks so I could compare the charts more clearly. Read the comments of this question. – The Moisrex Sep 20 '19 at 18:06
  • It's worth noting that `std::visit`'s implementation is not the best (libstdc++ and libc++, I think the Microsoft STL uses the optimizations mentioned). See Michael Park's blog post: https://mpark.github.io/programming/2019/01/22/variant-visitation-v2/ . I know that there's an implementation kicking around somewhere too (I think Michael did it along with someone else). – Justin Sep 21 '19 at 01:51
  • I think it's interesting to compare performance of variant/virtual, but if performance is this much of a concern, my gut feeling is that you shouldn't use either. When you visit a variant, that is, in essence, a form of dynamic binding. Consider that fast STL implementations will probably avoid dynamic binding. Why did you abandon the templated version? Is there a benchmark that might compare the performance of your templated version? – Humphrey Winnebago Sep 23 '19 at 06:25
  • @HumphreyWinnebago I think `std::optional`, `std::any`, and `std::variant` in c++17 are good opportunities to introduce new programming techniques and deprecate or at least use them when they make more sense to use and I'm looking for these kinda techniques so I could migrate from using `virtual` keyword to a more constant evaluated, faster, templated techniques. And finding out the performance benefits of each is essential to understand which one should be used where. – The Moisrex Sep 23 '19 at 12:56
  • 1
    The performance between the two is similar. Isn't it a distracting / misleading / irrelevant comparison? You probably won't see much difference compared to other optimizations you could make. std::visit isn't constant evaluated. It is on par with virtual aside from the exact details of implementation and use. virtual is used for expressiveness, not performance. std::variant too. Compare use cases instead. Do you want to write a library that can be extended without having to be recompiled? Use virtual. Do you want to avoid pointers? Have unrelated types? Want a union? Use variant. – Humphrey Winnebago Sep 23 '19 at 21:32
  • @moisrex: May I humbly suggest my `AndyG::heterogeneous_container` that works like a visitor pattern, but groups the data by type into vectors. My own benchmarks had it much faster than `std::any` at least, and I imagine it may be faster than a variant as well. https://gieseanw.wordpress.com/2017/05/03/a-true-heterogeneous-container-in-c/ – AndyG Sep 24 '19 at 13:39
  • I went ahead and ran the benchmarks myself. You can find them here: http://quick-bench.com/sDuqimGAeZmuy2KhhEpd0my8_d0, (I couldn't run all of them at once or else it times out). The heterogeneous container doesn't quite support individual item visitation; only visiting the entire collection (which is primarily what happens most often when one wishes to visit). As you can see in the results, it runs about 4x faster than every other approach. (Again, when you wish to visit ALL elements). – AndyG Sep 24 '19 at 17:20
  • @AndyG Your benchmark is being optimized out somehow, I changed the return value of `get` methods and now your way is not so much better than the others. I wonder if it has an impact if you use function pointers instead of `std::function`. http://quick-bench.com/bgrIxAbc-Wd9G1UghwKSzf5UunQ – The Moisrex Sep 24 '19 at 18:04
  • Indeed it does seem to be more on par with the other approaches w/ the changed return ("TradeSpaceForPerformance" is kind of a moot benchmark in this approach because the result is deterministic w/ the index). – AndyG Sep 24 '19 at 19:41
  • You are qualifying with `std::` anyway, so why not remove `using namespace std;` ? – L. F. Sep 29 '19 at 06:05
  • @L.F. it's a habit; plus, it absolutely has no performance benefits nor it makes the code easier to understand. – The Moisrex Sep 29 '19 at 13:10
  • @moisrex Well, no. It does active harm - it causes name clashes and triggers ADL edge cases. It is probably beneficial for you to check out [Why is “using namespace std;” considered bad practice?](https://stackoverflow.com/q/1452721) before (potentially) going against the C++ community. – L. F. Sep 29 '19 at 13:29
  • @L.F. I am aware of that. I don't use `using namespace std` in my codes, just in benchmarks and test codes that I just want to test something. It's not important here. There are way more important issues in the benchmark above than using `using` statement. – The Moisrex Sep 29 '19 at 13:32
  • @moisrex OK then, it's fine in small programs as long as you know the consequence. (But even then, you are explicitly qualifying names, so ...) – L. F. Sep 29 '19 at 13:33
  • 1
    Quick comment on this, this issue seems to still be true with clang 14.0, c++20 and -O2/O3. On GCC 12.2, though, the performance for the overload and constexpr visitor are actually the fastest ones. I came up with [a simple implementation](https://quick-bench.com/q/ahvxYj2RX2FjG7_pk7XTDozFNdA) using `std::get_if` that still allows for the visitor syntax for cases where you don't care about handling all the types or returning a value from visit that is faster for clang too, just for reference. – Euller Borges Oct 28 '22 at 16:21

4 Answers4

20

std::visit seems to lack some optimizations yet on some implementations. That being said there's a central point thats not very well seen in this lab-like setup - which is that based design is stack based vs. the virtual pattern which will naturally gravitate towards being heap based. In a real world scenario this means the memory layout could very well be fragmented (perhaps over time - once objects leave the cache, etc.) - unless it can somehow be avoided. The opposite is the based design that can be layout in contigoues memory. I believe this is an extremely important point to consider when performance is concerned that cannot be underestimated.

To illustrate this, consider the following:

std::vector<Base*> runtime_poly_;//risk of fragmentation

vs.

std::vector<my_var_type> cp_time_poly_;//no fragmentation (but padding 'risk')

This fragmentation is somewhat difficult to built into a benchmark test like this one. If this is (also) within the context of bjarne's statement is not clear to me when he said it could potentially be faster (which I do believe holds true).

Another very important thing to remember for the std::variant based design is that the size of each element uses up the size of the largest possible element. Therefore if objects do not have roughly the same size this has to be considered carefully since it may have a bad impact on the cache as a result.

Considering these points together it's hard to say which is best to use in the general case - however it should be clear enough if the set is a closed 'smallish' one of roughtly the same size - then the variant style shows great potential for being faster (as bjarne notes).

We now only considered performance and there are indeed other reasons for choosing one or the other pattern: In the end, you just have to get out the comfort of the 'lab' and design and benchmark your real world use cases.

darune
  • 10,480
  • 2
  • 24
  • 62
  • This may be a silly question, but why do you suggest `std::vector` in the example, instead of `std::vector`? – luizfls Apr 21 '21 at 06:17
  • I think it is for space saving since better to create a vector of pointers pointing to heap allocated objects rather than objects itself. – Sankalp Pandya May 19 '21 at 09:21
  • 3
    @luizfls `std::vector` is not polymorphic – darune Aug 05 '21 at 18:06
  • 1
    @luizfls lookup "Object slicing in C++" to understand the usage of `Base*`. In short using Base would mean everything not in Base but in its derived classes is "sliced off". – user1556435 Aug 31 '21 at 13:26
6

You can match them all with a visit implementation if you can guarantee that the variant will never be empty by exception. Here is a single visitation visitor that matches the virtual above and inlines very well with jmp tables. https://gcc.godbolt.org/z/kkjACx

struct overload : Fs... {
  using Fs::operator()...;
};

template <typename... Fs>
overload(Fs...) -> overload<Fs...>;

template <size_t N, typename R, typename Variant, typename Visitor>
[[nodiscard]] constexpr R visit_nt(Variant &&var, Visitor &&vis) {
  if constexpr (N == 0) {
    if (N == var.index()) {
      // If this check isnt there the compiler will generate
      // exception code, this stops that
      return std::forward<Visitor>(vis)(
          std::get<N>(std::forward<Variant>(var)));
    }
  } else {
    if (var.index() == N) {
      return std::forward<Visitor>(vis)(
          std::get<N>(std::forward<Variant>(var)));
    }
    return visit_nt<N - 1, R>(std::forward<Variant>(var),
                              std::forward<Visitor>(vis));
  }
  while (true) {
  }  // unreachable but compilers complain
}

template <class... Args, typename Visitor, typename... Visitors>
[[nodiscard]] constexpr decltype(auto) visit_nt(
    std::variant<Args...> const &var, Visitor &&vis, Visitors &&... visitors) {
  auto ol =
      overload{std::forward<Visitor>(vis), std::forward<Visitors>(visitors)...};
  using result_t = decltype(std::invoke(std::move(ol), std::get<0>(var)));

  static_assert(sizeof...(Args) > 0);
  return visit_nt<sizeof...(Args) - 1, result_t>(var, std::move(ol));
}

template <class... Args, typename Visitor, typename... Visitors>
[[nodiscard]] constexpr decltype(auto) visit_nt(std::variant<Args...> &var,
                                                Visitor &&vis,
                                                Visitors &&... visitors) {
  auto ol =
      overload(std::forward<Visitor>(vis), std::forward<Visitors>(visitors)...);
  using result_t = decltype(std::invoke(std::move(ol), std::get<0>(var)));

  static_assert(sizeof...(Args) > 0);
  return visit_nt<sizeof...(Args) - 1, result_t>(var, std::move(ol));
}

template <class... Args, typename Visitor, typename... Visitors>
[[nodiscard]] constexpr decltype(auto) visit_nt(std::variant<Args...> &&var,
                                                Visitor &&vis,
                                                Visitors &&... visitors) {
  auto ol =
      overload{std::forward<Visitor>(vis), std::forward<Visitors>(visitors)...};
  using result_t =
      decltype(std::invoke(std::move(ol), std::move(std::get<0>(var))));

  static_assert(sizeof...(Args) > 0);
  return visit_nt<sizeof...(Args) - 1, result_t>(std::move(var), std::move(ol));
}

template <typename Value, typename... Visitors>
inline constexpr bool is_visitable_v = (std::is_invocable_v<Visitors, Value> or
                                        ...);

You call it with the variant first, followed by the visitors. Here is the Update 6 quickbench with it added Quickbench benchmark showing performance of  visit_nt. A link to the bench is here http://quick-bench.com/98aSbU0wWUsym0ej-jLy1POmCBw

So with that, I think the decision of whether or not to visit comes down to what is more expressive and clear in intent. The performance can be achieved either way.

Beached
  • 1,608
  • 15
  • 18
4

Based on Update 6 http://quick-bench.com/LhdP-9y6CqwGxB-WtDlbG27o_5Y

I guess we can't compare time but relative to each other results seems different enough to show choices in the library implementation.

  • Visual 2019 v16.8.3

  • cl 19.28.29335 x64

  • compile in /std:c++17

     Run on (8 X 3411 MHz CPU s)
        CPU Caches:
        L1 Data 32 KiB (x4)
        L1 Instruction 32 KiB (x4)
        L2 Unified 256 KiB (x4)
        L3 Unified 8192 KiB (x1)
    
     -------------------------------------------------------------------
     Benchmark                         Time             CPU   Iterations
     -------------------------------------------------------------------
     TradeSpaceForPerformance       5.41 ns         5.47 ns    100000000
     Virtual                        11.2 ns         10.9 ns     56000000
     FunctionPointerList            13.2 ns         13.1 ns     56000000
     Index                          4.37 ns         4.37 ns    139377778
     GetIf                          4.79 ns         4.87 ns    144516129
     HoldsAlternative               5.08 ns         5.16 ns    100000000
     ConstexprVisitor               4.16 ns         4.14 ns    165925926
     StructVisitor                  4.26 ns         4.24 ns    165925926
     Overload                       4.21 ns         4.24 ns    165925926
    
ColdCat
  • 1,192
  • 17
  • 29
2

I added AutoVisit and ConstVisit here: https://quick-bench.com/q/0aaZvQ0jQ0msy_-VrxgFTlbYBBY

    auto res = std::visit([](auto && v) { return v.get(); }, *package);

This is by far the shortest solution.

and also moved all the random init stuff into a macro to improve readability of the various implementations.

Willem Hengeveld
  • 2,758
  • 23
  • 19