
The task is simple: we have a type list (e.g. `using list = std::tuple<A, B, C, ...>;`) and we want to dispatch on its contents based on an index known only at runtime. By "dispatch" I mean calling a templated handler parameterized by a type from this list. It's easy to write the switch manually:

template <class... Args>
auto dispatch(size_t i, Args&& ...args)
{
  switch (i) {
    case 0: return handler<typename std::tuple_element<0, list>::type>(std::forward<Args>(args)...);
    ...
  }
}

But that is tedious, error-prone and boring. I'm looking for a way to make the compiler generate it from some more compact and terse definition.

And I want a result identical, or very close, to the manual switch. So generating a table of function pointers (like here) is not a desired solution (I don't want to introduce a function call in cases where the handlers are perfectly inlinable).
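
For reference, the kind of function-pointer-table dispatch I mean (and want to avoid) looks roughly like this - just a sketch reusing the `handler` and `list` placeholders from above; `handler_thunk`/`dispatch_table` are made-up names:

// Sketch of the undesired pointer-table dispatch: every index becomes a
// separate thunk function, and the call goes through table[i] - an
// indirect call that is usually not inlined. Requires <tuple> and <utility>.
template <size_t N, class... Args>
auto handler_thunk(Args&& ...args)
{
  return handler<typename std::tuple_element<N, list>::type>(std::forward<Args>(args)...);
}

template <size_t... I, class... Args>
auto dispatch_table_impl(std::index_sequence<I...>, size_t i, Args&& ...args)
{
  using fn = decltype(&handler_thunk<0, Args...>);
  static constexpr fn table[] = { &handler_thunk<I, Args...>... };
  return table[i](std::forward<Args>(args)...);
}

template <class... Args>
auto dispatch_table(size_t i, Args&& ...args)
{
  return dispatch_table_impl(std::make_index_sequence<std::tuple_size<list>::value>{},
                             i, std::forward<Args>(args)...);
}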

So far, I've found two ways:

  1. Inlinable recursion (thanks to Horstling)

    template <size_t N>
    struct dispatch_helper
    {
        template <class... Args>
        static R dispatch(size_t i, Args&& ...args)
        {
            // Found the runtime index: instantiate the handler for element N.
            if (N == i)
                return handler<typename std::tuple_element<N, list>::type>(std::forward<Args>(args)...);
            return dispatch_helper<N+1>::dispatch(i, std::forward<Args>(args)...);
        }
    };
    
    // Terminator: one past the last index of `list` (hard-coded here).
    template <>
    struct dispatch_helper<200>
    {
        template <class... Args>
        static R dispatch(size_t type, Args&& ...args)
        {
            return {};
        }
    };
    
    template <class... Args>
    auto dispatch_recursion(size_t i, Args&& ...args)
    {
        return dispatch_helper<0>::dispatch(i, std::forward<Args>(args)...);
    }
    
  2. Usage of std::initializer_list

    template <class L>
    struct Iterator;
    
    template <template <class...> class L, class... T>
    struct Iterator<L<T...>>
    {
        template <class... Args>
        static R dispatch(size_t i, Args&& ...args)
        {
            R res;
            // Expand over all indexed entries; only the entry whose index
            // matches `i` actually calls the handler.
            std::initializer_list<int> l {
                (
                    i == T::value ? (res = handler<typename T::type>(std::forward<Args>(args)...), 0) : 0
                )...
            };
            (void)l;
            return res;
        }
    };
    
    template <class... Args>
    auto dispatch_type_list_iter(size_t i, Args&& ...args)
    {
        return Iterator<index_list<list>>::dispatch(i, std::forward<Args>(args)...);
    }
    

    I'm omitting some details here, like converting the type list into an indexed type list or how to deal with the return type, but I hope the idea is clear.
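
    For reference, a minimal `index_list` helper could look roughly like this (just my sketch of the omitted part; `indexed` is a made-up name):

    // Pairs every element of a type list with its index, so the Iterator
    // above can compare T::value against the runtime index and instantiate
    // handler<typename T::type>. Requires <cstddef>, <tuple> and <utility>.
    template <std::size_t I, class T>
    struct indexed
    {
        static constexpr std::size_t value = I;
        using type = T;
    };
    
    template <class L, class Seq>
    struct index_list_impl;
    
    template <template <class...> class L, class... T, std::size_t... I>
    struct index_list_impl<L<T...>, std::index_sequence<I...>>
    {
        using type = L<indexed<I, T>...>;
    };
    
    // index_list<std::tuple<A, B, C>>
    //   -> std::tuple<indexed<0, A>, indexed<1, B>, indexed<2, C>>
    template <class L>
    using index_list = typename index_list_impl<
        L, std::make_index_sequence<std::tuple_size<L>::value>>::type;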

I tried these on godbolt: Clang 4.0 generates a "logarithmic" switch (a tree of small switches) for the first approach, and code identical to the manual switch for the second. I played with the handler's size and with inlinable vs. non-inlinable handlers, and the results look stable.

But GCC and ICC both generate a plain sequence of conditionals (for both approaches), which is very sad.

So, are there other solutions? Especially ones that work in at least Clang and GCC.

  • I have tried something similar to your problem here: [Compiler Explorer](https://godbolt.org/g/KRGvwY). Clang inlines the dispatch whether it is based on a switch statement or on a table lookup. GCC seems to inline neither of them! I also remember implementing a function using a table lookup and seeing the compiler generate switch-like assembly. I suppose that as long as the table is constexpr, compilers are often smart enough to convert a table lookup into a switch and vice versa, depending on which one their internal analysis concludes is more appropriate. So I believe what you ask is impossible. – Oliv Aug 25 '17 at 10:41
  • I've never found a technique that generates assembly as good as a switch-case. Consider this example: https://godbolt.org/g/aC2zLu. Even here the switch produces better code! Bottom line: learn to love Boost.PP. It makes working with the preprocessor quite bearable; it's a really nice library. – Nir Friedman Aug 25 '17 at 12:42
  • https://stackoverflow.com/a/45380228/841108 is a relevant answer to a similar question (similar motivations) – Basile Starynkevitch Aug 25 '17 at 12:48
  • @NirFriedman: Clang 4.0 can generate code identical to a manual switch. It's GCC (and ICC) that are failing me... – Alexander Morozov Aug 25 '17 at 12:52
  • @BasileStarynkevitch how is that solution better than a manual switch? It's even more verbose. – Alexander Morozov Aug 25 '17 at 12:53
  • @AlexanderMorozov I took my example and changed it to clang 4.0, and it still does an indirect call. – Nir Friedman Aug 25 '17 at 13:55
  • A logarithmic switch is also not as good as a jump table. I'd be curious to see an example where any compiler generates a bona fide jump table from anything other than a switch case. – Nir Friedman Aug 25 '17 at 14:04
  • @NirFriedman I provided the godbolt link for my two solutions in the post, but here it is again: https://godbolt.org/g/bEhz44 - the solution with `std::initializer_list` is equivalent to the manual switch. – Alexander Morozov Aug 27 '17 at 08:03
  • @NirFriedman actually, Clang is able to fold a sequence of conditionals (on one variable) into a table, no matter whether we write such a sequence manually or it's implicit; for example: https://godbolt.org/g/L1CdN7 – Alexander Morozov Aug 29 '17 at 04:58

4 Answers


You can do the binary search yourself:

#include <tuple>
#include <iostream>

template <class T>
void function(T arg) {
    std::cout << arg << std::endl;
}

template <size_t begin, size_t end, class Tuple>
struct  call_on_element;

template <size_t begin, class Tuple>
struct call_on_element<begin, begin, Tuple> {
    static void call(size_t, const Tuple& t) {
        function(std::get<begin>(t));
    }
};

template <size_t begin, size_t end, class Tuple>
struct call_on_element {
    static void call(size_t i, const Tuple& t) {
        constexpr size_t half = begin + (end - begin) / 2;
        if(i <= half) {
            call_on_element<begin, half, Tuple>::call(i, t);
        } else {
            call_on_element<half+1, end, Tuple>::call(i, t);
        }
    }
};

template <class... Args>
void dispatch(size_t i, Args&&... args) {
    const auto tuple = std::make_tuple(std::forward<decltype(args)>(args)...);
    call_on_element<0, sizeof...(Args)-1, decltype(tuple)>::call(i , tuple);
}

int main() {

    int i = 42;
    char c = 'y';
    float f = 1337;

    dispatch(0, i, c, f);
    dispatch(1, i, c, f);
    dispatch(2, i, c, f);

};

If you are able to use C++14, with auto return type deduction and generic lambdas, you can go even further:

#include <tuple>
#include <iostream>
#include <string>

template <class Function, size_t begin, size_t end, class Tuple>
struct call_on_element;

// Base case: the index range has collapsed to the single index `begin`,
// so call f on that tuple element.
template <class Function, size_t begin, class Tuple>
struct call_on_element<Function, begin, begin, Tuple> {
    static auto call(Function f, size_t, const Tuple& t) {
        return f(std::get<begin>(t));
    }
};

// Recursive case: binary-search the index range [begin, end].
template <class Function, size_t begin, size_t end, class Tuple>
struct call_on_element {
    static auto call(Function f, size_t i, const Tuple& t) {
        constexpr size_t half = begin + (end - begin) / 2;
        if(i <= half) {
            return call_on_element<Function, begin, half, Tuple>::call(f, i, t);
        } else {
            return call_on_element<Function, half + 1, end, Tuple>::call(f, i, t);
        }
    }
};

template <class Function, class... Args>
auto dispatch(Function f, size_t i, Args&&... args) {
    const auto tuple = std::make_tuple(std::forward<decltype(args)>(args)...);
    return call_on_element<Function, 0, sizeof...(Args) - 1, decltype(tuple)>::call(f, i , tuple);
}


struct Functor {
    template <class T>
    std::string operator()(T arg) {
        std::cout << arg << std::endl;
        return "executed functor";
    }
};

int main() {

    int i = 42;
    char c = 'y';
    float f = 1337;

    std::cout << "using Functor" << std::endl;
    dispatch(Functor(), 0, i, c, f);
    dispatch(Functor(), 1, i, c, f);
    auto s = dispatch(Functor(), 2, i, c, f);

    std::cout << "return value of dispatch = " << s << std::endl;

    std::cout << "using lambda" << std::endl;
    dispatch([](auto arg) { std::cout << arg << std::endl;}, 0, i, c, f);
    dispatch([](auto arg) { std::cout << arg << std::endl;}, 1, i, c, f);
    dispatch([](auto arg) { std::cout << arg << std::endl;}, 2, i, c, f);

};

I guess this is harder for the compiler to inline, but it is much more comfortable to use, and inlining is still not impossible.


To check for inlining, I modified the C++14 version to this:

int main() {
    dispatch(1, 42, 'c', 3.14);
};

// compiled with 'g++ -g -O3 -std=c++14 -S main.cpp' (using gcc7.1.1)
// generated the object dump with objdump -d a.out | c++filt

Objdump delivers the following output:

00000000000009a0 <main>:
 9a0:   55                      push   %rbp
 9a1:   53                      push   %rbx
 9a2:   48 8d 3d d7 16 20 00    lea    0x2016d7(%rip),%rdi        # 202080 <std::cout@@GLIBCXX_3.4>
 9a9:   ba 01 00 00 00          mov    $0x1,%edx
 9ae:   48 83 ec 18             sub    $0x18,%rsp
 9b2:   48 8d 74 24 07          lea    0x7(%rsp),%rsi
 9b7:   c6 44 24 07 63          movb   $0x63,0x7(%rsp)
 9bc:   64 48 8b 04 25 28 00    mov    %fs:0x28,%rax
 9c3:   00 00 
 9c5:   48 89 44 24 08          mov    %rax,0x8(%rsp)
 9ca:   31 c0                   xor    %eax,%eax
 9cc:   e8 7f ff ff ff          callq  950 <std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long)@plt>
 9d1:   48 89 c5                mov    %rax,%rbp
 9d4:   48 8b 00                mov    (%rax),%rax
 9d7:   48 8b 40 e8             mov    -0x18(%rax),%rax
 9db:   48 8b 9c 05 f0 00 00    mov    0xf0(%rbp,%rax,1),%rbx
 9e2:   00 
 9e3:   48 85 db                test   %rbx,%rbx
 9e6:   74 67                   je     a4f <main+0xaf>
 9e8:   80 7b 38 00             cmpb   $0x0,0x38(%rbx)
 9ec:   74 2d                   je     a1b <main+0x7b>
 9ee:   0f be 73 43             movsbl 0x43(%rbx),%esi
 9f2:   48 89 ef                mov    %rbp,%rdi
 9f5:   e8 86 ff ff ff          callq  980 <std::basic_ostream<char, std::char_traits<char> >::put(char)@plt>
 9fa:   48 89 c7                mov    %rax,%rdi
 9fd:   e8 5e ff ff ff          callq  960 <std::basic_ostream<char, std::char_traits<char> >::flush()@plt>
 a02:   31 c0                   xor    %eax,%eax
 a04:   48 8b 4c 24 08          mov    0x8(%rsp),%rcx
 a09:   64 48 33 0c 25 28 00    xor    %fs:0x28,%rcx
 a10:   00 00 
 a12:   75 36                   jne    a4a <main+0xaa>
 a14:   48 83 c4 18             add    $0x18,%rsp
 a18:   5b                      pop    %rbx
 a19:   5d                      pop    %rbp
 a1a:   c3                      retq   
 a1b:   48 89 df                mov    %rbx,%rdi
 a1e:   e8 fd fe ff ff          callq  920 <std::ctype<char>::_M_widen_init() const@plt>
 a23:   48 8b 03                mov    (%rbx),%rax
 a26:   48 8d 0d 73 01 00 00    lea    0x173(%rip),%rcx        # ba0 <std::ctype<char>::do_widen(char) const>
 a2d:   be 0a 00 00 00          mov    $0xa,%esi
 a32:   48 8b 50 30             mov    0x30(%rax),%rdx
 a36:   48 39 ca                cmp    %rcx,%rdx
 a39:   74 b7                   je     9f2 <main+0x52>
 a3b:   be 0a 00 00 00          mov    $0xa,%esi
 a40:   48 89 df                mov    %rbx,%rdi
 a43:   ff d2                   callq  *%rdx
 a45:   0f be f0                movsbl %al,%esi
 a48:   eb a8                   jmp    9f2 <main+0x52>
 a4a:   e8 21 ff ff ff          callq  970 <__stack_chk_fail@plt>
 a4f:   e8 bc fe ff ff          callq  910 <std::__throw_bad_cast()@plt>
 a54:   66 90                   xchg   %ax,%ax
 a56:   66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
 a5d:   00 00 00 

The calls to callq std::basic_ostream<...> and je ... make me believe that it actually inlines the function and builds the logarithmic condition tree.

OutOfBound
  • I updated the godbolt test with your variant. Unfortunately, godbolt fails to produce a short url. But the gist is: Clang 4.0 emits a separate code block for every 50 elements in a list, while GCC 7.2 inlines all of it. Unfortunately, both compilers produce a lot of conditionals, not table jumps as for `switch`. – Alexander Morozov Aug 25 '17 at 10:47
  • I thought you didn't want table jumps. Table jumps are easier to implement :D – OutOfBound Aug 25 '17 at 10:51
  • I didn't want a pointer table - a table holding pointers to handler functions. A manual switch is transformed into a jump table, without any function call. See the link to godbolt in my original post; there are several implementations there, including the manual switch. – Alexander Morozov Aug 25 '17 at 12:36

For keeping a single point of declaration, I am a huge fan of Boost.Preprocessor. I agree these are macros and you may despise them, but why not use them when they actually improve code readability and maintenance?

#include <boost/preprocessor.hpp>

#include <iostream>
#include <tuple>
#include <utility>   // std::forward
#include <typeinfo>  // only needed for the handler() example below

template <class T>
void handler(int x) {
  std::cout << typeid(T).name() << ' ' << x << '\n';
}

#define TYPE_LIST (int)(char)(bool) //just declare your types here

using list = std::tuple<BOOST_PP_SEQ_ENUM(TYPE_LIST)>;

template <class... Args>
auto dispatch(size_t i, Args&& ...args)
{
  switch (i) {
#define TYPE_case(r, data, i, type) case i: return handler<type>(std::forward<Args>(args)...);
    BOOST_PP_SEQ_FOR_EACH_I(TYPE_case,, TYPE_LIST);
#undef TYPE_case
  }
}

int main(int, char**) {
  for (int i = 0; i < std::tuple_size<list>::value; ++i) {
    dispatch(i, i);
  }
}
Boris L.
  • Well, macros are bad for external tools like `grep` or `ack` (or even `ctags`) and more error-prone. And they are harder to debug and understand. I prefer external code generators to macros, but the C++ template machinery is better (no external tools, type safety, easier to debug). – Alexander Morozov Aug 25 '17 at 12:41
  • I agree this is not ideal for external tools. I also prefer templates and only use the preprocessor as a last resort but, in my opinion, having to keep multiple declarations in sync (even more so across multiple files) is the absolute worst. As of yet I have no code generator in my toolchain, so using Boost.Preprocessor has no impact on the build process. – Boris L. Aug 25 '17 at 12:57

How much do you despise MACROS?

template <class... Args>
auto dispatch(size_t i, Args&& ...args)
{
    #define CASE(N) case N: return handler<typename std::tuple_element<N, list>::type>(std::forward<Args>(args)...)
    switch (i) {
        CASE(0);
        CASE(1);
        CASE(2);
    }
    #undef CASE
}
sp2danny
  • Well, this is just the manual switch with a little less typing. The main problem is that when you change the type list, you have to change the switch as well. I aim for a single point of specification (the type list alone should contain all the information needed for its processing). – Alexander Morozov Aug 25 '17 at 10:23

This is a way to get a generic jump table (a table of function pointers). If it is used more than once, it might be a good idea to store the JumpTable inside a class and make dispatch into a functor; a sketch of that variant follows the code below.

#include <tuple>
#include <vector>
#include <iostream>
#include <utility>
#include <experimental/array>

// Returns a stateless lambda that extracts element `id` from the tuple and
// passes it to f. Being capture-less, it converts to a plain function pointer.
template <size_t id, class Tuple, class Function>
auto functionWrapper() {
    return [](const Tuple& tuple, Function f) {
        f(std::get<id>(tuple));
    };
}

// Builds an array with one entry per tuple element.
template <class Tuple, class Function, size_t... id>
auto arrayWrapper(std::index_sequence<id...>) {

    return std::experimental::make_array(
        functionWrapper<id, Tuple, Function>()...
    );

}

template <class Function, class... Args>
void dispatch(Function f, size_t i, Args&&... args) {
    using TupleType = std::tuple<Args...>;

    const auto JumpTable = arrayWrapper<TupleType, Function>(std::make_index_sequence<sizeof...(Args)>());

    // One indirect call through the table, instead of a chain of branches.
    JumpTable[i](std::make_tuple(args...), f);
}

int main() {
    for(int i = 0; i < 3; i++) {
        dispatch(
            [](auto arg){ std::cout << arg << std::endl; },
            i, // index
            42, 'c', 3.14 // args...
        );
    }
};
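
A sketch of the functor variant mentioned above (assumptions: it reuses `functionWrapper` from the code above, needs `#include <array>`, and `Dispatcher` is a made-up name):

// Store the table in a functor so it is built once per (Function, Args...)
// combination instead of on every dispatch call.
template <class Function, class... Args>
class Dispatcher {
    using Tuple = std::tuple<Args...>;
    using Entry = void (*)(const Tuple&, Function);

    template <size_t... id>
    static std::array<Entry, sizeof...(Args)> makeTable(std::index_sequence<id...>) {
        // Each capture-less lambda from functionWrapper converts to Entry.
        return {{ functionWrapper<id, Tuple, Function>()... }};
    }

    std::array<Entry, sizeof...(Args)> table =
        makeTable(std::make_index_sequence<sizeof...(Args)>());

public:
    void operator()(Function f, size_t i, Args... args) const {
        table[i](std::make_tuple(args...), f);  // still one indirect call
    }
};

// Possible usage (hypothetical):
//   auto print = [](auto arg) { std::cout << arg << std::endl; };
//   Dispatcher<decltype(print), int, char, double> d;
//   d(print, 1, 42, 'c', 3.14);  // prints 'c'
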
OutOfBound
  • You looked at the assembly and verified that all the std::function crud gets eliminated? – Nir Friedman Aug 25 '17 at 11:50
  • I looked at the asm, but didn't understand it enough to be sure. I didn't find any `std::function` constructor or `operator()` in the function part, but the main function got so long that I don't understand what is going on. But I don't think that it matters. Either you get inlining with the first approach and O(log n), or you add indirection with the function table and get O(1). Benchmarking the code using both approaches will give more reliable results than trying to interpret the asm code anyway. My guess would be that for `n < 10` inlining will be faster, but who knows. – OutOfBound Aug 25 '17 at 12:05
  • I actually don't think this code is even correct: you're mixing up the tuple argument list with the type list that controls which dispatcher gets called. You don't mention `list` at all, or any type list; you just turn the arguments into a tuple. The arguments to the dispatcher are different from the list of types that index the different dispatchers. What you're saying about perf may be true, but it's unlikely that any of the approaches presented so far will be as fast as a switch-case (if the handlers are small). – Nir Friedman Aug 25 '17 at 12:13
  • The order doesn't matter, there's just no relationship between what you are doing and the original problem. The original problem was to call `handler<tuple_element<i, list>::type>(args...)`. That's not what is happening here. You are basically calling `handler(get<i>(args...))`. – Nir Friedman Aug 25 '17 at 12:28
  • Actually, my goal is to call `handler<typename std::tuple_element<i, list>::type>(args...)`. No need to materialize a type list; it lives only at compile time. But a solution with a pointer table can be tailored to that problem. Only I wish to avoid any function calls, which I don't believe this solution allows. – Alexander Morozov Aug 25 '17 at 12:46
  • Also, here is a pointer-table solution without any `std::function`: https://stackoverflow.com/questions/23430249/optimize-template-replacement-of-a-switch (I mentioned it in the original post). – Alexander Morozov Aug 25 '17 at 12:47
  • @NirFriedman you were right. My solution only worked for tuples with different types. I updated my answer, have a look. – OutOfBound Aug 25 '17 at 12:59
  • @AlexanderMorozov I modified the code to not contain `std::function`. With `__attribute__((always_inline))` you can force every function except the one for the conditional jump to be inlined, but since the compiler doesn't do it on its own it probably doesn't yield any improvement. After all, your problem looks much like premature optimization. A call to a function is not expensive. If you have not benchmarked your code and found that this is your bottleneck, just ignore the calls. – OutOfBound Aug 25 '17 at 13:45
  • @OutOfBound your solution is not a jump table, it's a pointer table. A jump table is what a compiler generates for a usual manual switch. See https://godbolt.org/g/bEhz44 - there are two cases there which produce a jump table. – Alexander Morozov Aug 27 '17 at 08:12
  • @OutOfBound it's not that your solution is bad (although it is indeed not directly relevant to the original problem - you're dispatching values, not types, though a similar solution can be done for types), it's that this solution is known and has particular disadvantages. I mentioned an SO answer with a pointer table in the original post. – Alexander Morozov Aug 27 '17 at 08:16
  • @OutOfBound as for the expense of a function call - it's in general more expensive than a simple jump. So a pointer table is not an equivalent substitute for a jump table, although in some particular cases it can be a good enough substitute - no argument here. But I'm interested in an equivalent substitute. – Alexander Morozov Aug 27 '17 at 08:22