
Consider the following synthetic example:

#include <tuple>
#include <cstddef>

inline int fun2(int x) {
    return x;
}
inline int fun2(double x) {
    return 0;   
}
inline int fun2(float x) {
    return -1;   
}

int fun(const std::tuple<int,double,float>& t, std::size_t i) {
    switch(i) {
        case 0: return fun2(std::get<0>(t));
        case 1: return fun2(std::get<1>(t));
        case 2: return fun2(std::get<2>(t));
    }
    return 0; // avoid falling off the end for an out-of-range index
}

The question is: how should I expand this to the general case

template<class... Args> int fun(const std::tuple<Args...>& t, std::size_t i) {
// ?
}

Guaranteeing that:

  1. fun2 can be inlined into fun, and
  2. the search complexity is not worse than O(log(i)) (for large i).

It is known that the optimizer usually emits a lookup jump table or a compile-time binary search tree when a large enough switch statement is expanded. So I would like to keep this property, as it affects performance for a large number of items.

Update #3: I re-measured performance with a uniformly random index value:

    tuple size        1       10      20      100
@TartanLlama
    gcc               ~0      42.9235 44.7900 46.5233
    clang             10.2046 38.7656 40.4316 41.7557
@chris-beck
    gcc               ~0      37.564  51.3653 81.552
    clang             ~0      38.0361 51.6968 83.7704
naive tail recursion
    gcc                3.0798 40.6061 48.6744 118.171
    clang             11.5907 40.6197 42.8172 137.066
manual switch statement
    gcc                       41.7236 
    clang                      7.3768 
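
For reference, the "naive tail recursion" row above is the straightforward linear dispatch. A minimal sketch of one common shape for it (not necessarily the exact code that was benchmarked; it assumes the fun2 overloads above are in scope and throws for an out-of-range index) is:

#include <cstddef>
#include <stdexcept>
#include <tuple>
#include <type_traits>

// Terminating overload: chosen once N has walked past the last element.
template<std::size_t N = 0, class... Args>
typename std::enable_if<N == sizeof...(Args), int>::type
fun(const std::tuple<Args...>&, std::size_t) {
    throw std::out_of_range("tuple index out of range");
}

// Recursive overload: one comparison per step, i.e. O(i) work for index i.
template<std::size_t N = 0, class... Args>
typename std::enable_if<(N < sizeof...(Args)), int>::type
fun(const std::tuple<Args...>& t, std::size_t i) {
    return i == N ? fun2(std::get<N>(t)) : fun<N + 1>(t, i);
}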

Update #2: It seems that clang is able to inline the functions in @TartanLlama's solution, whereas gcc always generates a function call.

0x2207
  • A binary search tree is actually much better than O(log(i)). It's O(log(k)), where k is the number of distinct i in the switch. – Borealid Aug 26 '15 at 20:49
  • https://gist.github.com/lichray/dd803a8bb3461fc842e5 (not my code, it's a C++Now 2015 lightning talk). – T.C. Aug 26 '15 at 22:06
  • @Barry I don't think there's one. – T.C. Aug 27 '15 at 16:33
  • Just as a comment on the optimization results -- a lot of times when I use variants in my programs, I'm doing it for extra flexibility, but one of the data types of the variant is actually the most common. For instance, if you have a `std::vector` of variant messages, often one of the messages is the most common. Or if you have something like a boost property tree, each node is a variant, but if the fanout is large then most of the nodes are leaves (strings). In these situations my implementation will benefit greatly from branch prediction, while a jump table won't. – Chris Beck Aug 27 '15 at 17:40
  • Out of interest, can you share the program you are using for benchmarking so we can poke and prod it? – TartanLlama Aug 27 '15 at 21:37
  • https://gist.github.com/matwey/d2985fb9dcb14302ad33 https://gist.github.com/matwey/3dbeacee1cdaf3217dfd Please also point out where I am wrong. – 0x2207 Aug 28 '15 at 07:21
  • Now, it is fixed, thank you. – 0x2207 Aug 28 '15 at 07:49
  • @0x2207 If I run this program, I get `5500000000 must be 21000000000`. This does not look like a correct result? – dyp Aug 28 '15 at 07:55
  • @0x2207 I think it should be `55*ntry`, because `10*(10+1)/2 == 110/2 == 55` – dyp Aug 28 '15 at 08:02
  • Yes, you are right, 210 is for a tuple of 20 different types. – 0x2207 Aug 28 '15 at 08:03
  • On my machine, using `-O3 -march=native`, clang++ produces a program which is almost **20 times faster** than what g++ produces, for TartanLlama's solution. – dyp Aug 28 '15 at 08:05
  • Interestingly, the reverse happens for Chris Beck's solution :( The time required by the executable produced by clang++ for Chris Beck's solution is of the same order as the time required by the exe produced by g++ for TartanLlama's solution. – dyp Aug 28 '15 at 08:07
  • @dyp this is because clang++ inlined all the functions and calculated 55 for you at compile time; could you look at the assembly? It seems we need a different test suite. – 0x2207 Aug 28 '15 at 08:14
  • @0x2207 Yes, certainly looks like it. OTOH, it seems g++ precomputes the other solution. http://coliru.stacked-crooked.com/a/02276d4082e96c3c – dyp Aug 28 '15 at 08:25
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/88164/discussion-between-0x2207-and-dyp). – 0x2207 Aug 28 '15 at 08:27

2 Answers


OK, I rewrote my answer. This takes a different approach from TartanLlama's and also from what I suggested before. It meets your complexity requirement and doesn't use function pointers, so everything is inlineable.

Edit: Many thanks to Yakk for pointing out, in the comments, a quite significant optimization to the compile-time template recursion depth required.

Basically I make a binary tree of the types / function handlers using templates, and implement the binary search manually.

It might be possible to do this more cleanly using Boost.MPL or Boost.Fusion, but this implementation is self-contained anyway.

It definitely meets your requirements: the functions are inlineable, and the runtime lookup is O(log n) in the number of types in the tuple.

Here's the complete listing:

#include <cassert>
#include <cstdint>
#include <tuple>
#include <iostream>

using std::size_t;

// Basic typelist object
template<typename... TL>
struct TypeList{
   static const int size = sizeof...(TL);
};

// Metafunction Concat: Concatenate two typelists
template<typename L, typename R>
struct Concat;

template<typename... TL, typename... TR>
struct Concat <TypeList<TL...>, TypeList<TR...>> {
    typedef TypeList<TL..., TR...> type;
};

template<typename L, typename R>
using Concat_t = typename Concat<L,R>::type;

// Metafunction First: Get first type from a typelist
template<typename T>
struct First;

template<typename T, typename... TL>
struct First <TypeList<T, TL...>> {
    typedef T type;
};

template<typename T>
using First_t = typename First<T>::type;


// Metafunction Split: Split a typelist at a particular index
template<int i, typename TL>
struct Split;

template<int k, typename... TL>
struct Split<k, TypeList<TL...>> {
private:
    typedef Split<k/2, TypeList<TL...>> FirstSplit;
    typedef Split<k-k/2, typename FirstSplit::R> SecondSplit;
public:
    typedef Concat_t<typename FirstSplit::L, typename SecondSplit::L> L;
    typedef typename SecondSplit::R R;
};

template<typename T, typename... TL>
struct Split<0, TypeList<T, TL...>> {
    typedef TypeList<> L;
    typedef TypeList<T, TL...> R;
};

template<typename T, typename... TL>
struct Split<1, TypeList<T, TL...>> {
    typedef TypeList<T> L;
    typedef TypeList<TL...> R;
};

template<int k>
struct Split<k, TypeList<>> {
    typedef TypeList<> L;
    typedef TypeList<> R;
};


// Metafunction Subdivide: Split a typelist into two roughly equal typelists
template<typename TL>
struct Subdivide : Split<TL::size / 2, TL> {};

// Metafunction MakeTree: Make a tree from a typelist
template<typename T>
struct MakeTree;

/*
template<>
struct MakeTree<TypeList<>> {
    typedef TypeList<> L;
    typedef TypeList<> R;
    static const int size = 0;
};*/

template<typename T>
struct MakeTree<TypeList<T>> {
    typedef TypeList<> L;
    typedef TypeList<T> R;
    static const int size = R::size;
};

template<typename T1, typename T2, typename... TL>
struct MakeTree<TypeList<T1, T2, TL...>> {
private:
    typedef TypeList<T1, T2, TL...> MyList;
    typedef Subdivide<MyList> MySubdivide;
public:
    typedef MakeTree<typename MySubdivide::L> L;
    typedef MakeTree<typename MySubdivide::R> R;
    static const int size = L::size + R::size;
};

// Typehandler: What our lists will be made of
template<typename T>
struct type_handler_helper {
    typedef int result_type;
    typedef T input_type;
    typedef result_type (*func_ptr_type)(const input_type &);
};

template<typename T, typename type_handler_helper<T>::func_ptr_type me>
struct type_handler {
    typedef type_handler_helper<T> base;
    typedef typename base::func_ptr_type func_ptr_type;
    typedef typename base::result_type result_type;
    typedef typename base::input_type input_type;

    static constexpr func_ptr_type my_func = me;
    static result_type apply(const input_type & t) {
        return me(t);
    }
};

// Binary search implementation
template <typename T, bool b = (T::L::size != 0)>
struct apply_helper;

template <typename T>
struct apply_helper<T, false> {
    template<typename V>
    static int apply(const V & v, size_t index) {
        assert(index == 0);
        return First_t<typename T::R>::apply(v);
    }
};

template <typename T>
struct apply_helper<T, true> {
    template<typename V>
    static int apply(const V & v, size_t index) {
        if( index >= T::L::size ) {
            return apply_helper<typename T::R>::apply(v, index - T::L::size);
        } else {
            return apply_helper<typename T::L>::apply(v, index);
        }
    }
};

// Original functions

inline int fun2(int x) {
    return x;
}
inline int fun2(double x) {
    return 0;   
}
inline int fun2(float x) {
    return -1;   
}

// Adapted functions
typedef std::tuple<int, double, float> tup;

inline int g0(const tup & t) { return fun2(std::get<0>(t)); }
inline int g1(const tup & t) { return fun2(std::get<1>(t)); }
inline int g2(const tup & t) { return fun2(std::get<2>(t)); }

// Registry

typedef TypeList<
   type_handler<tup, &g0>,
   type_handler<tup, &g1>,
   type_handler<tup, &g2>
> registry;

typedef MakeTree<registry> jump_table;

int apply(const tup & t, size_t index) {
    return apply_helper<jump_table>::apply(t, index);
}

// Demo

int main() {
    {
        tup t{5, 1.5, 15.5f};

        std::cout << apply(t, 0) << std::endl;
        std::cout << apply(t, 1) << std::endl;
        std::cout << apply(t, 2) << std::endl;
    }

    {
        tup t{10, 1.5, 15.5f};

        std::cout << apply(t, 0) << std::endl;
        std::cout << apply(t, 1) << std::endl;
        std::cout << apply(t, 2) << std::endl;
    }

    {
        tup t{15, 1.5, 15.5f};

        std::cout << apply(t, 0) << std::endl;
        std::cout << apply(t, 1) << std::endl;
        std::cout << apply(t, 2) << std::endl;
    }

    {
        tup t{20, 1.5, 15.5f};

        std::cout << apply(t, 0) << std::endl;
        std::cout << apply(t, 1) << std::endl;
        std::cout << apply(t, 2) << std::endl;
    }
}

Live on Coliru: http://coliru.stacked-crooked.com/a/3cfbd4d9ebd3bb3a
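
As written, the demo hard-codes the g0/g1/g2 adapters and the registry for the three-element tuple. For the general template<class... Args> case asked about in the question, the registry could be generated from the tuple type itself. Below is a sketch of one way to do that on top of the machinery above; the names IndexSeq, MakeIndexSeq, MakeRegistry, g and apply_general are my own additions (not part of the original answer), and a non-empty tuple is assumed since MakeTree<TypeList<>> is left out of the listing:

// Small index-sequence helper (C++11 has no std::index_sequence).
template<std::size_t... Is>
struct IndexSeq {};

template<std::size_t N, std::size_t... Is>
struct MakeIndexSeq : MakeIndexSeq<N - 1, N - 1, Is...> {};

template<std::size_t... Is>
struct MakeIndexSeq<0, Is...> { typedef IndexSeq<Is...> type; };

// Generic adapter, replacing the hand-written g0/g1/g2.
template<typename Tup, std::size_t I>
inline int g(const Tup & t) { return fun2(std::get<I>(t)); }

// Build the registry TypeList from an index sequence.
template<typename Tup, typename Seq>
struct MakeRegistry;

template<typename Tup, std::size_t... Is>
struct MakeRegistry<Tup, IndexSeq<Is...>> {
    typedef TypeList< type_handler<Tup, &g<Tup, Is> >... > type;
};

// General entry point; requires a fun2 overload for every element type.
template<typename... Args>
int apply_general(const std::tuple<Args...> & t, std::size_t index) {
    typedef std::tuple<Args...> Tup;
    typedef typename MakeRegistry<
        Tup, typename MakeIndexSeq<sizeof...(Args)>::type>::type Registry;
    return apply_helper< MakeTree<Registry> >::apply(t, index);
}

With this in place, apply_general(t, index) behaves like the hard-coded apply(t, index) for the demo tuple, but works for any tuple whose element types all have a fun2 overload.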

Chris Beck
  • Also, T.C.'s comment above shows another solution in this vein. – Chris Beck Aug 26 '15 at 23:26
  • Split can be written with less than O(n) recursive depth with a few tricks. And avoiding O(n) recursive depth is key, as there is a relatively shallow limit on template recursion in C++11 in both theory and practice. Replace `Prepend` with `Concat` (which is about as easy to write as Prepend). Then `Split<k, X>::left` is `Concat_t< Split<k/2, X>::left, Split<k - k/2, Split<k/2, X>::right>::left >`, and `Split<k, X>::right` is `Split<k - k/2, Split<k/2, X>::right>::right`. – Yakk - Adam Nevraumont Aug 27 '15 at 02:14
  • I tried to work through this but I don't get it. What is X here, the typelist? X needs to be getting smaller at each level of the recursion or it doesn't work. I believed that with current C++ template partial specialization rules, I can only have one "..." item and it has to be the last. So I can only possibly pull O(1) items off the front of any type list in any one iteration. So if I have a typelist of length O(n) I think I need recursive depth O(n). (Please correct me if I'm wrong.) I thought the only way to get less would be to pass the arguments as a type tree instead of as a type list. – Chris Beck Aug 27 '15 at 02:54
  • Yes, X is the list. Split needs to operate on `TypeList` not `Ts...` for the above to work (as the recursions take returned typelists, and feed them to other `Split`s). You can peel 1 off per step, but instead of a linear expansion the above builds a binary tree, with the 'split 1 off' as leaves. Depth is logarithmic, volume is linear (2n). The code blocks in my previous comment are almost the code required (missing `typename` spam). – Yakk - Adam Nevraumont Aug 27 '15 at 03:02
  • Is it actually logarithmic when you do it that way though? Because then the right side needs to rely on the result of the left side. If I typedef the result of the left and use it in the right, does the template engine expand that out recursively or do it the smart way? Edit: Okay I think I get it now, going to try to finish it – Chris Beck Aug 27 '15 at 03:06
  • No, I'm a disbeliever again. I must be missing some basic TMP technique. Here's something easier than `Split` which I don't see how to do with o(n) recursive depth -- get the last element of a typelist. If you can't do that, then you can't do `Split<T::size - 1, T>` in o(n) either, where n = T::size. – Chris Beck Aug 27 '15 at 03:21
  • Ok, I get it now :) thank you, I'm sure that I learned a crucial TMP technique here – Chris Beck Aug 27 '15 at 03:33

If you make fun2 into a class with an overloaded operator():

struct fun2 {
    inline int operator()(int x) {
        return x;
    }
    inline int operator()(double x) {
        return 0;   
    }
    inline int operator()(float x) {
        return -1;   
    }
};

then we can modify dyp's answer from here to work for us.

Note that this would look a lot neater in C++14, as we could have all the return types deduced and use std::index_sequence.
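
The seq and gen_seq helpers come from the linked answer and are not repeated below. A minimal C++11 emulation of them (one common form, written here as an assumption rather than copied from dyp's code), together with the headers the listing relies on, could look like this:

#include <tuple>
#include <type_traits>
#include <utility>

// Minimal C++11 stand-in for std::index_sequence: seq carries the indices,
// and gen_seq<N> ultimately derives from seq<0, 1, ..., N-1>.
template<int... Is>
struct seq {};

template<int N, int... Is>
struct gen_seq : gen_seq<N - 1, N - 1, Is...> {};

template<int... Is>
struct gen_seq<0, Is...> : seq<Is...> {};

Because gen_seq<N> derives from seq<0, ..., N-1>, passing gen_seq<N>{} to a parameter of type seq<Is...> deduces the index pack, which is exactly how the tag dispatcher below uses it.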

//call the function with the tuple element at the given index
template<class Ret, int N, class T, class Func>
auto apply_one(T&& p, Func func) -> Ret
{
    return func( std::get<N>(std::forward<T>(p)) );
}

//call with runtime index
template<class Ret, class T, class Func, int... Is>
auto apply(T&& p, int index, Func func, seq<Is...>) -> Ret
{
    using FT = Ret(T&&, Func);
    //build up a constexpr array of function pointers to index
    static constexpr FT* arr[] = { &apply_one<Ret, Is, T&&, Func>... };
    //call the function pointer at the specified index
    return arr[index](std::forward<T>(p), func);
}

//tag dispatcher
template<class Ret, class T, class Func>
auto apply(T&& p, int index, Func func) -> Ret
{
    return apply<Ret>(std::forward<T>(p), index, func, 
                      gen_seq<std::tuple_size<typename std::decay<T>::type>::value>{});
}

We then call apply and pass the return type as a template argument (you could deduce this using decltype or C++14):

auto t = std::make_tuple(1,1.0,1.0f);
std::cout << apply<int>(t, 0, fun2{}) << std::endl;
std::cout << apply<int>(t, 1, fun2{}) << std::endl;
std::cout << apply<int>(t, 2, fun2{}) << std::endl;

Live Demo

I'm not sure if this will completely fulfil your requirements due to the use of function pointers, but compilers can optimize this kind of thing pretty aggressively. The searching will be O(1) as the pointer array is just built once then indexed directly, which is pretty good. I'd try this out, measure, and see if it'll work for you.

TartanLlama
  • So, `arr[index](std::forward<T>(p), func)` is not inlined. This adds function-call overhead, as I see from the callgrind profile. The question here is whether it is possible to ask the compiler to inline the call, generating switch-like code. – 0x2207 Aug 27 '15 at 15:23
  • I've played around with it a bit and I can't find a way to get it to inline. Looks like Chris' solution will work better for you, unless someone can work some magic on this. – TartanLlama Aug 27 '15 at 21:30
  • I've asked on the gcc mailing list about the technical possibility of inlining this code, but no response yet. – 0x2207 Aug 28 '15 at 07:16
  • @0x2207 clang++ seems to do some kind of inlining: http://goo.gl/jnY5yx It compiles the jump table to a single `jmpq` instruction; of course the `operator()` overloads are inlined into the `apply_one` functions. – dyp Aug 28 '15 at 07:32