C++ templates for performance?

Question

I have seen online a few times it has been mentioned that C++ can be ever faster using templates.

Could someone explain, including at a low level why this is exactly? I always presumed such a "nice" feature would have overhead like most useful concepts.

I am really intrigued by this from a ultra low latency perspective!

`template`s are compile time feature. It may reduce the runtime, but would increase the compilation time. If used without precaution, then it may lead to code bloat in some cases. — iammilind, Jan 19 '12 at 11:29

Steve Jessop · Accepted Answer · 2012-01-19T12:13:38.283

A common example is sorting.

In C, qsort takes a pointer to a comparison function. Generally speaking, there will be one copy of the qsort code, which is not inlined. It will make a call through the pointer to the comparison routine -- this of course is also not inlined.

In C++, std::sort is a template, and it can take a functor object as comparator. There is a different copy of std::sort for each different type used as a comparator. Assuming you use a functor class with overloaded operator(), then the call to the comparator can easily be inlined into this copy of std::sort.

So, templates give you more inlining because there are more copies of the sort code, each of which can inline a different comparator. Inlining is quite a good optimization, and sort routines do a lot of comparisons, so you can often measure std::sort running faster than an equivalent qsort. The cost of this is the chance of much larger code -- if your program uses a lot of different comparators then you get a lot of different copies of the sort routine, each with a different comparator baked into it.

In principle there's no reason why a C implementation can't inline qsort into the place it is called. Then if it was called with the name of the function, the optimizer could in theory observe that at the point it is used, the function pointer must still point to that same function. Then it can inline the call to the function, and the result would be similar to the result with std::sort. But in practice, compilers tend not to take the first step, inlining qsort. That's because (a) it's large, and (b) it's in a different translation unit, usually compiled into some library that your program is linked against, and (c) to do it this way, you'd have an inlined copy of qsort for every call to it, not just a copy for every different comparator. So it would be even more bloated than the C++, unless the implementation could also find a way to common up the code in cases where qsort is called in different places with the same comparator.

So, general-purpose functions like qsort in C tend to have some overheads on account of calls through function pointers, or other indirection[*]. Templates in C++ are a common way of keeping the source code generic, but ensuring that it compiles to a special-purpose function (or several such functions). The special-purpose code hopefully is faster.

It's worth noting that templates are not by any means just about performance. std::sort is itself more general-purpose than qsort in some ways. For example qsort only sorts arrays, whereas std::sort can sort anything that provides a random-access iterator. It can for example sort a deque, which under the covers is several disjoint arrays allocated separately. So the use of templates doesn't necessarily provide any performance benefit, it might be done for other reasons. It just happens that templates do affect performance.

[*] another example with sorting - qsort takes an integer parameter saying how big each element of the array is, and when it moves elements it therefore must call memcpy or similar with the value of this variable. std::sort knows at compile-time the exact type of the elements, and hence the exact size. It can inline a copy constructor call that in turn might translate to instructions to copy that number of bytes. As with the inlined comparator, it's often possible to copy exactly 4 (or 8, or 16, or whatever) bytes faster than you'd get by calling a routine that copies a variable number of bytes, passing it the value 4 (or 8, or 16, or whatever). As before, if you called qsort with a literal value for the size, and that call to qsort was inlined, then the compiler could perform the exact same optimization in C. But in practice you don't see that.

+1 for sort example, although TLDR. – avakar Jan 19 '12 at 12:25 — avakar, Jan 19 '12 at 12:25

score 37 · Answer 2 · answered Jan 19 '12 at 11:50

37

"faster" depends on what you compare it to.

Templates are fully evaluated by the compiler, and so they have zero overhead at runtime. Calling Foo<int>() is exactly as efficient as calling FooInt().

So compared to approaches which rely on more work being done at runtime, for example by calling virtual functions, templates can indeed be faster. Compared to hand-written code written precisely for that scenario, there is zero difference.

So the nice thing about templates isn't that they are "faster" than what you could do otherwise, but that they are "as fast" as hand-written code, while also being generic and reusable.

answered Jan 19 '12 at 11:50

jalf

243,077
51
345
550

11

+1 for providing an useful answer without any of that factorial crap. – R. Martinho Fernandes Jan 19 '12 at 11:57
2

At some point, it does become faster than what you could do otherwise, simply because doing otherwise every time becomes onerous enough that you don't do it well. – Mar 12 '12 at 06:42
1

@Hurkyl: not if you assume infinite time to write the code. :) and that's my point: turning a function into a template won't make it faster. But because it is more reusable, it might free up dev time, which can be used to optimize. – jalf Mar 12 '12 at 08:35

score 10 · Answer 3 · answered Jan 19 '12 at 11:57

Another remarkable example of using templates to improve runtime performance is the Blitz++ numerics library. It pioneered the use of so-called expression templates, using compile-time logic to transform arithmetic expressions involving large vectors and matrices into equivalent ones that are much easier to compile to efficient machine code. For instance, given the following pseudocode:

vector<1000> a = foo(), b = bar(), c = baz(), result;
result = a + b + c;

A naive approach would add each elemnt of a and b together, store the result in a temporary vector, then do the same with c, and finally copy the result into result. Using expression template magic, the resulting code will instead be equivalent to this:

for(int i = 0; i < 1000; ++i) {
    result[i] = a[i] + b[i] + c[i];
}

This is much faster, making better use of cache locality and avoiding unnecessary temporaries along the way. It also avoids aliasing problems, where the compiler cannot prove that two pointers point to distinct memory areas, forcing it to produce unoptimal code. Expression templates are now commonly used in high-performance numerics, as well as having other uses not involving performance, such as the Boost.Spirit parsing library.

Boost.Spirit's performance is _significantly_ affected by the library's use of expression templates. — ildjarn, Jan 24 '12 at 22:00

score 6 · Answer 4 · answered Jan 19 '12 at 11:25

6

I am not sure if you are talking about C++ template metaprogramming : doing some calculation during compilation time so you get result during run time almost instantly. If so,here is a example.

By using template metaprogramming and template specialization to provide the ending condition for the recursion, the factorials used in the program, ignoring any factorial not used, can be calculated at compile-time by this code

template <int N>
struct Factorial 
{
    enum { value = N * Factorial<N - 1>::value };
};

template <>
struct Factorial<0> 
{
    enum { value = 1 };
};

// Factorial<4>::value == 24
// Factorial<0>::value == 1
void foo()
{
    int x = Factorial<4>::value; // == 24
    int y = Factorial<0>::value; // == 1
}

Here is a bit more to read http://en.wikipedia.org/wiki/Template_metaprogramming

answered Jan 19 '12 at 11:25

RoundPi

5,819
7
49
75

Well, that, and you can instantiate `matrix`, which will compute significantly faster than a matrix containing object references that have to be downcast in order to actually use binary operators. – Simon Richter Jan 19 '12 at 11:34
@Simon: I am not sure I got what you mean. Are you saying normal c++ template can speed up calculation as well apart from template metaprogramming? – RoundPi Jan 19 '12 at 11:47
@Gob00st: depends on what you call metaprogramming. In a sense, normal use of templates for type generalization is already metaprogramming, in that it "calculates" the correct function redirections at compile-time; other solutions e.g. with polymorphic base class pointers require an extra vtable lookup before the specialized function is called. – leftaroundabout Jan 19 '12 at 14:44
Exactly. For cases like the multiplication of two numbers, each represented by an object, it'd require more than that if you still wanted to use the same algorithm for different kinds of numbers. – Simon Richter Jan 19 '12 at 15:04
The reason I mentioned the Factorial example is bcos I was assuming that OP already understand the general c++ concept: e.g. inline, virtual funciton etc... – RoundPi Jan 19 '12 at 17:27

score 3 · Answer 5 · answered Jan 19 '12 at 11:28

Most likely they're talking about template metaprogramming, which is a trade between compilation speed for runtime speed. The basic idea is that you can write a program which will execute in the C++ compiler. For example (stolen from wikipedia):

template <int N>
struct Factorial 
{
    enum { value = N * Factorial<N - 1>::value };
};

template <>
struct Factorial<0> 
{
    enum { value = 1 };
};

// Factorial<4>::value == 24
// Factorial<0>::value == 1
void foo()
{
    int x = Factorial<4>::value; // == 24
    int y = Factorial<0>::value; // == 1
}

Thus, to calculate Factorial<4>::value, the compiler needs to "unroll" the template and calculate Factorial<3>::value and so on. This is all done at compile time, which obviously increases the amount of time to compile, but will effectively replace it with a constant value at runtime.

lol same example as Gob00st's answer!~ – Bingo Jan 19 '12 at 11:30 — Bingo, Jan 19 '12 at 11:30
@Bingo Indeed...apparently I need to type faster. – Yuushi Jan 19 '12 at 11:31 — Yuushi, Jan 19 '12 at 11:31
@Bingo lol this example is the most classic one ! – RoundPi Jan 19 '12 at 11:44 — RoundPi, Jan 19 '12 at 11:44

score 0 · Answer 6 · answered Jan 19 '12 at 11:57

The reason templates are considered faster is that they are visible to the compiler.

So while a plain C sorting algorithm would look like this:

void qsort ( void * base, size_t num, size_t size, 
             int ( * comparator ) ( const void *, const void * ) );

Accepting a function pointer to do comparisons and thus inuring a function call on each comparison the C++ version would look like this:

template <class RandomAccessIterator, class StrictWeakOrdering>
void sort(RandomAccessIterator first, RandomAccessIterator last,
                StrictWeakOrdering comp);

Thus comp is a template argument and if it's a class with operator() defined the compiler can inline the implementation of the function into the loop and avoid many function calls.

I don't think the point that was being made is that template meta programming is faster since this is a rarely used feature in most code bases.

C++ templates for performance?

6 Answers6

Linked