A common example is sorting.
In C, qsort
takes a pointer to a comparison function. Generally speaking, there will be one copy of the qsort
code, which is not inlined. It will make a call through the pointer to the comparison routine -- this of course is also not inlined.
In C++, std::sort
is a template, and it can take a functor object as comparator. There is a different copy of std::sort
for each different type used as a comparator. Assuming you use a functor class with overloaded operator()
, then the call to the comparator can easily be inlined into this copy of std::sort
.
So, templates give you more inlining because there are more copies of the sort
code, each of which can inline a different comparator. Inlining is quite a good optimization, and sort routines do a lot of comparisons, so you can often measure std::sort
running faster than an equivalent qsort
. The cost of this is the chance of much larger code -- if your program uses a lot of different comparators then you get a lot of different copies of the sort routine, each with a different comparator baked into it.
In principle there's no reason why a C implementation can't inline qsort
into the place it is called. Then if it was called with the name of the function, the optimizer could in theory observe that at the point it is used, the function pointer must still point to that same function. Then it can inline the call to the function, and the result would be similar to the result with std::sort
. But in practice, compilers tend not to take the first step, inlining qsort
. That's because (a) it's large, and (b) it's in a different translation unit, usually compiled into some library that your program is linked against, and (c) to do it this way, you'd have an inlined copy of qsort
for every call to it, not just a copy for every different comparator. So it would be even more bloated than the C++, unless the implementation could also find a way to common up the code in cases where qsort
is called in different places with the same comparator.
So, general-purpose functions like qsort
in C tend to have some overheads on account of calls through function pointers, or other indirection[*]. Templates in C++ are a common way of keeping the source code generic, but ensuring that it compiles to a special-purpose function (or several such functions). The special-purpose code hopefully is faster.
It's worth noting that templates are not by any means just about performance. std::sort
is itself more general-purpose than qsort
in some ways. For example qsort
only sorts arrays, whereas std::sort
can sort anything that provides a random-access iterator. It can for example sort a deque
, which under the covers is several disjoint arrays allocated separately. So the use of templates doesn't necessarily provide any performance benefit, it might be done for other reasons. It just happens that templates do affect performance.
[*] another example with sorting - qsort
takes an integer parameter saying how big each element of the array is, and when it moves elements it therefore must call memcpy
or similar with the value of this variable. std::sort
knows at compile-time the exact type of the elements, and hence the exact size. It can inline a copy constructor call that in turn might translate to instructions to copy that number of bytes. As with the inlined comparator, it's often possible to copy exactly 4 (or 8, or 16, or whatever) bytes faster than you'd get by calling a routine that copies a variable number of bytes, passing it the value 4 (or 8, or 16, or whatever). As before, if you called qsort
with a literal value for the size, and that call to qsort
was inlined, then the compiler could perform the exact same optimization in C. But in practice you don't see that.