Is std::function slow because the CPU can't utilize instruction reordering?

Question

While developing my project, I found that std::function is really slow.
So I tried to find out why it is so slow.
But I couldn't find an obvious cause.

I think the cause of the poor performance is that the CPU can't make good use of instruction reordering and its pipeline, because it doesn't know which function will be called.
That leads to memory stalls and slow performance.

Am I right?

Is code of your function wrapped in std::function small or big? If it is small, like 3-5 CPU instructions then yes std::function will make it slower, because std::function is not inlined into outer calling code. You should use only lambda and pass lambda as template parameter to other functions, lambdas are inlined into calling code. If your wrapped function code is big then it will make no difference whether it is inside std::function wrapper or not. — Arty, May 20 '21 at 06:56
does https://stackoverflow.com/questions/18608888/c11-stdfunction-slower-than-virtual-calls and https://stackoverflow.com/questions/5057382/what-is-the-performance-overhead-of-stdfunction help ? — Martin Morterol, May 20 '21 at 06:56
That's because of the copy of the instance of the std::function. If the lifetime of your callable allows it, consider using std::ref to explicitly pass the function by reference (avoiding the copy). — fiorentinoing, May 20 '21 at 07:31
It could be one of the reasons. Or maybe not. We know nothing about how you build your project. Did you profile it with optimisations on? What are the results of the profiling? Please show the numbers. — n. m. could be an AI, May 20 '21 at 07:39
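
To illustrate the std::ref suggestion in the comments above, here is a minimal sketch. `CountingAdder`, `call_via_copy` and `call_via_ref` are hypothetical names made up for this example, and whether copying is the actual bottleneck in the asker's project is unknown without a profile.

```cpp
#include <cassert>
#include <functional>

// Hypothetical stateful callable that counts how often it is invoked.
struct CountingAdder {
    int calls = 0;
    int operator()(int x) { ++calls; return x + 1; }
};

// The std::function stores its own copy of the callable,
// so the caller's object is not touched by the call.
int call_via_copy(CountingAdder& target, int x) {
    std::function<int(int)> fn = target;            // copies target
    return fn(x);                                   // increments the copy's counter
}

// The std::function stores a std::reference_wrapper instead,
// so no copy of the callable is made and the call reaches the original.
int call_via_ref(CountingAdder& target, int x) {
    std::function<int(int)> fn = std::ref(target);  // refers to target
    return fn(x);                                   // increments target.calls
}
```

With std::ref the callable must outlive every std::function that refers to it, which is the lifetime condition the comment mentions.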

Arty · Accepted Answer · 2021-05-20T07:48:17.233

Code wrapped in std::function is generally slower than code inlined directly at the call site, especially if the code is very short, like 3-5 CPU instructions.

If your function's code is quite big (hundreds of instructions), then it makes no practical difference whether you use std::function or some other mechanism of calling/wrapping code.

A call through std::function is not inlined. Using a std::function wrapper has almost the same speed overhead as using virtual methods in a class. Moreover, the std::function mechanism looks very much like the virtual call mechanism: in both cases the code is not inlined, and a pointer to the code is used to call it with the assembler's call instruction.
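
To make the comparison with virtual calls concrete, here is a minimal sketch of a type-erased wrapper. `IntFn` is a hypothetical, heavily simplified stand-in for `std::function<int(int)>`; real standard library implementations differ in detail (for example, they often avoid heap allocation for small callables), so treat this only as an illustration of where the indirect call comes from.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical simplified stand-in for std::function<int(int)>.
class IntFn {
    // Abstract interface: the concrete callable type is hidden behind it.
    struct Callable {
        virtual ~Callable() = default;
        virtual int call(int x) const = 0;
    };
    // One Holder type is generated per wrapped callable type F.
    template <class F>
    struct Holder final : Callable {
        F f;
        explicit Holder(F fn) : f(std::move(fn)) {}
        int call(int x) const override { return f(x); }
    };
    std::unique_ptr<Callable> impl_;

public:
    template <class F>
    IntFn(F f) : impl_(new Holder<F>(std::move(f))) {}

    // Every invocation goes through a virtual (indirect) call, which the
    // compiler usually cannot inline because the target is not known here.
    int operator()(int x) const {
        assert(impl_ != nullptr);
        return impl_->call(x);
    }
};

// Template version: the concrete type F is visible, so f(x) can be inlined.
template <class F>
int call_direct(const F& f, int x) { return f(x); }

// Type-erased version: only the wrapper type is visible at the call site.
int call_erased(const IntFn& f, int x) { return f(x); }
```

Both `call_direct` and `call_erased` return the same result for the same callable; the difference is only in how much the compiler can see at the call site.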

If you really need speed, use lambdas and pass them around as template parameters, as below. Lambdas are inlined whenever possible (and if the compiler decides that inlining will improve speed).

Try it online!

#include <functional>

template <typename F>
void __attribute__((noinline)) use_lambda(F const & f) {
    auto volatile a = f(13); // call f
    // ....
    auto volatile b = f(7); // call f again
}

void __attribute__((noinline)) use_func(
        std::function<int(int)> const & f) {
    auto volatile a = f(11); // call f
    // ....
    auto volatile b = f(17); // call f again
}

int main() {
    int x = 123;
    auto f = [&](int y){ return x + y; };
    use_lambda(f); // Pass lambda
    use_func(f); // Pass function
}

If you look at the assembly of the example above (click the Try-it-online link above), you can see that the lambda code was inlined while the std::function code wasn't.

Template parameters are generally faster than the other solutions; prefer templates wherever you need polymorphism together with high performance.

edited May 20 '21 at 07:48

answered May 20 '21 at 07:13

Arty


Thanks. But if i wanna store callable object to call it later, std::function isn't avoidable right?? – SungJinKang May 20 '21 at 07:54
@SungJinKang If you can you should store your function as templated parameter and pass around as templated parameter. Look at my main() code above, I also stored f() lambda inside local variable f. It is not a problem, it doesn't prevent inlining. Just pass around f as templated parameter everywhere. std::function should be used only for polymorphism case, when you want to pass different variants of f() function without using any templates. – Arty May 20 '21 at 07:59
Ok, I understand. I mean if I want to store various functions with the same type in a std::vector, std::function is unavoidable, right? – SungJinKang May 20 '21 at 08:14
@SungJinKang If you want to store different lambdas (different codes) inside std::vector then yes you have to use std::function, it is unavoidable. Otherwise you have to use std::tuple, which allows you to store different types, std::tuple is like a vector that stores multiple types. If you have just 2-3 different types then you can store them as std::tuple of std::vector's, each vector stores an array of same types. – Arty May 20 '21 at 08:45
@SungJinKang -- if you want to store functions **with the same type** you don't need `std::function`. Just create a vector of that type. `std::function`'s job is to make functions with **different types** look alike. – Pete Becker May 20 '21 at 12:21
@PeteBecker I probably misunderstood. When @SungJinKang said he wants to store functions with the **same** type, I thought he meant they have the same signature, e.g. `void()`, but different code; in that case std::function is unavoidable. But if **same** type here means all functions come from the same source code, e.g. a pointer to a single global function `&ConcreteFunc`, then yes, this **single** type can be stored without std::function, just as a raw type, and hence will be inlined correctly when called. – Arty May 20 '21 at 13:53
@Arty Functions with the same type (`void()`) can be stored as a pointer-to-function: `void (*)()`. So `std::vector<void (*)()>` can hold pointers to any functions that take no arguments and return `void`. `void f() { std::cout << "f\n"; }` and `void g() { std::cout << "g\n"; }` have the same type. `std::vector<void (*)()> vec; vec.push_back(f); vec.push_back(g);` works just fine (modulo typos and auto-correction). – Pete Becker May 20 '21 at 15:02
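
The vector-of-function-pointers idea from the last comment can be sketched as follows. `VoidFn`, `g_log` and `run_all` are hypothetical names added for this example, and the functions append to a string instead of printing so the call order is easy to check.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Pointer to any function that takes no arguments and returns void.
using VoidFn = void (*)();

// Hypothetical log that records which functions ran, in order.
std::string g_log;

void f() { g_log += "f"; }
void g() { g_log += "g"; }

// Calls every stored function pointer in order and returns the resulting log.
std::string run_all(const std::vector<VoidFn>& fns) {
    g_log.clear();
    for (VoidFn fn : fns) {
        assert(fn != nullptr);
        fn();  // still an indirect call, but without the std::function wrapper
    }
    return g_log;
}
```

This works for plain functions and for lambdas without captures, which convert to function pointers. A lambda that captures state has no such conversion, and that is the case where std::function (or another type-erasing wrapper) is needed.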

Yakk - Adam Nevraumont · Answer 2 · 2021-05-20T13:59:03.310

std::function causes a few kinds of overhead.

First, it is difficult for the compiler to understand. If you had a raw function pointer, some compilers can "undo" the indirection more easily than they can with std::function. However, in these cases, using the raw function pointer or the std::function was often a bad choice in the first place.

Second, std::function is typically implemented with a virtual function table, which results in up to 2 indirections instead of the one of a function pointer. This hit is largest when the virtual function table falls out of your CPU's cache.

Third, C++ compilers are great at inlining, and indirection through a std::function blocks that.

Now, in my experience, this overhead is at its worst when you are doing buffer processing, such as a per-pixel operation.

In this case, you can have millions or billions of pixels you are working on. The work done per pixel is small, and the overhead of going through a std::function call on each and every operation ends up being large compared to the actual work done.

The simplest way to solve this (and related) problems is to save a buffer-processing function instead of a per-element function, like this.

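#include <cstdint>
#include <functional>
#include <span>      // std::span requires C++20
#include <utility>

using Pixel = std::uint32_t;
using Scanline = std::span<Pixel>;
using ScanlineOp = std::function<void(Scanline)>;

// Wraps a per-pixel operation together with the loop over one scanline,
// so std::function is invoked once per scanline instead of once per pixel.
template<class PixelOp>
ScanlineOp MakeScanlineOp( PixelOp op ) {
  return [op=std::move(op)](Scanline line) {
    for (Pixel& p : line)
      op(p);
  };
}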

Here I take the per-pixel operation and save it, along with the iteration code, into a std::function.

Now when processing a 4000 pixel by 4000 pixel image, instead of suffering std::function overhead 16 million times, I run into it 4000 times, which reduces the cost of the overhead by 99.975%.

Make something 4000x faster a few times and you stop caring about how much it costs.
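
The call-count arithmetic above can be checked with a small sketch that counts std::function invocations for both variants. `Row`, `CallCounts` and `count_calls` are hypothetical names for this example; `Row` is a plain pointer-plus-length stand-in for `std::span<Pixel>` so the sketch does not need C++20.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

using Pixel = std::uint32_t;

// Hypothetical stand-in for std::span<Pixel>.
struct Row {
    Pixel* data;
    std::size_t size;
};

// Number of std::function invocations made by each variant.
struct CallCounts {
    std::size_t per_pixel;
    std::size_t per_row;
};

CallCounts count_calls(std::size_t width, std::size_t height) {
    std::vector<Pixel> image(width * height, 0);
    CallCounts counts{0, 0};

    // Variant 1: one std::function call per pixel.
    std::function<void(Pixel&)> pixel_op = [&counts](Pixel& p) {
        ++counts.per_pixel;
        ++p;
    };
    for (Pixel& p : image) pixel_op(p);

    // Variant 2: one std::function call per row; the inner loop lives
    // inside the callable, where the compiler can optimize it freely.
    std::function<void(Row)> row_op = [&counts](Row r) {
        ++counts.per_row;
        for (std::size_t i = 0; i < r.size; ++i) ++r.data[i];
    };
    for (std::size_t y = 0; y < height; ++y)
        row_op(Row{image.data() + y * width, width});

    // Both variants visited every pixel exactly once.
    for (Pixel p : image) assert(p == 2);
    return counts;
}
```

For a 4000 by 4000 image, the first counter ends at 4000 * 4000 = 16,000,000 and the second at 4000, so the ratio of wrapper calls is 4000 to 1, and 1 - 1/4000 = 99.975% of them are avoided.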

Now, std::span is not available in C++11 (it was added in C++20). Here is a toy version:

#include <cstddef>
#include <iterator>
#include <type_traits>
#include <utility>

template<class It>
struct range {
    It b, e;
    using reference = typename std::iterator_traits<It>::reference;
    using value_type = typename std::iterator_traits<It>::value_type;

    range( It s, It f ):b(s), e(f) {}
    It begin() const { return b; }
    It end() const { return e; }
    bool empty() const { return begin()==end(); }
    reference front() const { return *begin(); }
};
template<class It>
struct random_range:range<It> {
    using range<It>::range;
    using reference = typename range<It>::reference;

    reference back() const { return *std::prev(this->end()); }
    std::size_t size() const { return this->end()-this->begin(); }
    reference operator[](std::size_t i) const{ return this->begin()[i]; }
};

template<class T>
struct array_view:random_range<T*> {
    array_view( T* start, T* finish ):random_range<T*>(start, finish) {}
    array_view( T* start, std::size_t length ):array_view(start, start+length) {}
    array_view():array_view(nullptr, nullptr) {}

    template<class C>
    using data_type = typename std::remove_pointer< decltype( std::declval<C>().data() )>::type;
    template<class U>
    static constexpr bool pointer_compatible() {
        return 
            std::is_same<
                typename std::decay<U>::type,
                typename std::decay<T>::type
            >::value
            && std::is_convertible<U*, T*>::value;
    }
    // accept any container whose .data() returns a pointer compatible with T*
    template<class C,
        typename std::enable_if< pointer_compatible<data_type<C>>(), bool >::type = true
    >
    array_view( C&& c ):array_view(c.data(), c.size()) {}
};

The complex part is the last constructor, where I accept a vector or array because its .data() member exists and returns a compatible pointer.

You'd convert code that looks like:

void foreachPixel( PixelOp op, Image img ) {
  for (int i = 0; i < img.height(); ++i)
    for (int j = 0; j < img.width(); ++j)
      op(img[i][j]);
}

to

void foreachPixel( ScanlineOp op, Image img ) {
  for (int i = 0; i < img.height(); ++i)
    op(img.Scanline(i));
}

Now, what I've demonstrated is aimed at one concrete case. The general idea is that you can inject some of the low-level control flow into your std::function and operate one level higher, and thus remove almost all of the std::function overhead.
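
For completeness, here is a self-contained sketch of the scanline version end to end. The answer does not define its `Image` class, so the `Image` below, its `scanline` and `at` members, and the pointer-plus-length `Scanline` struct (used in place of `std::span` to stay within C++14) are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

using Pixel = std::uint32_t;

// Hypothetical stand-in for std::span<Pixel>: a pointer plus a length.
struct Scanline {
    Pixel* first;
    std::size_t count;
    Pixel* begin() const { return first; }
    Pixel* end() const { return first + count; }
};
using ScanlineOp = std::function<void(Scanline)>;

// Hypothetical image type storing its pixels row by row.
class Image {
public:
    Image(std::size_t width, std::size_t height)
        : width_(width), height_(height), pixels_(width * height, 0) {}
    std::size_t height() const { return height_; }
    // View over one row of pixels.
    Scanline scanline(std::size_t row) {
        assert(row < height_);
        return Scanline{pixels_.data() + row * width_, width_};
    }
    Pixel at(std::size_t row, std::size_t col) const {
        assert(row < height_ && col < width_);
        return pixels_[row * width_ + col];
    }
private:
    std::size_t width_;
    std::size_t height_;
    std::vector<Pixel> pixels_;
};

// Bundles the per-pixel operation with the loop over one scanline.
template <class PixelOp>
ScanlineOp make_scanline_op(PixelOp op) {
    return [op = std::move(op)](Scanline line) {
        for (Pixel& p : line) op(p);  // direct call inside the wrapper
    };
}

// One std::function call per row; the image is taken by reference
// so that the modifications are visible to the caller.
void for_each_pixel(const ScanlineOp& op, Image& img) {
    for (std::size_t row = 0; row < img.height(); ++row)
        op(img.scanline(row));
}
```

A call such as `for_each_pixel(make_scanline_op([](Pixel& p) { p += 5; }), img);` then applies the per-pixel lambda to the whole image while crossing the std::function boundary only once per row.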
