0

I have a very long (in number of iterations) for loop, and I like to make it possible to personalize some of its parts. The code looks as following:

function expensive_loop( void (*do_true)(int),  void (*do_false)(int)){
    for(i=0; i<VeryLargeN; i++){
       element=elements[i]
       // long computation that produce a boolean condition
       if (condition){ 
         do_true(element); 
       }else{
         do_false(element);
       }
    }
}

Now, the problem is that every time do_true and do_false are called, there is an overhead due to the push/pop of the stack that ruins the high performance of the code.

To solve this I could simply create several copies of the expensive_loop function, each with its own do_true and do_false implementation. This will make impossible the code to mantain.

So, how does someone make the internal part of an iteration so it can be personalized, and still mantain high performance?

Antonio Ragagnin
  • 2,278
  • 4
  • 24
  • 39
  • 5
    There is no language C/C++. Do you use C or C++? And your question is not clear. This could be an XY problem, at least in C++ where a template might help. – too honest for this site Nov 29 '16 at 18:14
  • 2
    "*Executed* inline in C/C++" is the wrong terminology. Both C and C++ are usually compiled; inlining happens at the discretion of the compiler of choice, whether you specify it or not. Compile to assembler source code and you can find out. – Jongware Nov 29 '16 at 18:14
  • C != C++ and the "native" solution may vary. Also, did you test your code to see what parts were slow? – crashmstr Nov 29 '16 at 18:14
  • C or C++? It makes a big difference here. – aschepler Nov 29 '16 at 18:15
  • 2
    The C++ language has *lambdas* whereas the C language doesn't. The C++ language allows for *function objects*, whereas the C language doesn't. Please edit your question clarifying whether you are using lambdas, function pointers, or function objects. – Thomas Matthews Nov 29 '16 at 18:28
  • http://stackoverflow.com/a/2156899/3528438 – user3528438 Nov 29 '16 at 18:31
  • 2
    **Have you profiled your function** to determine where the execution time is spent? I would believe that dereferencing function pointers would not be where most of the time is spent (unless the passed functions are gobbling up execution time). – Thomas Matthews Nov 29 '16 at 18:32
  • FYI, in assembly language, the difference between calling a function by name and calling via function pointer should be negligible. Essentially, the processor must save the return location, then transfer execution to the function. This can be accomplished in less than 4 instructions. Your main penalty is reloading of the instruction cache or pipeline. You have have already done this by the `if` statement. – Thomas Matthews Nov 29 '16 at 18:34
  • @ThomasMatthews is right. If the condition evaluation takes time, the time spent in calling either true or false functions won't really impact the overall performance. It is better to focus on improving the condition evaluation. – Jorge Bellon Nov 29 '16 at 19:39

2 Answers2

3

Note that the function accepts pointers to functions, so those get called through a pointer. The optimizer may inline those calls through the function pointers if the definitions of expensive_loop and those functions are available and the compiler inlining limits have not been breached.

Another option is to make this algorithm a function template that accepts callable objects (function pointers, objects with a call operator, lambdas), just like standard algorithms do. This way the compiler may have more optimization opportunities. E.g.:

template<class DoTrue, class DoFalse>
void expensive_loop(DoTrue do_true, DoFalse do_false) { 
    // Original function body here.
}

There is -Winline compiler switch for g++:

-Winline

Warn if a function can not be inlined and it was declared as inline. Even with this option, the compiler will not warn about failures to inline functions declared in system headers.

The compiler uses a variety of heuristics to determine whether or not to inline a function. For example, the compiler takes into account the size of the function being inlined and the the amount of inlining that has already been done in the current function. Therefore, seemingly insignificant changes in the source program can cause the warnings produced by -Winline to appear or disappear.

It probably does not warn about a function not being inlined when it is called through a pointer.

Maxim Egorushkin
  • 131,725
  • 17
  • 180
  • 271
  • The definition need to be available, and typically `expensive_loop` will need to be inlined for `do_true` etc. to be too. If it's anything like `std::sort`, you likely won't have inlining, unlike the function object case. – T.C. Nov 29 '16 at 18:26
  • @T.C. Yep, there are compiler inlining limits. – Maxim Egorushkin Nov 29 '16 at 18:28
  • My understanding is that in order to pass a function by pointer (address), it must exist somewhere (and not inlined). So how can a compiler inline the function pointed to by a function pointer? – Thomas Matthews Nov 29 '16 at 18:29
  • @ThomasMatthews It uses constant propagation for that. In `f(g)` the compiler knows that it is the address of `g` and it can just call/inline `g` directly when inlining `f`. – Maxim Egorushkin Nov 29 '16 at 18:31
  • see this answer http://stackoverflow.com/a/2156899/3528438 . this method works well if the functions are functors (one that has the function code in it's class definition, which must be visible). However OP's works with C style function pointers, so the code of those operations is hard for the compiler to find and depends completely the compiler's static code analysis. – user3528438 Nov 29 '16 at 18:34
  • @user3528438 Function pointers do inline as well. You may like to verify your own comment. – Maxim Egorushkin Nov 29 '16 at 18:36
  • Pardon my naivety, but the compiler can't know which function is passed to the `expensive_loop` function. In the code external to this function, there may be one call or there may be many to `expensive_loop`, each one could pass a different function. So I don't see how the *compiler* can replace the dereferencing of the function pointer with an inlined function. – Thomas Matthews Nov 29 '16 at 18:38
  • @ThomasMatthews Write a small test program and observe. Use a modern compiler though, or http://gcc.godbolt.org/. – Maxim Egorushkin Nov 29 '16 at 18:40
  • 1
    If the functions `do_true` and `do_false` are declared on a different translation unit, they are not inlined into `expensive_loop`. `-Winline` does not warn about this inlining not being performed. – Jorge Bellon Nov 29 '16 at 19:32
  • @JorgeBellón You conveniently ignored what i wrote in the second sentence. – Maxim Egorushkin Nov 29 '16 at 19:35
  • @MaximEgorushkin I have not. The compiler does not perform a warning when both `inline` keyword and the `-Winline` flag are set. However, the linker is not able to find the symbols and fails with an undefined reference error. – Jorge Bellon Nov 29 '16 at 19:41
  • @JorgeBellón Then why do you keep banging about functions in a different translation unit? And if you read the docs, `-Winline` does not guarantee a warning in every case. – Maxim Egorushkin Nov 29 '16 at 20:02
  • I say so because the optimization is not guaranteed to be applied in all cases. It can produce optimized code, valid code without inlining or it can produce invalid object files without compilation errors that may well produce errors at link time. Using functors warrantees that the compilation stage either fails (type is not available) or performs the optimization (inline in this case) as expected. – Jorge Bellon Nov 29 '16 at 23:07
  • _the optimization is not guaranteed to be applied in all cases_ and i never said it was. – Maxim Egorushkin Nov 29 '16 at 23:11
2

The problem is that the function address (what actually is set in do_true and do_false is not resolved until link time, where there are not many opportunities for optimization.

If you are explicitly setting both functions in the code (i.e., the functions themselves don't come from an external library, etc.), you can declare your function with C++ templates, so that the compiler knows exactly which functions you want to call at that time.

struct function_one {
  void operator()( int element ) {
  }
};

extern int elements[];
extern bool condition();

template < typename DoTrue, typename DoFalse >
void expensive_loop(){
  DoTrue do_true;
  DoFalse do_false;

  for(int i=0; i<50; i++){
    int element=elements[i];
    // long computation that produce a boolean condition
    if (condition()){ 
      do_true(element); // call DoTrue's operator()
    }else{
      do_false(element); // call DoFalse's operator()
    }
  }
}

int main( int argc, char* argv[] ) {
    expensive_loop<function_one,function_one>();

return 0;
}

The compiler will instantiate an expensive_loop function for each combination of DoTrue and DoFalse types you specify. It will increase the size of the executable if you use more than one combination, but each of them should do what you expect.

For the example I shown, note how the function is empty. The compiler just strips away the function call and leaves the loop:

main:
    push    rbx
    mov     ebx, 50
.L2:
    call    condition()
    sub     ebx, 1
    jne     .L2
    xor     eax, eax
    pop     rbx
    ret

See example in https://godbolt.org/g/hV52Nn

Using function pointers as in your example, may not inline the function calls. This is the produced assembler for main and expensive_loop in a program where expensive_loop

// File A.cpp
void foo( int arg );
void bar( int arg );

extern bool condition();
extern int elements[];

void expensive_loop( void (*do_true)(int),  void (*do_false)(int)){
    for(int i=0; i<50; i++){
       int element=elements[i];
       // long computation that produce a boolean condition
       if (condition()){
         do_true(element);
       }else{
         do_false(element);
       }
    }
}

int main( int argc, char* argv[] ) {
    expensive_loop( foo, bar );

    return 0;
}

and the functions passed by argument

// File B.cpp
#include <math.h>

int elements[50];

bool condition() {
    return elements[0] == 1;
}

inline int foo( int arg ) {
    return arg%3;
}

inline int bar( int arg ) {
    return 1234%arg;
}

are defined in different translation units.

0000000000400620 <expensive_loop(void (*)(int), void (*)(int))>:
  400620:       41 55                   push   %r13
  400622:       49 89 fd                mov    %rdi,%r13
  400625:       41 54                   push   %r12
  400627:       49 89 f4                mov    %rsi,%r12
  40062a:       55                      push   %rbp
  40062b:       53                      push   %rbx
  40062c:       bb 60 10 60 00          mov    $0x601060,%ebx
  400631:       48 83 ec 08             sub    $0x8,%rsp
  400635:       eb 19                   jmp    400650 <expensive_loop(void (*)(int), void (*)(int))+0x30>
  400637:       66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
  40063e:       00 00
  400640:       48 83 c3 04             add    $0x4,%rbx
  400644:       41 ff d5                callq  *%r13
  400647:       48 81 fb 28 11 60 00    cmp    $0x601128,%rbx
  40064e:       74 1d                   je     40066d <expensive_loop(void (*)(int), void (*)(int))+0x4d>
  400650:       8b 2b                   mov    (%rbx),%ebp
  400652:       e8 79 ff ff ff          callq  4005d0 <condition()>
  400657:       84 c0                   test   %al,%al
  400659:       89 ef                   mov    %ebp,%edi
  40065b:       75 e3                   jne    400640 <expensive_loop(void (*)(int), void (*)(int))+0x20>
  40065d:       48 83 c3 04             add    $0x4,%rbx
  400661:       41 ff d4                callq  *%r12
  400664:       48 81 fb 28 11 60 00    cmp    $0x601128,%rbx
  40066b:       75 e3                   jne    400650 <expensive_loop(void (*)(int), void (*)(int))+0x30>
  40066d:       48 83 c4 08             add    $0x8,%rsp
  400671:       5b                      pop    %rbx
  400672:       5d                      pop    %rbp
  400673:       41 5c                   pop    %r12
  400675:       41 5d                   pop    %r13
  400677:       c3                      retq
  400678:       0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
  40067f:       00

You can see how the calls are still performed even when using -O3 optimization level:

400644:       41 ff d5                callq  *%r13
Jorge Bellon
  • 2,901
  • 15
  • 25
  • 1
    You may be surprised to learn that modern compilers do optimize calls through function pointers and member function pointers. – Maxim Egorushkin Nov 29 '16 at 18:22
  • They may do, but this way you make sure it does what you expect (for example, it makes sure that the definition is available in the compilation unit of `expensive_loop` definition while, on the other case it is only required at the time the latter function is called. – Jorge Bellon Nov 29 '16 at 18:38