
I am trying to find the fastest way to call through a function pointer as a way of selecting a template instantiation at run time, for a finite number of possible arguments. I wrote this benchmark: https://gcc.godbolt.org/z/T1qzTd

I am noticing that function pointers to class member functions have a lot of added overhead that I am having trouble understanding. What I mean is the following:

With a struct bar and function foo defined as follows:

template<uint64_t r>
struct bar {
    template<uint64_t n>
    uint64_t __attribute__((noinline))
    foo() {
        return r * n;
    }
    
    // ... function pointers with pointers to versions of foo below

The first option (#define DO_DIRECT in the godbolt code) calls the templated function by indexing into an array of pointers to member functions, defined as:

   /* all of this inside of struct bar */
   typedef uint64_t (bar::*foo_wrapper_direct)();
   const foo_wrapper_direct call_foo_direct[NUM_FUNCS] = {
      &bar::foo<0>,
      // a bunch more function pointers to templated foo...
   };

   // to call templated foo for non compile time input
   uint64_t __attribute__((noinline)) foo_direct(uint64_t v) {
      return (this->*call_foo_direct[v])();
   }
   

The assembly for this, however, appears to have a TON of fluff:

bar<9ul>::foo_direct(unsigned long):
        salq    $4, %rsi
        movq    264(%rsi,%rdi), %r8
        movq    256(%rsi,%rdi), %rax
        addq    %rdi, %r8
        testb   $1, %al
        je      .L96
        movq    (%r8), %rdx
        movq    -1(%rdx,%rax), %rax
.L96:
        movq    %r8, %rdi
        jmp     *%rax

I am having trouble understanding what all of this is doing.

In contrast the #define DO_INDIRECT method defined as:

// forward declare bar and call_foo_wrapper
template<uint64_t r>
struct bar;

template<uint64_t r, uint64_t n>
uint64_t call_foo_wrapper(bar<r> * b);


/* inside of struct bar */
typedef uint64_t (*foo_wrapper_indirect)(bar<r> *);
const foo_wrapper_indirect call_foo_indirect[NUM_FUNCS] = {
    &call_foo_wrapper<r, 0>
    // a lot more templated versions of foo ...
};

uint64_t __attribute__((noinline)) foo_indirect(uint64_t v) {
    return call_foo_indirect[v](this);
}
/* no longer inside struct bar */

template<uint64_t r, uint64_t n>
uint64_t
call_foo_wrapper(bar<r> * b) {
    return b->template foo<n>();
}

has some very simple assembly:

bar<9ul>::foo_indirect(unsigned long):
        jmp     *(%rdi,%rsi,8)

I am trying to understand why the DO_DIRECT method, which uses pointers directly to the class member functions, has so much fluff, and how, if possible, I can change it to remove the fluff.

Note: I have the __attribute__((noinline)) just to make it easier to examine the assembly.

Thank you.

P.S. If there is a better way of converting runtime parameters to template parameters, I would appreciate a link to an example / manpage.

Peter Cordes
Noah
  • Because in one case one has to: 1) multiply an index by sizeof(pointer), 2) fetch the corresponding value from a read-only array, 3) deal with several different possibilities of interpreting the function pointer in PIC code, and 4) finally call the function through a pointer, and in the other case it jumps directly to step 4? – Sam Varshavchik Sep 27 '20 at 21:51
  • Looks like `addq %rdi, %r8` is offsetting `this` in case of a possible sub-object? I'm also not sure why it needs to branch on the low bit of the member-function pointer. But note that member-function pointers are 16 bytes. BTW, your dispatch table isn't `static`, so there's a copy of it in *each* instance of the class. That's why it's indexing a big offset from the incoming `this`. It would probably still work the same without that, but it's massively inefficient. – Peter Cordes Sep 27 '20 at 22:01
  • @SamVarshavchik Can you explain step 3) in a little more detail or point me to some link. Why is DO_INDIRECT not PIC? – Noah Sep 27 '20 at 22:12
  • @PeterCordes Sorry, this is probably a newbie question, but when I tried to make it static I got undefined reference errors in both DO_INDIRECT and DO_DIRECT. Is there more than just "static constexpr" necessary? – Noah Sep 27 '20 at 22:13
  • @PeterCordes I could make it static if I didn't have template parameter r but that is pretty critical to the class. I think my use case only requests 1 instantiation. Thank you! – Noah Sep 27 '20 at 22:28
  • Have you considered just writing a `switch{case}` statement for dispatch? That might allow a simple jump table of 8-byte entries. – Peter Cordes Sep 27 '20 at 23:13
  • A pointer-to-member-function value does not include the object's address/identity/etc. When a class template declares a `static` member, that means one object per instantiated class. So it should be possible for `bar` to have a static array member containing the pointers `&bar::foo<0>`, etc. If you're having trouble with the syntax for that, that might be worth another question. – aschepler Sep 27 '20 at 23:55
  • Also, I think you could use a recursive CPP macro to expand to an initializer list, instead of hard-coding each entry, perhaps using Boost preprocessor stuff. Or tricks like [C++ preprocessor conditional parameter](https://stackoverflow.com/q/32179717) to allow 1 line per initializer, like `MAYBE_FPTR(5)` `MAYBE_FPTR(6)`. And BTW, [Proper initialization of static constexpr array in class template?](https://stackoverflow.com/q/14395967) shows how to instantiate static constexpr data with an initializer in the template, but your case might be harder because it depends on the template param – Peter Cordes Sep 27 '20 at 23:55
  • @PeterCordes I tested a switch statement, but it performs noticeably worse just running in a loop, and it will just be more of a burden on the icache once it's actually used in a program. – Noah Sep 28 '20 at 00:39
  • @aschepler Ah, I see. Thank you! – Noah Sep 28 '20 at 00:40
  • @aschepler and Noah: yup, in C++14 and earlier, you need to explicitly instantiate the template, even when it's constexpr initialized. https://gcc.godbolt.org/z/W788o3 has one way to do it, with `constexpr typename bar::foo_wrapper_direct bar::direct_foo_table[NUM_FUNCS];` at global scope. C++17 just instantiates for you if/when the class is used. But in earlier C++, that's why you get an undefined reference when linking if you didn't manually instantiate the template static constexpr member. – Peter Cordes Sep 28 '20 at 00:55
  • However, if you *do* only have one long-lived instance, pointers in that instance might be better for efficiency than a static array, especially in PIE code. Or *especially* in PIC code where static data would indirect through the GOT. Also, I'm not totally surprised that `switch` was worse: in theory it could probably optimize better (like best of both worlds: a tailcall from a table of 8-byte pointers directly to the normal member-function addresses, no extra adjustment of `this` or whatever), but in practice compilers suck. – Peter Cordes Sep 28 '20 at 00:59
  • @PeterCordes The word "instantiate" confused me for a bit. You mean "define". – aschepler Sep 28 '20 at 14:06
  • @aschepler: No, I don't. https://en.cppreference.com/w/cpp/language/class_template talks about explicit vs. implicit instantiation; the process of getting the compiler to emit an instance of the template (an actual asm definition of a function or data object) for a specific template parameter. Static members have to be declared in the template class definition (in the .h), but are normally instantiated in only one `.cpp` source file. (To be fair, I don't 100% understand all the rules, like why I didn't need to say `bar<9>::direct_foo_table[]` to instantiate for that param explicitly. – Peter Cordes Sep 28 '20 at 19:56
  • @PeterCordes But there is no explicit instantiation here. "Explicit instantiation" is the specific syntax which starts with `template` or `extern template` and no template parameter list - [\[temp.explicit\]](https://timsong-cpp.github.io/cppwp/temp.explicit). An ODR-use (before any explicit instantiation or explicit specialization) of a static data member of a class template causes the implicit instantiation of that member, which means that C++ actually requires a definition to exist, leading to a link error if no translation unit has the definition. – aschepler Sep 29 '20 at 12:45
  • @aschepler: ah right, thanks. I did mean instantiate, but apparently explicit wasn't required here, and wasn't what I was doing. To be honest my template knowledge is somewhat rusty. I did kinda realize that not having to ever say `<9>` meant it wasn't the kind of thing I thought I was going to need. I still don't get why C++14 and earlier couldn't treat the definition inside the class itself as enough of a definition to trigger instantiation on use, if explicit instantiation wasn't required. (But I'm not interested enough in template details at the moment to grok the language-design details.) – Peter Cordes Sep 29 '20 at 12:56
  • @PeterCordes I think the big reason is: if the member in the class definition is a definition, it's likely in a header file and likely appears in multiple TUs. So we would need some linker technique to make sure the multiple definitions don't conflict but end up as just one object with one address. C++14 and earlier specify this sort of thing only for inline functions, but C++17 is the first to specify rules for "inline variables". So of course C++17 also states that a static data member declared with `constexpr` is implicitly `inline` to get rid of the issue. – aschepler Sep 29 '20 at 13:11
  • @aschepler: If you had `template constexpr int bar::arr[SIZE]` in one `.cpp` file, how would that TU know which `x` values to instantiate the array for? It can't see into TUs to know which x values were actually used, and array can be distinct between different template parameter values. (Right?) But yes, having the linker discard / merge multiple definitions is clearly necessary for it to work without requiring one explicit instantiation in one TU for each template parameter used. Fortunately we have the technology thanks to inline functions. (And to string literal dup merging) – Peter Cordes Sep 29 '20 at 13:30
  • @aschepler: Or would C++14 and earlier have errored (undefined reference) if the definition was in one TU but the use in another? Or errored on multiple definitions if I repeated the definition in each TU and used the same template param in each. (Sorry I'm being lazy and not trying this; not expecting you to answer if you don't want.) – Peter Cordes Sep 29 '20 at 13:34
  • @PeterCordes In all versions, if the object `bar::arr` is odr-used for some particular value of `x`, then the definition must be instantiated with that same `x`. In C++14, a static data member declaration in the class definition never counts as a definition, even if it is `constexpr` and has an initializer. If a TU which does not contain the definition odr-uses the object, this implicitly instantiates the member's declaration only, and the definition must exist. – aschepler Sep 30 '20 at 22:25
  • ... Which means another TU which does contain the definition must explicitly or implicitly instantiate the same object with the same `x` value, where both the declaration and definition will be instantiated. If no other TU does, the program is ill-formed, no diagnostic required. But a use of a `constexpr` object's value without taking its address or binding a reference to it doesn't count as an odr-use, so it's sometimes fine not to have the definition in C++14. – aschepler Sep 30 '20 at 22:29
  • ... Also, sometimes when binding a function parameter reference to the `constexpr` object makes the program IFNDR, but an optimizer inlines that function, there's no longer any need for the object's address, and so there's no linker error after all. Which can get people noticing a successful release build but link errors on debug build. In C++17, a static data member which is (possibly implicitly) `constexpr` with initializer in the class definition does count as a definition. What C++14 called its definition is now a redeclaration outside the class, allowed for compatibility, but no effect. – aschepler Sep 30 '20 at 22:34
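
The suggestion made in the comments above, a `static` table shared by every object of one `bar<r>` instantiation, might look like the following sketch. The helper `make_foo_table`, the constant `kNumFuncs`, the member name `table`, and the table size of 8 are all hypothetical and not taken from the godbolt code. The out-of-class definition at the end is what C++14 and earlier need in order to avoid the undefined-reference link error discussed above; in C++17 it is redundant but still allowed.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>

// Hypothetical helper: expands to { &T::foo<0>, &T::foo<1>, ... }.
template <typename T, std::size_t... I>
constexpr std::array<std::uint64_t (T::*)(), sizeof...(I)>
make_foo_table(std::index_sequence<I...>) {
    return {{&T::template foo<I>...}};
}

template <std::uint64_t r>
struct bar {
    static constexpr std::size_t kNumFuncs = 8;  // assumed table size

    template <std::uint64_t n>
    std::uint64_t foo() { return r * n; }

    using foo_ptr = std::uint64_t (bar::*)();

    // One table per instantiation bar<r>, not one copy per object.
    static constexpr std::array<foo_ptr, kNumFuncs> table =
        make_foo_table<bar>(std::make_index_sequence<kNumFuncs>{});

    // Caller must pass v < kNumFuncs; no bounds check is done here.
    std::uint64_t foo_direct(std::uint64_t v) {
        return (this->*table[v])();
    }
};

// Required before C++17, where the in-class declaration is not a definition.
template <std::uint64_t r>
constexpr std::array<typename bar<r>::foo_ptr, bar<r>::kNumFuncs> bar<r>::table;

// The object itself no longer carries the table.
static_assert(sizeof(bar<9>) == 1, "bar<9> should be an empty class");

inline void usage_example() {
    bar<9> b;
    assert(b.foo_direct(3) == 27);
}
```

Note that each entry is still a pointer to member function, so this only removes the per-object copy of the table. It is not expected to remove the virtual-or-not check from the call sequence.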

1 Answer


A C++ pointer-to-member-function must be capable of pointing at a non-virtual function or a virtual function. In a typical vtable/vptr implementation, calling a virtual function involves finding the correct code address from the vptr in the object expression and possibly applying an offset to the object parameter address.

g++ uses the Itanium ABI, so the assembly for foo_direct is interpreting the accessed pointer-to-member-function value as described in section 2.3. It finds the code address via the object expression's vptr if the function is virtual, or just copies the code address from the pointer-to-member value if not virtual.
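
As a rough illustration of why the compiler cannot skip that check, here is a minimal sketch in which one call expression has to work for both a non-virtual and a virtual target. The types `Base` and `Derived` and the function `call` are hypothetical names made up for this example. On Itanium ABI targets a pointer to member function is a pair of `ptrdiff_t` values (a pointer-or-vtable-offset field and a `this` adjustment), which is why the table entries in the question are 16 bytes each on x86-64.

```cpp
#include <cassert>
#include <cstdint>

struct Base {
    std::uint64_t plain() { return 1; }         // non-virtual
    virtual std::uint64_t virt() { return 2; }  // virtual
    virtual ~Base() = default;
};

struct Derived : Base {
    std::uint64_t virt() override { return 3; }
};

using MemFn = std::uint64_t (Base::*)();   // pointer to member function
using FreeFn = std::uint64_t (*)(Base*);   // ordinary function pointer

// The same code must handle both kinds of target, so the decision between
// "call this address" and "look the address up in the vtable" is made at
// run time from the value stored in f.
std::uint64_t call(Base& obj, MemFn f) { return (obj.*f)(); }

inline void usage_example() {
    Derived d;
    assert(call(d, &Base::plain) == 1);  // direct code address
    assert(call(d, &Base::virt) == 3);   // dispatched through d's vtable
}
```

An ordinary function pointer such as `FreeFn` carries only a code address, which matches the single indirect `jmp` seen for `foo_indirect`.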

I suppose an optimization might be able to skip the virtual function call logic if it can see that the class type has no virtual functions and is final. I'm not aware if g++ or other compilers have any such optimization, though.

aschepler
  • BTW, I think the `add` is there to support the case where the member function was defined *in a parent class*. In that case, `this` for the parent class instance could be different because of a different vtable. In the OP's use case, those are all `0`. – Peter Cordes Sep 27 '20 at 23:58
  • Possibly a `switch{ case: }` table would be a better way to get the compiler to build a dispatch table, although it might still have 2 levels of jumps (switch and then call) if it doesn't jump directly to the member function pointer. – Peter Cordes Sep 28 '20 at 00:00
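
The `switch` idea from the comments could be sketched as below. The member name `foo_switch`, the four cases, and the `default` fallback value are assumptions made for illustration; the real code would need one case per table entry. Whether a compiler turns this into a jump table, a chain of compares, or something else is up to the optimizer, and the question's author reported that it benchmarked worse in their test.

```cpp
#include <cassert>
#include <cstdint>

template <std::uint64_t r>
struct bar {
    template <std::uint64_t n>
    std::uint64_t foo() { return r * n; }

    // Maps a runtime value to a compile-time template argument.
    std::uint64_t foo_switch(std::uint64_t v) {
        switch (v) {
            case 0: return foo<0>();
            case 1: return foo<1>();
            case 2: return foo<2>();
            case 3: return foo<3>();
            default: return 0;  // hypothetical fallback for out-of-range v
        }
    }
};

inline void usage_example() {
    bar<9> b;
    assert(b.foo_switch(2) == 18);
}
```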