
I have performance-critical code written for multiple CPUs. I detect the CPU at run-time and, based on that, use the appropriate function for the detected CPU. So now I have to use function pointers and call the functions through them:

void do_something_neon(void);
void do_something_armv6(void);

void (*do_something)(void);

if (cpu == NEON) {
    do_something = do_something_neon;
} else {
    do_something = do_something_armv6;
}

// Use the function pointer:
do_something();
...

Not that it matters much, but I'll mention that I have optimized functions for different CPUs: ARMv6 and ARMv7 with NEON support. The problem is that by using function pointers in many places the code becomes slower, and I'd like to avoid that.

Basically, at load time the dynamic linker resolves relocations and patches the code with the function addresses. Is there a way to control that behavior better?

Personally, I'd propose two different ways to avoid function pointers: create two separate .so files (or DLLs) for the CPU-dependent functions, place them in different folders, and based on the detected CPU add one of these folders to the search path (or LD_LIBRARY_PATH). Then load the main code, and the dynamic linker will pick up the required library from the search path. The other way is to compile two separate copies of the library :) The drawback of the first method is that it forces me to have at least three shared objects (DLLs): two for the CPU-dependent functions and one for the main code that uses them. I need three because I have to be able to do the CPU detection before loading the code that uses these CPU-dependent functions. The good part about the first method is that the app won't need to load multiple copies of the same code for multiple CPUs; it will load only the copy that will actually be used. The drawback of the second method is quite obvious, no need to talk about it.
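
To make the first approach concrete, here is a minimal sketch of the three-shared-object layout (a tiny launcher, two CPU-specific libraries, and the main library), using dlopen() directly instead of manipulating the search path. All file names, symbol names and the detection stub are placeholders made up for illustration, not anything from the question:

/* launcher.c -- sketch only; link with -ldl */
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

/* Placeholder: real detection would read AT_HWCAP or /proc/cpuinfo. */
static int cpu_has_neon(void) { return 0; }

int main(void)
{
    /* Load exactly one CPU-specific implementation into the global
       symbol scope. */
    const char *impl = cpu_has_neon() ? "./libimpl_neon.so" : "./libimpl_armv6.so";
    if (!dlopen(impl, RTLD_NOW | RTLD_GLOBAL)) {
        fprintf(stderr, "loading %s failed: %s\n", impl, dlerror());
        return EXIT_FAILURE;
    }

    /* libmain.so holds the performance-critical code and simply calls
       do_something() as an ordinary undefined symbol; it resolves
       against whichever implementation was loaded above. */
    void *main_lib = dlopen("./libmain.so", RTLD_NOW);
    if (!main_lib) {
        fprintf(stderr, "loading libmain.so failed: %s\n", dlerror());
        return EXIT_FAILURE;
    }

    void (*run)(void) = (void (*)(void))dlsym(main_lib, "run");
    if (!run) {
        fprintf(stderr, "dlsym: %s\n", dlerror());
        return EXIT_FAILURE;
    }
    run();
    return EXIT_SUCCESS;
}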

I'd like to know if there is a way to do that without using shared objects and loading them manually at runtime. One way would be some hackery that patches the code at run-time, but it's probably too complicated to get done properly. Is there a better way to control relocations at load time? Maybe place the CPU-dependent functions in different sections and then somehow specify which section has priority? I think the Mach-O format on the Mac has something like that.

An ELF-only solution (for an ARM target) is enough for me; I don't really care about PE (DLLs).

thanks

  • You do realize that calling a function in a DLL requires, in the best case, *two* direct jumps, whereas using pointers requires only *one* indirect jump? Also, if mature, performance-savvy projects like x264 use function pointers throughout their codebase for CPU-specific code, I'd say they have good reasons to do so. – CAFxX Mar 11 '12 at 00:55
  • You are wrong. You need to know how to write DLLs or shared objects properly to make sure that there are no extra jumps. I wrote PE loaders and I know how it works from the inside. Hint: __declspec(dllimport)/__declspec(dllexport) removes that jump. Here's a good source of info: [dsohowto.pdf](http://www.akkadia.org/drepper/dsohowto.pdf) – Pavel P Mar 11 '12 at 01:51
  • Regarding x264: I'm quite familiar with their codebase. Most likely you could get a 2-3% boost (if not more) if these pointers were avoided. It's a flexibility/speed tradeoff. For Intel there are a lot of duplicate functions for all kinds of SSE/SSE3 etc., not as many as for the ARM optimizations. – Pavel P Mar 11 '12 at 01:56
  • Look at what glibc does to select the best memcpy for your arch at runtime (on x86). Maybe that is good enough for your needs. – Z.T. Mar 11 '12 at 06:17
  • @Z.T. that could be what I'm looking for. Where can I read about that? Searching for memcpy or glibc returns tons of unrelated info – Pavel P Mar 11 '12 at 19:01
  • I just ran a trivial one-liner in gdb. On the first call to `memcpy@PLT` (the dynamic linker calls its statically linked memcpy before that), you get into the dynamic linker. There, through `_dl_runtime_resolve` and `_dl_fixup`, you get into `memcpy@@GLIBC_2.14`, which is this: http://repo.or.cz/w/glibc.git/blob/HEAD:/sysdeps/x86_64/multiarch/mempcpy.S#l27; it returns the right function pointer for your machine, which is then called. Next time the PLT points to `__memcpy_ssse3`. – Z.T. Mar 11 '12 at 20:55

4 Answers


You may want to look up the GNU dynamic linker extension STT_GNU_IFUNC. From Drepper's blog when it was added:

Therefore I’ve designed an ELF extension which allows to make the decision about which implementation to use once per process run. It is implemented using a new ELF symbol type (STT_GNU_IFUNC). Whenever a symbol lookup resolves to a symbol with this type the dynamic linker does not immediately return the found value. Instead it is interpreting the value as a function pointer to a function that takes no argument and returns the real function pointer to use. The code called can be under control of the implementer and can choose, based on whatever information the implementer wants to use, which of the two or more implementations to use.

Source: http://udrepper.livejournal.com/20948.html

Nonetheless, as others have said, I think you're mistaken about the performance impact of indirect calls. All code in shared libraries will be called via a (hidden) function pointer in the GOT and a PLT entry that loads/calls that function pointer.

R.. GitHub STOP HELPING ICE
  • If it's not PIC then I can't see a reason why it can't be. Maybe on Windows it can and it can't be with ELF? – Pavel P Mar 12 '12 at 04:07
  • What? Calls from the main program to a shared library always go through the PLT. Otherwise the main program would have textrels. Actually there might be a way to modify the linker scripts and get it to generate textrels in the main program in place of PLT entries, which could achieve what you want. Combined with STT_GNU_IFUNC it would be exactly what you want. – R.. GitHub STOP HELPING ICE Mar 12 '12 at 04:11
  • On Windows, calls into a DLL might go directly, without extra levels of indirection. I assume that textrels are the type of relocs the linker writes into the executable, right? – Pavel P Mar 12 '12 at 09:53
  • Textrel means a runtime modification to the text (code) section of the executable. – R.. GitHub STOP HELPING ICE Mar 12 '12 at 12:42
  • R.. is right. GCC's IFUNC goes through a hidden pointer in a PLT. A Windows DLL call also goes through an import table. The only way to avoid the extra overhead is to do the CPU dispatching outside the critical inner loop. In other words: identify the critical inner loop, make one copy of this loop for each CPU version, and do the CPU dispatching at the higher level outside this loop. A DLL or .so generally has more overhead and less efficient caching than static linking. – A Fog Jun 02 '12 at 06:03

For the best performance you need to minimize the number of indirect calls (through pointers) per second and allow the compiler to optimize your code better (DLLs hamper this because there must be a clear boundary between a DLL and the main executable and there's no optimization across this boundary).

I'd suggest doing these:

  1. moving as much as possible of the main executable's code that frequently calls DLL functions into the DLL. That'll minimize the number of indirect calls per second and allow for better optimization at compile time too (see the sketch after this list).
  2. moving almost all your code into separate CPU-specific DLLs and leaving to main() only the job of loading the proper DLL, OR making CPU-specific executables without DLLs.
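
As an illustration of point 1 (and of the comment below about dispatching outside the critical inner loop), here is a minimal sketch; the names and the trivial "scale" kernel are invented, and real implementations would be the ARMv6/NEON builds of the routine. The point is to pay for the indirect call once per block rather than once per element:

#include <stdio.h>
#include <stddef.h>

/* Stand-ins for the ARMv6 and NEON builds of the hot loop. */
static void scale_block_armv6(float *dst, const float *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * 2.0f;
}

static void scale_block_neon(float *dst, const float *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * 2.0f;
}

/* One pointer, consulted once per block instead of once per sample;
   the whole loop lives inside the CPU-specific function, where the
   compiler can optimize freely. */
static void (*scale_block)(float *, const float *, size_t) = scale_block_armv6;

static int cpu_has_neon(void) { return 0; }  /* placeholder detection */

int main(void)
{
    if (cpu_has_neon())
        scale_block = scale_block_neon;

    float in[1024], out[1024];
    for (size_t i = 0; i < 1024; i++)
        in[i] = (float)i;

    scale_block(out, in, 1024);  /* a single indirect call */
    printf("%f\n", out[3]);
    return 0;
}
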
Alexey Frunze
  • Alex, I know how to minimize the impact of function pointers. But I asked whether it was possible to avoid them. In short, if it's possible, it should involve some sort of low-level interaction with the run-time loader. I'm pretty sure it's impossible with Windows DLLs (unless you write your own DLL loader). The ELF ld is more flexible, and I'm not sure whether such a thing is possible one way or another with ELF shared objects. – Pavel P Mar 11 '12 at 03:01
  • Normally, if your program is split into parts (executable + dynamically loaded library), once it's compiled and linked there cannot be any further compile-time or link-time or run-time optimizations anymore (unless, of course, we're talking about something like a just-in-time compiler with a run-time optimizer; see [this question](http://stackoverflow.com/q/780349/968261), it mentions an interesting project, [HP's Dynamo](http://www.hpl.hp.com/techreports/1999/HPL-1999-78.html)). – Alexey Frunze Mar 11 '12 at 08:57
  • No, it's not about JIT. Actually, I tried to look inside Google's V8 (I care about ARM only) and I don't think it's usable outside their code: it has no docs, no examples, and I wasn't able to do anything with it at all. I think Z.T. posted what I was looking for ("what glibc does to select the best memcpy for your arch at runtime"), as long as it doesn't simply set a pointer at runtime. One other possible solution would be to modify the code (overwrite it with a better version at runtime), but that would work only for leaf functions. – Pavel P Mar 11 '12 at 18:58
  • I'm pretty sure glibc uses either ifs or function pointers (exactly what you showed in your question) without any hacks. – Alexey Frunze Mar 11 '12 at 19:48

Here's the exact answer that I was looking for.

GCC's `__attribute__((ifunc("resolver")))`

It requires fairly recent binutils.
There's a good article that describes this extension: Gnu support for CPU dispatching - sort of...
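
For reference, here is a minimal sketch of how the attribute is used (not code from the original answer); the function names and the detection stub are placeholders (on Linux/ARM one would typically check the NEON bit reported via AT_HWCAP), and it requires a GCC/binutils/glibc combination with STT_GNU_IFUNC support:

#include <stdio.h>

static void do_something_neon(void)  { puts("NEON path"); }
static void do_something_armv6(void) { puts("ARMv6 path"); }

static int cpu_has_neon(void) { return 0; }  /* placeholder detection */

/* The resolver runs once, while the dynamic linker relocates this
   object; whatever it returns becomes the target of do_something(). */
static void (*resolve_do_something(void))(void)
{
    return cpu_has_neon() ? do_something_neon : do_something_armv6;
}

void do_something(void) __attribute__((ifunc("resolve_do_something")));

int main(void)
{
    do_something();  /* an ordinary-looking call; the selection already
                        happened at load time */
    return 0;
}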

Pavel P

Lazy loading ELF symbols from shared libraries is described in section 1.5.5 of Ulrich Drepper's DSO How To (updated 2011-12-10). For ARM it is described in section 3.1.3 of ELF for ARM.

EDIT: Regarding the STT_GNU_IFUNC extension mentioned by R.: I forgot that it was an extension. GNU Binutils supports it for ARM, apparently since March 2011, according to the changelog.

If you want to call functions without the indirection of the PLT, I suggest function pointers or per-arch shared libraries inside which function calls don't go through PLTs (beware: calling an exported function goes through the PLT).
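
A small sketch of that parenthetical point, assuming GCC and its symbol visibility attributes (all names are illustrative): keep the hot code non-exported inside the per-arch library, so calls to it are plain direct branches, and export only the entry point.

/* impl.c -- build e.g.: gcc -O2 -fPIC -shared -fvisibility=hidden impl.c -o libimpl_neon.so */

__attribute__((visibility("hidden")))    /* not exported: no PLT, no interposition */
void hot_inner_loop(float *dst, const float *src, unsigned n)
{
    for (unsigned i = 0; i < n; i++)
        dst[i] = src[i] * 2.0f;          /* stand-in for the real kernel */
}

__attribute__((visibility("default")))   /* the one symbol callers see */
void process(float *dst, const float *src, unsigned n)
{
    hot_inner_loop(dst, src, n);         /* resolved at link time to a direct call */
}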

I wouldn't patch the code at runtime. I mean, you can. You can add a build step: after compilation, disassemble your binaries, find all offsets of calls to functions that have multi-arch alternatives, build a table of patch locations, and link that into your code. In main, remap the text segment writable, patch the offsets according to the table you prepared, map it back to read-only, flush the instruction cache, and proceed. I'm sure it will work. But how much performance do you expect to gain from this approach? I think loading different shared libraries at runtime is easier, and function pointers are easier still.

Z.T.
  • I also mentioned that dsohowto.pdf in one of the first comments on this question. Some time ago I wanted to write a simple ELF loader, and that's where I got the idea of resolving functions at runtime. I simply couldn't remember how it could be done. – Pavel P Mar 12 '12 at 04:13