
I have the following loop which calls all function pointers in an array:

```cpp
for (auto f : program) {
    f();
}
```

I'd like to optimize this. So far, I've tried two methods:

  1. Tail recursion
  2. JITting threaded code

Here is the complete test code: https://coliru.stacked-crooked.com/a/d639f024b1222c54

The timing results on my machine (iMac Pro 8-Core) are:

naive: 0.530534
tail recursion: 0.265192
JIT threaded: 0.125106

Of course, the functions all have to be modified to facilitate tail recursion, but that's OK. What would be less pleasant in terms of code cleanliness would be to put everything in one function and use something like computed goto. (I've tried that too, actually; computed goto is only slightly faster than tail recursion on my machine.)

Can I do better than tail recursion without JITting? (on iOS, JITting is not allowed)

Note that the functions cannot be re-ordered.

  • Are you guaranteeing the same memory access pattern for all tests? I imagine all the cache fetching and paging occurred on the first calls to those functions which would dramatically bias the results. Do your timing results change much if you test naive last? Or maybe exclude the first pass of execution from your timing results (because the first pass will include paging misses and CPU cache misses.) – Wyck Jul 06 '19 at 17:17
  • @Wyck good point. On my machine I'm using XCTest, which runs each test 10 times. The first run is slightly slower, but not significantly so. I suspect that's because each test is doing 100m function calls. – Taylor Jul 06 '19 at 17:24

1 Answer


Yes. We can in fact beat the threaded code without JITting.

The test code consists of 100 possible functions. I wrote a little program to generate code for a 100x100 array of functions which call pairs of those 100 functions. The optimizer inlines the original 100 into the pairs. We now have:

naive: 0.534162
tail recursion: 0.269307
JIT threaded: 0.124608
pairs: 0.085922

This technique could be generalized to real-world cases by analyzing common sequences of function calls, rather than generating all possible pairs.

This could be combined with tail recursion for even faster dispatch.

  • That's impressive, even though ultimately useless. Also, using the timings from this page I got: `naive: 1.040995 tail recursion: 0.698340 tail recursion: 0.130868 JIT threaded: 0.870852`. The first tail recursion is yours and is faster than JIT (also, do you measure the JITting itself?). The second tail recursion is simply calling the functions in order. It's way faster, something like 2-3 times faster than your pair method, which suggests to me that sorting the functions' code in memory might be the best idea. So make the functions relocatable, sort them by calling order, and off you go. – Radosław Cybulski Jul 06 '19 at 16:58
  • @RadosławCybulski Timings are from my machine. As the test code implies, the function call order is not known at compile time, which means they cannot be sorted in memory. Furthermore, functions may be called more than once, so no such sorted order exists. – Taylor Jul 06 '19 at 17:14
  • Of course you can sort code. If the function code is relocatable, then you can simply move it in memory on the same principle as you created your jump calls (essentially, instead of writing jump calls you write whole function bodies). You can even write a function body more than once (duplicate them), since they have to be extremely short (otherwise their running time will trump any call-time penalty). – Radosław Cybulski Jul 06 '19 at 17:24
  • @RadosławCybulski Sure, ok, a more sophisticated jitter. Might as well just use LLVM at that point. The main point here is to achieve faster dispatch without having to JIT. I'll try to make that more clear in the question. – Taylor Jul 06 '19 at 17:32
  • @RadosławCybulski How would you determine the end of the function when copying? There could be data after the epilogue, right? – Taylor Jul 06 '19 at 17:42
  • Well, jump is the fastest dispatch. Any further improvements come from caching (or cache misses). It really depends, what you're trying to do. How often order changes. Can you precache something (like function length, or some other data), and so on. – Radosław Cybulski Jul 06 '19 at 17:47
  • @RadosławCybulski Apparently dynamic relocation of a function is not simple: https://stackoverflow.com/questions/2313048/dynamic-relocation-of-code-section – Taylor Jul 06 '19 at 18:01