Cache-Friendliness Is Close to Godliness
Knowing nothing else about how the update
methods of individual Controllers behave, I think the most important factor in performance would be cache-friendliness.
Considering cache effectiveness, the only difference between the two loops is that m_objects
are laid out contiguously (because they are contained in the vector) and they are accessed linearly in memory (because the loop is in order) but m_controllers
are only pointed to here and they can be anywhere in memory and moreover, they can be of different types with different update()
methods that themselves can reside anywhere. Therefore, while looping over them we would be jumping around in memory.
In respect to cache, the two loops would behave like this: (things are never simple and straightforward when you are concerned about performance, so bear with me!)
- Loop A: The inner loop runs efficiently (unless the objects are large - hundreds or thousands of bytes - or they store their data outside themselves, e.g.,
std::string
) because the cache access pattern is predictable and the CPU will prefetch consecutive cachelines so there won't be much stalling on reading memory for the objects. However, if the size of the vector of objects is larger than the size of the L2 (or L3) cache, each iteration of the outer loop will require reloading of the entire cache. But again, that cache reloading will be efficient!
- Loop B: If indeed the controllers have many different types of
update()
methods, the inner loop here may cause wild jumping around in memory, but all these different update functions will be working on data that is cached and available (specially if objects are large or they themselves contain pointers to data scattered in memory.) Unless the update()
methods access so much memory themselves (because, e.g., their code is huge or they require large amount of their own - i.e. controller - data) that they thrash the cache on each invocation; in which case all bets are off anyways.
So, I suggest the following strategy generally, which requires information that you probably don't have:
- If objects are small (or smallish!) and POD-like (don't contain pointers themselves) definitely prefer loop A.
- If objects are large and/or complex, or if there are many many different types of complex controllers (hundreds or thousands of different
update()
methods) prefer loop B.
- If objects are large and/or complex, and there are so very many of them that iterating over them will thrash the cache many times (millions of objects), and the
update()
methods are many and they are very large and complex and require a lot of other data, then I'd say the order of your loop doesn't make any difference and you need to consider redesigning objects and controllers.
Sorting the Code
If you can, it may be beneficial to sort the controllers based on their type! You can use some internal mechanism in Controller
or something like typeid()
or another technique to sort the controllers based on their type, so the behavior of consecutive update()
passes become more regular and predictable and nice.
This is a good idea regardless of which loop order you choose to implement, but it will have much more effect in loop B.
However, if you have so much variation among controllers (i.e. if practically all are unique) this won't help much. Also, obviously, if you need to preserve the order that controllers are applied, you won't be able to do this.
Adaptation and Improvisation
It should not be hard to implement both loop strategies and select between them at compile-time (or even runtime) based on either user hint or based on information available at compile time (e.g. size of T
or some traits of T
; if T is small and/or a POD, you probably should use loop A.)
You can even do this at runtime, basing your decision on the number of objects and controllers and anything else you can find out about them.
But, these kinds of "Klever" tricks can get you into trouble as the behavior of your container will depend on weird, opaque and even surprising heuristics and hacks. Also, they might and will even hurt performance in some cases, because there are many other factors meddling in performance of these two loops, including but not limited to the nature of the data and the code in objects and controllers, the exact sizes and configurations of cache levels and their relative speeds, the architecture of CPU and the exact way it handles prefetching, branch prediction, cache misses, etc., the code that the compiler generates, and much more.
If you want to use this technique (implementing both loops and switching between them are compile- and/or run-time) I highly suggest that you let the user do the choosing. You can accept a hint about which update strategy to use, either as a template parameter or a constructor argument. You can even have two update functions (e.g. updateByController()
and updateByObject()
) that the user can call at will.
On Branch Prediction
The only interesting branch here is the virtual update
call, and as an indirect call through two pointers (the pointer to the controller instance and then the pointer to its vtable) it is quite hard to predict. However, sorting controllers based on type will help immensely with this.
Also remember that a mispredicted branch will cause a stall of a few to a few dozen CPU cycles, but for a cache miss, the stall will be in the hundreds of cycles. Of course, a mispredicted branch can cause a cache miss too, so... As I said before, nothing is simple and straightforward when it comes to performance!
In any case, I think cache friendliness is by far the most important factor in performance here.