
I'm writing some generic code which basically will have a vector of objects being updated by a set of controllers.

The code is a bit complex in my specific context but a simplification would be:

template< class T >
class Controller
{ 
public:
    virtual ~Controller(){}
    virtual void update( T& ) = 0;
    // and potentially other functions used in other cases than update
};

template< class T >
class Group
{
public:
    typedef std::shared_ptr< Controller<T> > ControllerPtr;

    void add_controller( ControllerPtr );    // register a controller
    void remove_controller( ControllerPtr ); // remove a controller

    void update(); // update all objects using controllers

private:

    std::vector< T > m_objects;
    std::vector< ControllerPtr > m_controllers;
};

I intentionally didn't use std::function because I can't use it in my specific case. I also intentionally use shared pointers instead of raw pointers; this is not actually important for my question.

Anyway, it's the update() implementation that interests me here. I can do it in two ways.

A) For each controller, update all objects.

template< class T >
void Group<T>::update()
{
    for( auto& controller : m_controllers )
        for( auto& object : m_objects )
            controller->update( object );
}

B) For each object, update by applying all controllers.

template< class T >
void Group<T>::update()
{
    for( auto& object : m_objects )
        for( auto& controller : m_controllers )
            controller->update( object );
}

"Measure! Measure! Measure!" you will say and I fully agree, but I can't measure what I don't use. The problem is that it's generic code. I don't know the size of T, I just assume it will not be gigantic, maybe small, maybe still a bit big. Really I can't assume much about T other than it is designed to be contained in a vector. I also don't know how many controllers or T instances will be used. In my current use cases, there would be widely different counts.

The question is: which solution would be the most efficient in general?

I'm thinking about cache coherency here. Also, I assume this code would be used on different compilers and platforms.

My gut tells me that updating the instruction cache is certainly faster than updating the data cache, which would make solution B) the more efficient in general. However, I've learned not to trust my gut when I have doubts about performance, so I'm asking here.

The solution I'm heading towards would allow the user to choose (via a compile-time policy) which update implementation to use with each Group instance, but I want to provide a default policy and I can't decide which one would be the most efficient in most cases.
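
For illustration, here is a rough sketch of what I have in mind; the policy names are just placeholders, and it assumes the Controller template from above plus the usual <vector> and <memory> includes:

struct UpdatePerController // loop A: for each controller, update all objects
{
    template< class T >
    static void apply( std::vector< T >& objects,
                       std::vector< std::shared_ptr< Controller<T> > >& controllers )
    {
        for( auto& controller : controllers )
            for( auto& object : objects )
                controller->update( object );
    }
};

struct UpdatePerObject // loop B: for each object, apply all controllers
{
    template< class T >
    static void apply( std::vector< T >& objects,
                       std::vector< std::shared_ptr< Controller<T> > >& controllers )
    {
        for( auto& object : objects )
            for( auto& controller : controllers )
                controller->update( object );
    }
};

template< class T, class UpdatePolicy = UpdatePerController > // which default?
class Group
{
public:
    // ... same interface as above ...

    void update() { UpdatePolicy::apply( m_objects, m_controllers ); }

private:
    std::vector< T > m_objects;
    std::vector< std::shared_ptr< Controller<T> > > m_controllers;
};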

Klaim
    You could create dummy types for checking: vary the size of the type (from the size of a primitive to as large as you think is relevant) as well as the size of the corresponding update function, and take measurements with those. – EHuhtala Apr 05 '13 at 22:18

2 Answers


We have living proof that modern compilers (Intel C++ in particular) are able to swap loops, so it shouldn't really matter for you.

I remember it from the great @Mysticial's answer:

Intel Compiler 11 does something miraculous. It interchanges the two loops, thereby hoisting the unpredictable branch to the outer loop. So not only is it immune to the mispredictions, it is also twice as fast as whatever VC++ and GCC can generate!

Wikipedia article about the topic

Detecting whether loop interchange can be done requires checking if the swapped code will really produce the same results. In theory it could be possible to prepare classes that won't allow for the swap, but then again, it could be possible to prepare classes that would benefit from either version more.
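
For instance, here is a contrived sketch (my own made-up types, purely for illustration) of a controller pair that would forbid the swap, because one controller reads state that the other accumulates across objects:

struct Item { double value; };

// Accumulates the total of the values of all objects it has visited so far.
class Sum : public Controller<Item>
{
public:
    double total = 0.0;
    void update( Item& item ) override { total += item.value; }
};

// Scales each object by whatever total Sum has accumulated at that moment.
class Scale : public Controller<Item>
{
public:
    explicit Scale( const Sum& sum ) : m_sum( sum ) {}
    void update( Item& item ) override { item.value *= m_sum.total; }
private:
    const Sum& m_sum;
};

// With objects {1, 2} and controllers {Sum, Scale}:
//   loop A (controllers outer) ends with {3, 6} - Scale sees the full total (3) for every object
//   loop B (objects outer)     ends with {1, 6} - Scale sees only the partial total so far
// The results differ, so the two loops cannot legally be interchanged here.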

Bartek Banachewicz

Cache-Friendliness Is Close to Godliness

Knowing nothing else about how the update methods of individual Controllers behave, I think the most important factor in performance would be cache-friendliness.

Considering cache effectiveness, the only difference between the two loops is this: m_objects are laid out contiguously (they are contained in the vector) and are accessed linearly in memory (the loop iterates in order), whereas the m_controllers are only pointed to here, they can be anywhere in memory, and moreover they can be of different types with different update() methods that themselves can reside anywhere. Therefore, while looping over them, we would be jumping around in memory.

With respect to the cache, the two loops would behave like this (things are never simple and straightforward when you are concerned about performance, so bear with me!):

  • Loop A: The inner loop runs efficiently (unless the objects are large - hundreds or thousands of bytes - or they store their data outside themselves, e.g., std::string) because the cache access pattern is predictable and the CPU will prefetch consecutive cache lines, so there won't be much stalling on reads for the objects. However, if the vector of objects is larger than the L2 (or L3) cache, each iteration of the outer loop will require the entire cache to be reloaded. But again, that cache reloading will be efficient!
  • Loop B: If the controllers really do have many different types of update() methods, the inner loop here may cause wild jumping around in memory, but all these different update functions will be working on data that is already cached and available (especially if the objects are large or themselves contain pointers to data scattered in memory). That is, unless the update() methods access so much memory themselves (because, e.g., their code is huge or they require a large amount of their own, i.e. controller, data) that they thrash the cache on each invocation; in which case all bets are off anyway.

So, I suggest the following strategy generally, which requires information that you probably don't have:

  • If objects are small (or smallish!) and POD-like (they don't contain pointers themselves), definitely prefer loop A.
  • If objects are large and/or complex, or if there are very many different types of complex controllers (hundreds or thousands of different update() methods), prefer loop B.
  • If objects are large and/or complex, there are so many of them that iterating over them will thrash the cache many times (millions of objects), and the update() methods are numerous, very large and complex, and require a lot of other data, then I'd say the order of your loops doesn't make any difference and you need to consider redesigning your objects and controllers.

Sorting the Code

If you can, it may be beneficial to sort the controllers based on their type! You can use some internal mechanism in Controller, something like typeid(), or another technique to sort the controllers by type, so that the behavior of consecutive update() passes becomes more regular, predictable and nice.

This is a good idea regardless of which loop order you choose to implement, but it will have much more effect in loop B.

However, if there is so much variation among the controllers (i.e. practically all of them are unique), this won't help much. Also, obviously, if you need to preserve the order in which the controllers are applied, you won't be able to do this.
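
If reordering is allowed, a minimal sketch of such a sort could look like this (written as a free helper function just to illustrate; it is not part of the original interface):

#include <algorithm>
#include <typeindex>

template< class T >
void sort_controllers_by_type( std::vector< std::shared_ptr< Controller<T> > >& controllers )
{
    // Put controllers of the same dynamic type next to each other, so that
    // consecutive update() calls go through the same vtable entry and the same code.
    std::sort( controllers.begin(), controllers.end(),
        []( const std::shared_ptr< Controller<T> >& a,
            const std::shared_ptr< Controller<T> >& b )
        {
            return std::type_index( typeid( *a ) ) < std::type_index( typeid( *b ) );
        } );
}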

Adaptation and Improvisation

It should not be hard to implement both loop strategies and select between them at compile time (or even at run time), based either on a user hint or on information available at compile time (e.g. the size of T or some traits of T; if T is small and/or a POD, you probably should use loop A).
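
As a rough sketch of that compile-time choice (the POD test and the 64-byte threshold are arbitrary placeholders, not a recommendation):

#include <type_traits>

struct by_controller_tag {}; // loop A: controllers in the outer loop
struct by_object_tag {};     // loop B: objects in the outer loop

// Hypothetical default: small POD-like objects get loop A, everything else loop B.
template< class T >
struct default_update_order
{
    typedef typename std::conditional<
        std::is_pod< T >::value && ( sizeof( T ) <= 64 ),
        by_controller_tag,
        by_object_tag
    >::type type;
};

// Group<T>::update() could then tag-dispatch on default_update_order<T>::type
// to two private overloads containing loop A and loop B respectively.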

You can even do this at runtime, basing your decision on the number of objects and controllers and anything else you can find out about them.

But these kinds of "Klever" tricks can get you into trouble, as the behavior of your container will come to depend on weird, opaque and even surprising heuristics and hacks. Also, they might, and sometimes will, hurt performance, because there are many other factors meddling in the performance of these two loops, including but not limited to the nature of the data and the code in the objects and controllers, the exact sizes and configurations of the cache levels and their relative speeds, the architecture of the CPU and the exact way it handles prefetching, branch prediction and cache misses, the code that the compiler generates, and much more.

If you want to use this technique (implementing both loops and switching between them at compile and/or run time), I highly suggest that you let the user do the choosing. You can accept a hint about which update strategy to use, either as a template parameter or as a constructor argument. You can even have two update functions (e.g. updateByController() and updateByObject()) that the user can call at will.
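
For example, the run-time hint could be as simple as a constructor argument (the names here are again just placeholders):

enum class UpdateOrder { by_controller, by_object }; // loop A / loop B

template< class T >
class Group
{
public:
    explicit Group( UpdateOrder order = UpdateOrder::by_controller )
        : m_order( order ) {}

    void update()
    {
        if( m_order == UpdateOrder::by_controller )
            updateByController(); // loop A
        else
            updateByObject();     // loop B
    }

    void updateByController(); // body is loop A from the question
    void updateByObject();     // body is loop B from the question

private:
    UpdateOrder m_order;
    // ... m_objects and m_controllers as in the question ...
};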

On Branch Prediction

The only interesting branch here is the virtual update call, and as an indirect call through two pointers (the pointer to the controller instance and then the pointer to its vtable) it is quite hard to predict. However, sorting controllers based on type will help immensely with this.

Also remember that a mispredicted branch will cause a stall of a few to a few dozen CPU cycles, but for a cache miss, the stall will be in the hundreds of cycles. Of course, a mispredicted branch can cause a cache miss too, so... As I said before, nothing is simple and straightforward when it comes to performance!

In any case, I think cache friendliness is by far the most important factor in performance here.

yzt