
I am working on a Monte Carlo simulation code, with roughly one .cpp file per class. During execution, a checkOverlap function is called for each pair of neighbouring particles, so it is quite important for the execution time. I noticed a huge difference in execution time between commenting that function out and returning from it immediately. Investigating further, I found that the evaluation speed of a class method depends on the class it is called from.

Currently I have a class layout similar to this pseudo-code:

class CollisionLogic {
public:
    bool testCall() {
        return false;
    }
    bool checkOverlap(const particle& particle1, const particle& particle2) {
        // do something...
        return value;
    }
};

class InformationContainer {
public:
    bool testCall() {
        return false;
    }
    CollisionLogic CL;
};

In main, the code does the following:

InformationContainer IC;

bool checkCollisionOn( particle P )
{
    for( /* each neighbouring particle P_n */ )
    {
        if( IC.CL.checkOverlap(P, P_n) )
            return true;

        /* Marker */
    }
    return false;
}

For testing purposes, checkOverlap returns false as its first statement. The evaluation of a frame then takes 953 ms. If I replace the marker comment with IC.CL.testCall(), the time increases to 1232 ms; if I replace it with IC.testCall(), it increases even further, to 1305 ms.

As both functions do exactly the same thing, I assume I can rule out CPU computation time. So my questions are: what causes this to happen, and what do I need to do to stop it?

Thank you!

Q/A Compiling:
I compile every source file into an object file with the flags '-O3 -std=gnu++11' and link them together afterwards with the same flags.

PS: I have found several explanations of different C++ speed issues [1,2,3], but none like this one.

[1] Speed of C program execution
[2] Why are elementwise additions much faster in separate loops than in a combined loop?
[3] c++ speed comparison iterator vs index


1 Answer


Code Cache Locality

This answer assumes that your program contains more code than just what you posted, and that processing a frame executes enough code to evict some cache lines from at least the first-level instruction cache of your processor.

In this case I would assume that the difference in execution time comes down to instruction cache locality. Fetching code from the first-level cache is very fast, with higher-level caches getting progressively slower but also bigger and therefore less likely to evict a cache line. Accessing main memory takes hundreds of cycles on a modern desktop processor, because processors have become much faster while memory latency has decreased only very slowly.

Right before the call at the marker, you called CollisionLogic::checkOverlap, whose code may lie closer in memory to CollisionLogic::testCall than to InformationContainer::testCall. Maybe it is even on the same cache line, or on a cache line that the processor fetched speculatively (the logic controlling the caches is very intricate on high-performance processors). The processor can therefore fetch the instructions of CollisionLogic::testCall quickly, while InformationContainer::testCall is more likely to cause a cache miss.

You can check the instruction addresses of the different methods in a debugger to test this hypothesis.
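For example, with GDB or nm (a sketch; "simulation" stands in for your actual binary name, and the methods must not have been inlined away or their symbols will not appear):

$ gdb ./simulation
(gdb) info address CollisionLogic::checkOverlap
(gdb) info address CollisionLogic::testCall
(gdb) info address InformationContainer::testCall

# Or, without a debugger, list the demangled symbol addresses:
$ nm -C simulation | grep testCall

If CollisionLogic::testCall lies within a cache line or two (typically 64 bytes each) of checkOverlap, while InformationContainer::testCall lies much further away, that supports the hypothesis.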

Especially when CollisionLogic and InformationContainer are defined in different source files, their methods may end up far apart in the binary.

Improving Code Cache Locality

I am not aware of portable and maintainable ways to keep functions close to each other in the code section of a binary.

Inlining

One option to solve this problem, which I have used for small, very "hot" methods and functions, is to put their definitions in the header and let them be inlined into the calling functions (unless the compiler decides not to inline anyway, which it may do for various reasons, such as the function being too large). This not only gets rid of the call overhead but also guarantees cache locality for that particular call.
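A minimal sketch of that approach, reusing the names from your pseudo-code (the header names and the particle type are assumptions):

// CollisionLogic.h
#ifndef COLLISION_LOGIC_H
#define COLLISION_LOGIC_H

#include "particle.h" // assumed header defining your particle type

class CollisionLogic {
public:
    // Defined inside the class body, the method is implicitly
    // inline: the compiler may paste its instructions directly into
    // each call site, removing both the call overhead and the
    // instruction-fetch locality issue for this call.
    bool checkOverlap(const particle& particle1, const particle& particle2) {
        // ... actual overlap test ...
        return false;
    }
};

#endif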

But you have to measure whether this improves actual performance. Too much inlining increases overall code size and can cause the opposite effect: a "working set" of code that is frequently executed together may no longer fit in a given cache level, causing more cache misses overall.

Influencing Code Location in the Binary

I have also seen an old high-performance project whose original authors made sure that commonly executed code lived in the same .cpp files. The program ran very fast. The downside was that the code was very hard to read and maintain, and performance regressed easily whenever unrelated parts of the program were changed. I do not recommend such an approach if you have to trade off maintainability for speed.

If you are willing to learn about linker scripts, you could write one that places specific, commonly executed functions next to each other (you may have to assign them to specific sections with compiler-specific attributes), as sketched below.
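A sketch of the attribute part (GCC-specific and untested against your build; the section name is made up for illustration):

// In CollisionLogic.cpp: place the definition in a named section, so
// a custom linker script can group ".text.hot_path" contents together.
__attribute__((section(".text.hot_path")))
bool CollisionLogic::checkOverlap(const particle& particle1, const particle& particle2)
{
    // ... actual overlap test ...
    return false;
}

// Alternatively, GCC's "hot" attribute optimizes the function more
// aggressively and, on many targets, groups it with other hot
// functions in a .text.hot subsection, with no linker script needed.
__attribute__((hot))
bool InformationContainer::testCall()
{
    return false;
}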

Regardless of the method you choose, you will have to determine which functions actually contribute to the execution time (by profiling and by benchmarking experimental changes, as you did), look at the call graph and call counts, and do plenty of benchmarking to verify the actual performance impact of your changes.
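On Linux, perf is one way to confirm whether instruction fetch is really the bottleneck (a sketch; the available event names depend on your CPU):

# Count instruction-cache misses over a whole run:
$ perf stat -e instructions,L1-icache-load-misses ./simulation

# Per-function breakdown of where the cycles are spent:
$ perf record ./simulation
$ perf report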
