12

Good Day,

Suppose that you have a simple for loop like below...

for (int i = 0; i < 10; i++)
{
    // statement 1
    // statement 2
}

Assume that statement 1 and statement 2 were O(1). Besides the small overhead of "starting" another loop, would breaking down that for loop into two (not nested, but sequential) loops be as equally fast? For example...

for (int i = 0; i < 10; i++)
{
    // statement 1
}
for (int i = 0; i < 10; i++)
{
    // statement 2
}

The reason I ask such a silly question is that I have a Collision Detection System (CDS) that has to loop through all the objects. I want to "compartmentalize" the functionality of my CDS so I can simply call

cds.update(objectlist);

instead of having to break my CDS up. (Don't worry too much about my CDS implementation... I think I know what I am doing, I just don't know how to explain it.) What I really need to know is whether I take a huge performance hit by looping through all my objects again.

Matthew

6 Answers

5

It depends on your application.

Possible Drawbacks (of splitting):

  • your data does not fit into the L1 data cache, so it is loaded for the first loop and then reloaded for the second loop

Possible Gains (of splitting):

  • your loop contains many variables, so splitting helps reduce register/stack pressure and the optimizer turns it into better machine code
  • the functions you use trash the L1 instruction cache, so the cache is reloaded on each iteration, whereas splitting lets each loop load it once, on its first iteration only

These lists are certainly not comprehensive, but already you can sense that there is a tension between code and data. So it is difficult for us to make even an educated guess when we know neither your code nor your data.

When in doubt: profile. Use callgrind, check the cache misses in each case, check the number of instructions executed, and measure the time spent.

Matthieu M.
4

In terms of algorithmic complexity splitting the loops makes no difference.

In terms of real world performance splitting the loops could improve performance, worsen performance or make no difference - it depends on the OS, hardware and - of course - what statement 1 and statement 2 are.

JoeG
3

As noted, the complexity remains.

But in the real world, it is impossible for us to predict which version runs faster. The following factors all play roles, potentially huge ones:

  • Data caching
  • Instruction caching
  • Speculative execution
  • Branch prediction
  • Branch target buffers
  • Number of available registers on the CPU
  • Cache sizes

(Note: hanging over all of them is the Damocles sword of misprediction; all of these terms are easy to look up on Wikipedia or Google.)

The last factor in particular can make it impossible to compile the one true binary for code whose performance depends on specific cache sizes: some applications will run faster on a CPU with huge caches and slower on small caches, while for other applications it will be the opposite.

Solutions:

  • Let your compiler do the job of loop transformation. Modern versions of g++ are quite good at this, and also at automatic vectorization. Be aware that compilers know more about computer architecture than almost anyone.
  • Ship different binaries and a dispatcher.
  • Use cache-oblivious data structures/layouts and algorithms that adapt to the target cache.

It is always a good idea to aim for software that adapts to the target, ideally without sacrificing code quality. And before doing any manual optimization, whether microscopic or macroscopic, measure real-world runs; then, and only then, optimize.

Literature:

  • Agner Fog's Guides
  • Intel's Guides

Sebastian Mach
2

With two loops you will be paying for:

  • increased generated code size
  • 2x as many branch predictions
  • depending on the data layout of statement 1 and statement 2, you could be reloading data into the cache

The last point could have a huge impact in either direction. As with any performance optimization, you should measure.

Unknown1987
  • Your third point is likely the most important. It'll come down to whether you fit within the first-level CPU cache or not. If all the data combined fits in the cache, splitting likely won't help, but if the combined data is too big for the cache and each split fits, the gains could be substantial. – edA-qa mort-ora-y Mar 09 '12 at 13:39
1

As far as big-O complexity is concerned, this makes no difference: if one loop is O(n), then so is the two-loop solution.
As far as micro-optimization goes, it is hard to say. The cost of a loop is rather small, and we don't know the cost of accessing your objects (if they are in a vector, it should be rather small too), but there is a lot to consider to give a useful answer.

stefaanv
0

You're correct in noting that there will be some performance overhead from creating a second loop. Therefore, it cannot be "equally fast": the overhead, while small, is still overhead.

I won't try to speak intelligently about how collision systems should be built, but if you're trying to optimize performance it's better to avoid building unnecessary control structures if you can manage it without pulling your hair out.

Remember that premature optimization is one of the worst things you can do. Worry about optimization when you have a performance problem, in my opinion.

patrickn
  • As stefaanv noted, the cost of looping through all your objects a second time is indeterminate with the information you've given. – patrickn Mar 09 '12 at 13:33
  • I would also note that the two control structures you posted solve different problems, and thus are not easily compared in the context of performance. – patrickn Mar 09 '12 at 13:41
  • Without knowing more details and without actual measurement, it is impossible to say which version is faster. Caching, both data and instruction, as well as branch prediction (and -tables) and speculative execution add a lot of complexity to nowadays' optimization. Good point though on premature optimization. Measure first in the real world, then optimize. – Sebastian Mach Mar 16 '12 at 16:34