12

I have long wondered which makes better use of CPU caches (which are known to benefit from locality of reference): two loops, each iterating over the same mathematical set of numbers and each with a different body statement (e.g. a call to a function for each element of the set), or one loop whose body does the equivalent of the two (or more) body statements. We assume identical application state after all the looping.

In my opinion, having two loops would introduce fewer cache misses and evictions, because more of the instructions and data used by each loop would fit in the cache. Am I right?

Assuming:

  1. The cost of an `f` or `g` call is negligible compared to the cost of the loop itself

  2. `f` and `g` each use most of the cache by themselves, so the cache would be spilled when one is called after the other (the case with the single-loop version)

  3. Intel Core Duo CPU

  4. C language source code

  5. The GCC compiler, with "no extra switches"

I would like answers that go beyond the "premature optimization is evil" kind, if possible.

An example of the two-loop version that I am advocating:

int j = 0, k = 0;

for(int i = 0; i < 1000000; i++) {
    j += f(i);
}

for(int i = 0; i < 1000000; i++) {
    k += g(i);
}
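
And, for comparison, the single-loop version it would be weighed against (same `f` and `g`, same final state):

int j = 0, k = 0;

for(int i = 0; i < 1000000; i++) {
    j += f(i);
    k += g(i);
}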
Armen Michaeli
  • *"no extra switches"* - the GCC default is `-O0`, anti-optimized for consistent debugging. You definitely don't want that, and performance of un-optimized code is not just uniformly n% slower; it has *different* bottlenecks that make it usually not useful to benchmark debug builds. e.g. [Adding a redundant assignment speeds up code when compiled without optimization](https://stackoverflow.com/q/49189685) – Peter Cordes Aug 14 '21 at 13:08
  • Very good point, @PeterCordes, it's been a while since I asked this question, and I've definitely learned a lot since, so it is kind of embarrassing to be reading my question now, especially the "no extra switches". Or maybe it was a brain fart on my part then, too; I am not sure. I left the question as-is because I shouldn't be editing it either way now -- if I edit it to reflect my current understanding, the question loses meaning, so I might as well outright delete it. But I thought it's best left for posterity of sorts. But do try and convince me otherwise, if you want :) – Armen Michaeli Sep 09 '21 at 22:19
  • I think leaving it as is, with my comment to let future readers know that's not a good way to benchmark, is the best bet. Some might argue that comments are ephemeral and that it would be better to edit a note into the answer saying something like ("2021 update: I now realize this was not good benchmarking methodology" ... with a link), but I don't think that's necessary. I don't anticipate anyone marking my comment as "no longer needed". – Peter Cordes Sep 09 '21 at 22:36

7 Answers

10

To measure is to know.

Jim Lewis
  • 13 years after asking the question, I just want to say that I often recall this simple but profound truth, specifically every time I am profiling a body of code I have written, rolling my sleeves up and trying to measure things myself instead of giving in and asking what would turn out to be a question that most likely would not have a _good_ general answer. So again, thank you -- the shortest answer I have received on this site that is fit to be _the_ answer, considering what I have learned since asking the question. – Armen Michaeli Mar 29 '23 at 08:45
  • @ArmenMichaeli, thank you for the kind words! – Jim Lewis Mar 29 '23 at 19:27
6

Intuitively, one loop is better: you increment `i` a million fewer times, and all the other operation counts remain the same.

On the other hand, it completely depends on `f` and `g`. If both are sufficiently large that the code or cacheable data each uses nearly fills a critical cache, then swapping between `f` and `g` may completely swamp any single-loop benefit.

As you say: it depends.

CB Bailey
  • That's exactly why I was curious - I think when `f` and `g` are complex enough so that each needs most of the cache(s) for itself, calling both one after another within one loop body will have a detrimental effect on performance, absolutely. But that's my uneducated opinion of course. – Armen Michaeli Jul 25 '10 at 21:48
6

I can see three variables (even in a seemingly simple chunk of code):

  • What do f() and g() do? Can one of them invalidate all of the instruction cache lines (effectively pushing the other one out)? Can that happen in the L2 cache too (unlikely)? Then keeping only one of them in it might be beneficial. Note: the inverse does not imply "have a single loop", because:
  • Do f() and g() operate on large amounts of data, depending on i? Then it'd be nice to know whether they operate on the same set of data - again, you have to consider whether operating on two different sets hurts you via cache misses.
  • If f() and g() are indeed as primitive as you first state - and I'm assuming that in code size as well as running time and code complexity - cache locality issues won't arise in little chunks of code like this. Your biggest concern would be some other process getting scheduled with actual work to do and invalidating all the caches until it's your process' turn to run.

A final thought: given that processes like the above might be a rare occurrence in your system (and I'm using "rare" quite liberally), you could consider making both your functions inline and letting the compiler unroll the loop, as sketched below. That is because, for the instruction cache, falling back to L2 is no big deal, and the probability that the single cache line that'd contain i, j, k would be invalidated in that loop doesn't look so horrible. However, if that's not the case, some more details would be useful.
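
A minimal sketch of that suggestion; the bodies of f() and g() here are hypothetical placeholders, small enough to be inlining candidates:

static inline int f(int i) { return i * 2; } /* hypothetical body */
static inline int g(int i) { return i + 3; } /* hypothetical body */

int sum_all(void)
{
    int j = 0, k = 0;
    /* With inlining enabled (e.g. gcc -O2), the calls disappear and the
       compiler is free to unroll, and possibly fuse, the two loops. */
    for (int i = 0; i < 1000000; i++) {
        j += f(i);
    }
    for (int i = 0; i < 1000000; i++) {
        k += g(i);
    }
    return j + k;
}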

Michael Foukarakis
  • Given that the question was indeed too vague, I think your answer is THE answer here. Thanks. – Armen Michaeli Jul 25 '10 at 21:44
  • Sandybridge-family (including current Ice Lake / Tiger Lake) have a relatively small uop cache, like 1536 uops if packed perfectly. So one huge loop body could hurt front-end throughput if you exceed that limit. But f() and g() would have to be seriously complicated for that to be a real problem. If looping over the same data, it's much better to only have to load it into registers once (and to only do the loop control once). Computation intensity = ALU work per load (into registers or into L1d cache), and higher intensity is needed to feed wider pipelines and higher-throughput SIMD ALUs. – Peter Cordes Aug 14 '21 at 13:05
3

Your question is not clear enough to give a remotely accurate answer, but I think I understand where you are headed. The data you are iterating over is large enough that, before you reach the end, you will start to evict earlier data, so that the second time (the second loop) you iterate over it, some if not all of it will have to be read from memory again.

If the two loops were joined, so that each element/block is fetched for the first operation and is then already in cache for the second operation, then no matter how large the data is relative to the cache, most if not all of the second operations will take their data from the cache.

Various things, like the nature of the cache, or loop code being evicted by data and then fetched again (in turn evicting data), may cause some misses on the second operation. On a PC with an operating system, lots of evictions will occur as other programs get time slices. But assuming an ideal world, the first operation on index i of the data will fetch it from memory, and the second operation will grab it from cache.
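
A minimal sketch of that ideal-world case; op_a(), op_b(), and data are hypothetical stand-ins for the two operations and the set being iterated over:

#include <stddef.h>

/* hypothetical cheap operations standing in for the two loop bodies */
static int op_a(int x) { return x * 2; }
static int op_b(int x) { return x + 3; }

/* Fused pass: each data[i] is fetched from memory once for op_a(),
   and op_b() then reads the same cache line while it is still resident. */
int fused_pass(const int *data, size_t n)
{
    int sum_a = 0, sum_b = 0;
    for (size_t i = 0; i < n; i++) {
        sum_a += op_a(data[i]); /* first touch: cache line loaded from memory */
        sum_b += op_b(data[i]); /* same line, likely still cached: a hit */
    }
    return sum_a + sum_b;
}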

Tuning for a cache is difficult at best. I regularly demonstrate that even with an embedded system, no interrupts, a single task, and the same source code, execution time/performance can vary dramatically simply by changing compiler optimization options, changing compilers, or changing brands or versions of compilers: gcc 2.x vs 3.x vs 4.x, say (gcc is not necessarily producing faster code with newer versions, btw, and a compiler that is pretty good at a lot of targets is not really good at any one particular target). The same code with different compilers or options can change execution time by several times: 3 times faster, 10 times faster, etc.

Once you get into testing with or without a cache, it gets even more interesting. Add a single nop to your startup code so that your whole program moves one instruction over in memory, and your cache lines now hit in different places. Same compiler, same code. Repeat this with two nops, three nops, etc. Same compiler, same code, and you can see differences of tens of percent, better and worse (for the tests I ran that day, on that target, with that compiler).

That doesn't mean you can't tune for a cache; it just means that trying to figure out whether your tuning is helping or hurting can be difficult. The normal answer is just "time it and see", but that no longer suffices by itself: you might get great results on your computer that day, with that program, with that compiler, but tomorrow on your computer, or any day on someone else's computer, you may be making things slower, not faster. You need to understand why this or that change made it faster; maybe it had nothing to do with your code - your email program may have been downloading a lot of mail in the background during one test and not during the other.

Assuming I understood your question correctly, I think the single loop is probably faster in general.

old_timer
  • @amn Does the set live in memory or registers or where? – old_timer Jul 26 '10 at 14:25
  • @amn what is it that is in cache/memory that you are trying to optimize? – old_timer Jul 26 '10 at 14:25
  • Hello. Well, the question arose out of general curiosity. Occasionally, and not just when using C, I am in a similar situation with actual code. The set does not live anywhere, because the loop is `for(int i = 0; i < 1000000; i++)`, so only `i` will most likely be in memory and various caches at all times, depending on how smart the compiler is. I am trying to optimize the work done by each loop, but as both `f` and `g` are hypothetical, I admit it may be too vague for a definite answer. – Armen Michaeli Jul 26 '10 at 15:41
  • Depending on the loop and how much code is in the loop, etc., `i` can live in a register and never go to memory (even if a location is reserved for it). What kills you, if you are only looking at the loop variable, is the branch to the top of the loop flushing the pipe; it's like any branch: not as bad as a function call that has to set up the arguments, but it does cause a pipeline flush. – old_timer Jul 26 '10 at 16:44
  • If `i` were in memory (the stack if a local variable, main memory if global), then it would be cached. Maybe it takes an occasional bump from the cache, but the next time through the loop it is back in cache. Two loops would mean twice as many reads of the loop variable from cache, which still costs you, so two loops are slower, with or without a cache, as far as the loop variable `i` goes. – old_timer Jul 26 '10 at 16:46
  • On simple cores without complex overlapping pipelines, there's often a "sweet spot" of having as much stuff in each loop as will fit without creating extra register spills. Unfortunately, compilers like gcc and clang seem to use heuristics which perform poorly on the Cortex-M0, sometimes replacing a simple loop that would fit in registers with a 4x-unrolled version that doesn't, and which consequently takes more than 4x as long to execute. – supercat Apr 10 '23 at 22:13
2

Breaking the loops into smaller chunks is a good idea. It can improve the cache-hit ratio quite a lot and can make a lot of difference to the performance.

From your example:

int j = 0, k = 0;

for(int i = 0; i < 1000000; i++)
{
    j += f(i);
}

for(int i = 0; i < 1000000; i++)
{
    k += g(i);
}

I would either fuse the two loops into one loop like this:

int j = 0, k = 0;

for(int i = 0; i < 1000000; i++)
{
    j += f(i);
    k += g(i);
}

Or, if this is not possible, do the optimization called loop tiling:

#define TILE_SIZE 1000 /* or whatever you like - pick a number that keeps   */
                       /* the working-set below your first level cache size */

int j = 0, k = 0;
int i = 0;
int elements = 1000000;

do {
  int n = i + TILE_SIZE;
  if (n > elements) n = elements;

  // perform loop A on this tile
  for (int a = i; a < n; a++)
  {
    j += f(a);
  }

  // perform loop B on the same tile, while its data may still be cached
  for (int a = i; a < n; a++)
  {
    k += g(a);
  }

  i = n;
} while (i < elements);

The trick with loop tiling is that, if the loops share an access pattern, the second loop body gets the chance to re-use the data that has already been read into the cache by the first loop body. This wouldn't happen if you executed loop A a million times in full, because the cache is not large enough to hold all that data.

Breaking the loop into smaller chunks and executing them one after another helps a lot here. The trick is to limit the working set of memory to below the size of your first-level cache. I aim for half the size of the cache, so other threads that get executed in between don't mess up my cache so much. A hypothetical sizing is worked through below.
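
For instance (hypothetical numbers, just to show the arithmetic): with a 32 KiB first-level data cache, aiming for half of it leaves 16384 bytes; if the two loop bodies together touch, say, 16 bytes per index, that allows 16384 / 16 = 1024 indices per tile, so a TILE_SIZE of around 1000 is in the right ballpark.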

Nils Pipenbrinck
1

If I came across the two-loop version in code, with no explanatory comments, I would wonder why the programmer did it that way, and probably consider the technique to be of dubious quality, whereas a one-loop version would not be surprising, commented or not.

But if I came across the two-loop version along with a comment like "I'm using two loops because it runs X% faster in the cache on CPU Y", at least I'd no longer be puzzled by the code, although I'd still question if it was true and applicable to other machines.

joe snyder
  • @amn: no, you only referenced "optimization", not quality. and your desire to claim "all else being equal" for two unequal pieces of code is questionable. cherrypicking what factors are allowed to be considered to determine which code is "faster" is just an artificial puzzle that i think will more likely lead to bad habits than good programming. – joe snyder Jul 26 '10 at 04:12
-1

This seems like something the compiler could optimize for you, so instead of trying to figure it out yourself and make it fast, use whichever method makes your code clearer and more readable. If you really must know, time both methods with the input size and calculation type that your application uses (try the code you have now, but repeat your calculations many, many times, and disable optimization).
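
A minimal timing sketch along those lines; f() and g() are hypothetical stand-ins, and (per the comments below) you would normally compile it with optimizations on, e.g. gcc -O2:

#include <stdio.h>
#include <time.h>

static int f(int i) { return i * 2; } /* hypothetical workload */
static int g(int i) { return i + 3; } /* hypothetical workload */

static double seconds_since(struct timespec t0, struct timespec t1)
{
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    enum { N = 1000000 };
    struct timespec t0, t1;
    volatile int j = 0, k = 0; /* volatile keeps the loops from being optimized away */

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++) j += f(i);
    for (int i = 0; i < N; i++) k += g(i);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("two loops: %f s\n", seconds_since(t0, t1));

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++) { j += f(i); k += g(i); }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("one loop:  %f s\n", seconds_since(t0, t1));

    return 0;
}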

Vasiliy Sharapov
  • Disabling optimizations generally isn't a good idea: You'll benchmark something completely different than what you would really get when you use the code. You should benchmark with the same optimizations which you would for the real program, else it won't reflect the actual running times. – sth Jul 23 '10 at 20:46
  • @sth: I meant if he wanted to see which method was computationally faster, he could disable optimization to get the same effect as counting clocks by hand. – Vasiliy Sharapov Jul 23 '10 at 20:56
  • But that's the thing: disabling optimizations creates *different* bottlenecks (like store-forwarding latency from keeping loop counters in memory) than you'd get from two loops optimized separately. If your compiler really does loop fusion (most don't in most cases), you could hide your code from the optimizer by putting the loops in different functions in different source files, and not using link-time optimizations. (Assuming function-call overhead is minimal.) – Peter Cordes Aug 14 '21 at 13:00
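
A minimal sketch of that separation; the file names and bodies are hypothetical, and the point is only that, compiled separately without -flto, the optimizer in main.c cannot see into f():

/* f.c - built on its own, e.g.: gcc -O2 -c f.c */
int f(int i) { return i * 2; } /* hypothetical body */

/* main.c - built separately, e.g.: gcc -O2 -c main.c, then linked with f.o */
int f(int i); /* only the declaration is visible: opaque to the optimizer */

int run(void)
{
    int j = 0;
    for (int i = 0; i < 1000000; i++)
        j += f(i); /* a real call each iteration, exactly as written */
    return j;
}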