Make out-of-order CPU run instructions in-order

Question

Consider the loop:

for (int i = 0; i < n; i++) {
   sum += a[i];
}

An out-of-order CPU can execute many instructions in advance. For example, it can have 20 loads of a[i] in flight at once, from 20 different iterations of the loop.

For me, though, this is a hindrance. I want the CPU to behave like an in-order CPU: it should not start the load for the next iteration until the load for the current iteration has finished.

The reason is simple: I want to save memory bandwidth for other processes running on other CPU cores. This process is low priority, and I want to limit it as much as possible, even though it will get slower.

Two techniques come to mind: fake loop-carried dependencies and memory barriers.

For fake dependencies, something like this can be used:

const int* a_current = a;  // assumes a is an int array, matching a_val below
for (int i = 0; i < n; i++) {
   volatile int a_val = *a_current;
   sum += a_val;
   a_current += 1 + (a_val - a_val);
}

This is horrible code and I wonder if there is something better.

As for memory barriers, I know almost nothing about them. What could be useful there?

I'm not sure how this serves your purpose. Limiting this thread's memory bandwidth will by definition slow it down, which is what you say you don't want to do. And if you are okay slowing it down, then memory barriers don't achieve anything more than a simple delay (pad with nops or something). — Nate Eldredge, Jan 20 '23 at 16:08
*without slowing it down.* - destroying memory-level parallelism will obviously slow it down significantly. What kind of performance penalty is acceptable here? Factor of 10? Factor of 100? If so, `_mm_lfence()` inside the loop is the simple way, blocking out-of-order exec of all instructions (i.e. drain the ROB before issuing later insns), not just memory (because that's what x86 LFENCE does https://www.felixcloutier.com/x86/lfence; it's not useful for memory ordering except in very obscure corner cases). — Peter Cordes, Jan 20 '23 at 16:09
See [Is LFENCE serializing on AMD processors?](https://stackoverflow.com/q/51844886) / [When should I use \_mm\_sfence \_mm\_lfence and \_mm\_mfence](https://stackoverflow.com/q/4537753) — Peter Cordes, Jan 20 '23 at 16:09
Your reasoning is probably wrong. The code needs however much memory bandwidth it needs to get done what it has to get done. Getting it done quickly so it no longer needs any memory bandwidth is probably just as helpful to other code as slowing it down so it uses less memory bandwidth for longer. Getting low-priority tasks run as quickly as possible so the computer can 100% focus on higher-priority tasks for as long as possible is almost always the better approach. — David Schwartz, Jan 20 '23 at 16:22
@NateEldredge This was an error in my text. I am ok with slowing it down by a factor of 10x if this saves memory bandwidth. — Bogi, Jan 20 '23 at 21:22
@DavidSchwartz: The OP's previous question ([Prevent a CPU core from using the LL cache](https://stackoverflow.com/q/75185131)) said the high-priority task had low-latency requirements. So there's a real-time aspect to this, not just total throughput. An acceptably-small amount of competition for longer can maybe make sense. Note that bandwidth is a rate, bytes/second, not a total amount of traffic the way the term is sometimes misused. The total amount of traffic per work is probably fixed, but slowing it down will lower the bandwidth it competes for in memory controllers. — Peter Cordes, Jan 21 '23 at 04:37
"*Note that bandwidth is a rate, bytes/second, not a total amount of traffic the way the term is sometimes misused.*" In a fixed unit of time, it's total throughput. It's perfectly sensible to, for example, describe an amount of energy at some known power level as "3 seconds of power", which would mean it would take three seconds to get the needed energy if given 100% of the system's power, longer if given a lower fraction. — David Schwartz, Jan 21 '23 at 06:36
@DavidSchwartz Yes, I want to decrease the bandwidth. The program must read N bytes of data regardless of whether it does so faster or slower. — Bogi, Jan 21 '23 at 08:41
I wonder if using a power saving mode would provide adequate slowdown/bandwidth use reduction. By the way, having more outstanding memory accesses will **tend** to increase actual bandwidth since memory controllers can better schedule accesses to avoid dead time from bank conflicts (rare), row activate/precharge overhead, etc.; this effect might not be significant (but it is an interesting effect). — , Jan 22 '23 at 16:55
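
The `_mm_lfence()` suggestion from the comments could be sketched roughly as below. This is only an illustration under assumptions, not a tested recommendation: `sum_serialized` and `SERIALIZE_LOADS` are made-up names, `a` is assumed to be an `int` array, and on non-x86 targets the macro falls back to doing nothing. Whether LFENCE actually stops later loads from starting depends on the CPU (see the linked question about AMD), so the generated assembly and the real bandwidth should be checked on the target machine.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__x86_64__) || defined(_M_X64) || defined(__SSE2__)
#include <emmintrin.h>  /* _mm_lfence (SSE2) */
#define SERIALIZE_LOADS() _mm_lfence()
#else
/* Not x86 with SSE2: no LFENCE, so this sketch degrades to a plain loop. */
#define SERIALIZE_LOADS() ((void)0)
#endif

/* Hypothetical helper: sums n ints, issuing a fence after each load so the
 * load for iteration i+1 should not begin until the load for iteration i
 * has completed. */
long sum_serialized(const int *a, size_t n)
{
    assert(a != NULL || n == 0);
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += a[i];        /* load and accumulate the current element */
        SERIALIZE_LOADS();  /* wait for earlier instructions to finish */
    }
    return sum;
}
```

Compared with the fake-dependency version in the question, this keeps the loop body readable and does not depend on the compiler preserving an expression like `a_val - a_val`. The expected cost is a large slowdown, as the comments point out.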
