Consider the loop:
for (int i = 0; i < n; i++) {
sum += a[i];
}
An out-of-order CPU can execute many instructions in advance, it can e.g. have 20 parallel pending loads of a[i] from 20 different iterations of the loop.
But, for me, this is a hindrance. I want that the CPU works like an in-order CPU. I want it to not start a load in the next iteration until it has finished the load in the current iteration.
The reason I want this is very simple: I want to save the memory bandwidth for other processes running on other CPU core. This process is low priority, and I want to limit is as much as possible even though it will get slower.
Two techniques come to mind: fake loop-carried dependencies and memory barriers.
For fake dependencies, something like this can be used:
double* a_current = a;
for (int i = 0; i < n; i++) {
volatile int a_val = *a_current;
sum += a_val;
a_current += 1 + (a_val - a_val);
}
This is horrible code and I wonder if there is something better.
About memory barriers, I don't know almost anything. What could be useful there?