If we have an array of integer pointers which all pointing to the same int, and loop over it doing ++
operation, it'll be 100% slower than those pointers pointing to two different ints. Here is a concrete example
int* data[2];
int a, b;
a = b = 0;
for (auto i = 0ul; i < 2; ++i) {
// Case 3: 2.5 sec
data[i] = &a;
// Case 2: 1.25 sec
// if (i & 1)
// data[i] = &a;
// else
// data[i] = &b;
}
for (auto i = 0ul; i < 1000000000; ++i) {
// Case 1: 0.5sec
// asm volatile("" : "+g"(i)); // deoptimize
// ++*data[0];
++*data[i & 1];
}
In summary, the observations are: (described the loop body)
case 1 (fast): ++*pointer[0]
case 2 (medium): ++*pointer[i] with half pointer pointing to one int and other half pointing to another int.
case 3 (slow): ++*pointer[i] with all pointer pointing to the same int
Here are my current thoughts. Case 1 is fast because modern CPU knows we are read/write the same memory location, thus buffering the operation, while in Case 2 and Case 3, we need to write the result out in each iteration. The reason that Case 3 is slower than Case 2 is because when we write to a memory location by pointer a, and then trying to read it by pointer b, we have to wait the write to finish. This stops superscalar execution.
Do I understand it correctly? Is there any way to make Case 3 faster without changing the pointer array? (perhaps adding some CPU hints?)
The question is extracted from the real problem https://github.com/ClickHouse/ClickHouse/pull/7550