I am analyzing the following code for performance:
```cpp
template <int padding1, int padding2>
struct complex_t {
    float re;
    int p1[padding1];
    double im;
    int p2[padding2];
};
```
For the experiment, I pick values for `padding1` and `padding2` so that `sizeof(complex_t)` is always 64 bytes; by varying `padding1` I change the offset of the member `im`. I use two randomly generated arrays of `complex_t`, each with 10K elements. Then I perform a pairwise multiplication between the two arrays and measure the runtime and the number of executed instructions. Here is the multiplication code:
```cpp
template <typename Complex>
void multiply(Complex* result, Complex* a, Complex* b, int n) {
    for (int i = 0; i < n; ++i) {
        result[i].re = a[i].re * b[i].re - a[i].im * b[i].im;
        result[i].im = a[i].re * b[i].im + a[i].im * b[i].re;
    }
}
```
And here are the measured results (5 runs, Intel(R) Core(TM) i5-10210U CPU, Clang 15.0.7, flags `-O3`):
| `offsetof(im)` | Runtime (MIN, AVG, MAX) in sec | Instructions AVG |
|---|---|---|
| 8 bytes | 0.107, 0.112, 0.116 | 175027800 |
| 16 bytes | 0.088, 0.088, 0.088 | 175027200 |
| 24 bytes | 0.088, 0.088, 0.088 | 175027100 |
| 32 bytes | 0.088, 0.088, 0.088 | 175027100 |
| 40 bytes | 0.088, 0.088, 0.088 | 175027100 |
| 48 bytes | 0.085, 0.085, 0.086 | 175027100 |
As you can see, the instruction count is roughly the same, yet the first sample, with the smallest offset, is the slowest. Something odd is going on; it seems to be hitting some kind of hardware bottleneck. I don't understand what the problem is, because I am missing a mental model of how data caches work at this low level. Can someone give me some ideas on what to look at or what to measure?
UPDATE: The counter `MEM_LOAD_RETIRED_L3_HIT` is unusually high for the smallest offset: 5097404, vs. 2775653 (offset 16), 3015093 (24), 3277559 (32), 3261758 (40), and 3445190 (48).
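For reference, I read the counters with Linux `perf`; a sketch of the invocation (the lowercase event names are the Skylake-family spellings and may differ on other CPUs, and `./benchmark` is a placeholder for the actual binary):

```shell
# Count retired instructions and retired loads that were served from L3,
# averaged over 5 runs of the benchmark binary.
perf stat -r 5 -e instructions,mem_load_retired.l3_hit ./benchmark
```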