
I am analyzing the following code for performance:

template <int padding1, int padding2>
struct complex_t {
        float re;
        int p1[padding1];
        double im;
        int p2[padding2];
};

For the experiment, I choose values of padding1 and padding2 so that sizeof(complex_t) is always 64. I change the offset of the member im via padding1. I use two randomly generated arrays of complex_t, each with 10K elements. Then I perform a pairwise multiplication of the two arrays and measure the runtime and the number of executed instructions. Here is the multiplication code:

template <typename Complex>
void multiply(Complex* result, Complex* a, Complex* b, int n) {
    for (int i = 0; i < n; ++i) {
        result[i].re = a[i].re * b[i].re - a[i].im * b[i].im;
        result[i].im = a[i].re * b[i].im + a[i].im * b[i].re;
    }
}

And here are the measured results (5 runs, Intel(R) Core(TM) i5-10210U CPU, compiler Clang 15.0.7, flags -O3):

offsetof(im)   Runtime (min, avg, max) in sec   Instructions (avg)
8 bytes        0.107, 0.112, 0.116              175027800
16 bytes       0.088, 0.088, 0.088              175027200
24 bytes       0.088, 0.088, 0.088              175027100
32 bytes       0.088, 0.088, 0.088              175027100
40 bytes       0.088, 0.088, 0.088              175027100
48 bytes       0.085, 0.085, 0.086              175027100

As you can see, the instruction count is roughly the same. Yet the first sample, with the smallest offset, is the slowest. Something weird is going on: I am hitting some kind of hardware bottleneck. But I don't understand what the problem is, because I am missing a mental model of how data caches work at this low level. Can someone give me some ideas on what to look at or what to measure?

UPDATE: The counter MEM_LOAD_RETIRED.L3_HIT is unusually high for the smallest offset: 5097404, vs. 2775653 (16), 3015093 (24), 3277559 (32), 3261758 (40), and 3445190 (48).

Peter Cordes
Bogi
  • What compiler flags are you using when testing? Why not let the compiler pad for you, or adjust padding via flags? For what it's worth, the instruction count is largely irrelevant here. Larger padding values will make this data harder to cache as it takes up more space. – tadman May 27 '23 at 20:28
  • What are your performance goals here, and how close are you now? Would using SIMD help? Would shifting to a GPU approach be better? Even a mid-range GPU can absolutely crush on floating point math in ways a CPU just can't touch. – tadman May 27 '23 at 20:29
  • I am just experimenting and trying to figure out what is going on behind the scenes. – Bogi May 27 '23 at 20:33
  • Nothing wrong with experimenting. I'd advise dropping into a debugger to look at the resulting memory layout using no padding, default padding, and your custom padding. A lot of things can happen to your struct if you let the compiler really go at it. – tadman May 27 '23 at 20:33
  • It's also worth mentioning that in some cases you'll want to specialize and have an "array" structure with two parallel arrays if the types differ in size/alignment, like in this case, `complex_array_t` could have `float re[N]` and `double im[N]`. Generally the less padding there is, the better, but not at the expense of mucked up alignment. I do wonder why your precision for real and imaginary parts differs, though. – tadman May 27 '23 at 20:35
  • https://godbolt.org/z/asj66rbYx shows no meaningful difference in code-gen for `complex_t<1,12>` vs. `complex_t<7,6>`, only differences in displacements in addressing modes. i5-10210U is Comet Lake (https://ark.intel.com/content/www/us/en/ark/products/195436/intel-core-i510210u-processor-6m-cache-up-to-4-20-ghz.html), and you left out `-mbranches-within-32B-boundaries`, so that'd be my first guess, that you happened to get unlucky with [the JCC erratum](https://stackoverflow.com/q/61256646) in the code that tests that case but not in the others. – Peter Cordes May 28 '23 at 07:48
  • Either that or because it's the first test, if you didn't do warm-up runs then you could be including the cost of page faults in its timed region, for the first writes to touch the output array. [Idiomatic way of performance evaluation?](https://stackoverflow.com/q/60291987). Or if not page faults, then of getting it into L3 cache and writing back the dirty input data, if the last thing before the timed region was looping over the input array generating random numbers. – Peter Cordes May 28 '23 at 07:49
  • It is because it was the first run. The first run is always slower than the others: although the input data was in the cache, the output wasn't. – Bogi May 29 '23 at 20:22

1 Answer


You could try uiCA, an online CPU simulator from the uops.info project: https://uica.uops.info/.

It collects various statistics and shows you where the bottleneck in your program is. Just paste your assembly code and hit the run button.

Here is an example from the website (not your code):

Throughput (in cycles per iteration): 4.00
Bottleneck: Dependencies

The following throughputs could be achieved if the given property were the only bottleneck:

  - DSB: 1.00
  - Issue: 1.50
  - Ports: 1.50
  - Dependencies: 4.00
Stanislav