I stumbled upon a peculiar performance issue when running the following C++
code on some Intel Xeon processors:
// array_a contains permutation of [0, n - 1]
// array_b and inverse are initialized arrays
for (int i = 0; i < n; ++i) {
    array_b[i] = array_a[i];
    inverse[array_b[i]] = i;
}
The first line of the loop sequentially copies array_a into array_b (very few cache misses expected). The second line computes the inverse of array_b (many cache misses expected since array_b is a random permutation). We may also split the code into two separate loops:
for (int i = 0; i < n; ++i)
    array_b[i] = array_a[i];
for (int i = 0; i < n; ++i)
    inverse[array_b[i]] = i;
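For anyone who wants to reproduce this without the full benchmark, here is a minimal sketch of the two variants as standalone functions. The names make_permutation, single_loop and dual_loop are made up for this illustration and are not necessarily what runner.cpp uses; std::vector<int> is also an assumption about the array type.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical helper: builds a random permutation of [0, n - 1].
std::vector<int> make_permutation(int n, unsigned seed) {
    std::vector<int> p(n);
    std::iota(p.begin(), p.end(), 0);       // 0, 1, ..., n - 1
    std::mt19937 rng(seed);
    std::shuffle(p.begin(), p.end(), rng);  // random order
    return p;
}

// Single-loop variant: sequential copy and scattered inverse write
// happen in the same iteration.
void single_loop(const std::vector<int>& a, std::vector<int>& b,
                 std::vector<int>& inverse) {
    assert(b.size() == a.size() && inverse.size() == a.size());
    const int n = static_cast<int>(a.size());
    for (int i = 0; i < n; ++i) {
        b[i] = a[i];
        inverse[b[i]] = i;
    }
}

// Dual-loop variant: first the sequential copy, then the scattered writes.
void dual_loop(const std::vector<int>& a, std::vector<int>& b,
               std::vector<int>& inverse) {
    assert(b.size() == a.size() && inverse.size() == a.size());
    const int n = static_cast<int>(a.size());
    for (int i = 0; i < n; ++i)
        b[i] = a[i];
    for (int i = 0; i < n; ++i)
        inverse[b[i]] = i;
}
```

Both functions must leave b equal to a and produce the same inverse, so comparing their outputs on the same permutation is a cheap sanity check before timing anything.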
I would have expected the two versions (single vs. dual loop) to perform almost identically on relatively modern hardware. However, it appears that some Xeon processors are incredibly slow when executing the single loop version.
Below you can see the wall time in nanoseconds divided by n when running the snippet on a range of different processors. For testing, the code was compiled using GCC 7.5.0 with flags -O3 -funroll-loops -march=native on a system with a Xeon E5-4620v4. The same binary was then run on all systems, using numactl -m 0 -N 0 on systems with multiple NUMA domains.
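To make the metric concrete, the "wall time in nanoseconds divided by n" could be measured roughly as in the sketch below. This is only an illustration under my own assumptions: the helper name nanoseconds_per_element is hypothetical, it times a single run with std::chrono::steady_clock, and the actual measurement code in runner.cpp may differ (for example in warm-up or repetitions).

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical helper: runs `work` once and returns the elapsed wall time
// in nanoseconds divided by the number of elements n.
template <typename F>
double nanoseconds_per_element(F&& work, std::size_t n) {
    assert(n > 0);
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    // Convert the elapsed time to a floating-point nanosecond count.
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(n);
}
```

The callable would wrap one of the loop variants, e.g. a lambda that runs the single loop over arrays of size n. The results should be read or checked afterwards so that the compiler cannot discard the work being timed.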
The code used is available on GitHub. The interesting part is in the file runner.cpp.
[EDIT:] The assembly is provided here.
[EDIT:] New results including AMD EPYC.
On the various i7 models, the results are mostly as expected: the single loop is only slightly slower than the dual loops. This also holds for the Xeon E3-1271v3, which is basically the same hardware as an i7-4790. The AMD EPYC 7452 performed best by far, with almost no difference between the single and dual loop implementations. However, on the Xeon E5-2690v4 and E5-4620v4 systems, the single loop is incredibly slow.
In previous tests I also observed this strange performance issue on Xeon E5-2640 and E5-2640v4 systems. In contrast to that, there was no performance issue on several AMD EPYC and Opteron systems, and also no issues on Intel i5 and i7 mobile processors.
My question to the CPU experts is therefore: Why does Intel's most high-end product line perform so poorly compared to other CPUs? I am by no means an expert in CPU architectures, so your knowledge and thoughts are much appreciated!