
I stumbled upon a peculiar performance issue when running the following C++ code on some Intel Xeon processors:

// array_a contains permutation of [0, n - 1]
// array_b and inverse are initialized arrays
for (int i = 0; i < n; ++i) {
  array_b[i] = array_a[i];
  inverse[array_b[i]] = i;
}

The first statement in the loop body sequentially copies array_a into array_b (very few cache misses expected). The second statement computes the inverse of array_b (many cache misses expected, since array_b is a random permutation). We may also split the code into two separate loops:

for (int i = 0; i < n; ++i)
  array_b[i] = array_a[i];
for (int i = 0; i < n; ++i)
  inverse[array_b[i]] = i;

I would have expected the two versions (single vs. dual loop) to perform almost identically on relatively modern hardware. However, it appears that some Xeon processors are incredibly slow when executing the single loop version.
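For reference, a minimal harness along these lines can be sketched as follows (timed with std::chrono; the function names, array size, and seed are illustrative, not those of the actual benchmark), reporting the same nanoseconds-per-element metric used for the measurements:

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Single-loop variant: copy and inverse computed in the same iteration.
double time_single(const std::vector<int>& a, std::vector<int>& b,
                   std::vector<int>& inv) {
    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < a.size(); ++i) {
        b[i] = a[i];
        inv[b[i]] = static_cast<int>(i);
    }
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / a.size();
}

// Dual-loop variant: sequential copy first, scattered stores second.
double time_dual(const std::vector<int>& a, std::vector<int>& b,
                 std::vector<int>& inv) {
    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < a.size(); ++i)
        b[i] = a[i];
    for (std::size_t i = 0; i < a.size(); ++i)
        inv[b[i]] = static_cast<int>(i);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / a.size();
}

// Build a random permutation of [0, n - 1] to use as array_a.
std::vector<int> random_permutation(std::size_t n, unsigned seed) {
    std::vector<int> a(n);
    std::iota(a.begin(), a.end(), 0);
    std::shuffle(a.begin(), a.end(), std::mt19937(seed));
    return a;
}
```

A driver would allocate three vectors of the same (large) size, run both functions on the same permutation, and print the two ns-per-element figures.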

Below you can see the wall time in nanoseconds divided by n when running the snippet on a range of different processors. For testing, the code was compiled with GCC 7.5.0 and the flags -O3 -funroll-loops -march=native on a system with a Xeon E5-4620v4. The same binary was then used on all systems, with numactl -m 0 -N 0 on systems with multiple NUMA domains.

The code used is available on GitHub; the interesting part is in the file runner.cpp.

[EDIT:] The assembly is provided here.

[EDIT:] New results including AMD EPYC.

[plot: wall time in nanoseconds divided by n, single- vs. dual-loop variant, for each of the tested processors]

On the various i7 models, the results are mostly as expected: using the single loop is only slightly slower than using the dual loops. This also holds for the Xeon E3-1271v3, which is essentially the same hardware as an i7-4790. The AMD EPYC 7452 performed best by far, with almost no difference between the single- and dual-loop implementations. However, on the Xeon E5-2690v4 and E5-4620v4 systems, using the single loop is incredibly slow.

In previous tests I also observed this strange performance issue on Xeon E5-2640 and E5-2640v4 systems. In contrast, there was no performance issue on several AMD EPYC and Opteron systems, nor on Intel i5 and i7 mobile processors.

My question to the CPU experts thus is: why does Intel's most high-end product line perform so poorly compared to other CPUs? I am by no means an expert in CPU architectures, so your knowledge and thoughts are much appreciated!

Jonas Ellert
  • I think this is a good and interesting question, though I am not sure it wouldn't be more suited for SuperUser https://superuser.com/ than Stack Overflow. – paisanco Sep 07 '20 at 15:30
  • Xeon E5-2640 not v3 or anything? So it's a clunky old Sandybridge? i.e. "Intel's most high-end product-line" from most of a decade before your AMD system, and years before your i7-6800k Broadwell. BTW, performance tuning questions like this are totally on-topic for Stack Overflow; we have a whole [tag:performance] tag that definitely includes questions like this. – Peter Cordes Sep 07 '20 at 16:06
  • Possibly related: [Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?](https://stackoverflow.com/q/39260020) - memory latency is significantly different on Xeon systems, and significantly *worse*. Also related: if that's a dual socket Sandybridge Xeon, the other socket's uncore (L3 + ring bus) might have clocked way down, hurting memory performance on this socket. See Dr. Bandwidth quotes + links in [Are caches of different level operating in the same frequency domain?](https://stackoverflow.com/a/54796834) – Peter Cordes Sep 07 '20 at 16:13
  • Also related: https://community.intel.com/t5/Software-Tuning-Performance/Single-Threaded-Memory-Bandwidth-on-Sandy-Bridge/td-p/959584 (even though your problem is more about scattered access, not so much bandwidth). – Peter Cordes Sep 07 '20 at 16:15
  • GCC including GCC7.5.0 splits the loop into a call to memcpy and then a scatter loop at `-O3`. https://godbolt.org/z/nnz9P9. Are you sure that wasn't happening for you? You didn't say which machine you used `-march=native` on, but GCC7.5 doesn't know about `-march=znver2`. Oh, in your github test code, you have std::vector, not global arrays, so alias analysis doesn't manage to do that. – Peter Cordes Sep 07 '20 at 16:31
  • Happens on both the "clunky old" Xeon E5-2640, and also the much newer Xeon E5-2640v4 and E5-4640v4. In general it does not matter on which machine I compile the code -- even if I compile on the Xeon with `-march=native`, and then use the resulting binary on the AMD system (or vice versa). The issue persists. – Jonas Ellert Sep 07 '20 at 16:54
  • It would be best to run the same binary on every system and provide the disassembly of both variants of the core function. – BeeOnRope Sep 07 '20 at 17:29
  • Note that even your "i7" is a Broadwell-E, not a pure "client" chip like a quad-core Skylake. So it's fairly similar to a single-socket Xeon v4. No cross-socket snooping effects, but still the higher uncore latency and more ring-bus hops, and I think lower single-core bandwidth than a "client" chip. – Peter Cordes Sep 08 '20 at 07:03
  • I have some more systems at my disposal. I'll add the disassembly and new results later. – Jonas Ellert Sep 08 '20 at 07:11
  • I see how `max_concurrency / latency` might be a limiting factor, but it is not entirely consistent with the observed throughput. The 14 core E5-2690V4 performs much better than the 10 core E5-4620v4. Since both of these are Broadwell, I would have assumed that they have a similar latency. And then for AMDs architecture everything seems to be entirely different. The EPYC with 32 cores behaves even better than the Intel quad-core CPUs. – Jonas Ellert Sep 09 '20 at 07:54
  • AMD's system architecture *is* fundamentally different from Intel's. Zen / Zen2 use clusters of 4 cores, with each cluster having its own L3 just for that cluster. *Not* a large shared L3 cache. Also, the design of a single core, with its store buffer, is presumably different for AMD. Perhaps their heuristics for software prefetch and/or early RFO for store buffer entries happen to be better. I don't have a specific explanation for what you're seeing, but yes you certainly seem to have found something that's qualitatively different between AMD and Intel. We do know such diffs exist. – Peter Cordes Sep 11 '20 at 10:08
  • It looks like the `run` function returns time in microseconds, not milliseconds, which means that the `nps` function calculates n per milliseconds, not n per seconds. Is the unit of the y axis correct? Another thing is that it's important to have the same core and memory frequencies across all systems. For the core frequency, set it to a fixed non-turbo value that is supported by all systems (or as close as possible). For the memory frequency, different systems support different protocols and frequencies, so it's not possible to have a close match, unfortunately. Should be taken into consideration. – Hadi Brais Sep 13 '20 at 02:52
  • I think that random access to `inverse` makes harm to sequental accesses to `array_a`/`array_b` in some place in memory architecture. Something is confused by it. I can be hardware prefetcher (does anything change if `_mm_prefetch` is added for `array_a` or/and `inverse` several iterations ahead?), false dependency (try to allocate all buffers page-aligned, but shift `array_a` pointer by 2624 bytes). According to [this](https://stackoverflow.com/questions/45382914/parallel-memory-access-on-modern-processors), memory controller and even DDR modules can have impact too. – stgatilov Sep 17 '20 at 07:09
  • Also, it can be false dependency `n` on array data. The assembly shows that compiler cannot prove that writing to `inverse` does not modify `n` (aliasing problem), so it loads `n` from memory on every iteration. If CPU decides that it might overlap with your writes (hint: it should not), then you have big problem. Better copy the value to local variable `n1` and use it in loop. In this case there is no problem with other variables, but in general you might want to do the same to pointers, and also probably mark pointers as `restrict` so that compilers knows they don't overlap. – stgatilov Sep 17 '20 at 07:18
  • @Jonas - could you collect results also for smaller region sizes, down to say for 4k? Only for one the slow Intel machines, since they exhibit the curious behavior. – BeeOnRope Sep 19 '20 at 03:46
  • Presumably, all those systems are different. How much actual RAM do they have access to? – Bahbar Oct 15 '20 at 19:10

2 Answers


After playing with the minimized example shown below, I conclude that the root cause is a variant of the issue in Travis Downs's 2017 question Unexpectedly poor and weirdly bimodal performance for store loop on Intel Skylake, for which he provided further specu^Wdiscussion in his 2019 blog post What Has Your Microcode Done for You Lately? and a 2020 post to RWT. I think the graphs shown in the question show how the effect is exacerbated on multi-socket systems, where responses to RFOs take longer, making the RFO serialization more visible.

In my test program, a PRNG is used instead of an extra array with permuted indices, and the effect is still visible. The PRNG has a latency of 4 cycles, and you can see that when the working set fits in the L1 cache, each pass takes a bit more than 4 * (1 << log_sz) cycles. When there are L1 misses, a great slowdown is observed, with apparently-useless prefetches recovering a lot of the performance. Sticking the same magic prefetch into the original program also improves performance.

#include <stddef.h>
#include <stdio.h>
#include <stdint.h>

#include <sys/mman.h>

/* 64-bit LCG; returns the top `bits` bits of the previous state. */
static uint64_t prng(int bits)
{
    static uint64_t state = 0;
    uint64_t ret = state >> (64 - bits);
    state = state * 6364136223846793005 + 1;
    return ret;
}

typedef uint64_t T;

typedef void fn(volatile T *, int, int);

/* Alternate chunks of sequential stores with chunks of scattered stores. */
static void no_prefetch(volatile T *buf, int log_n, int chunk)
{
    size_t n = (size_t)1 << log_n;
    for (size_t i, k = 0; k < n; k = i) {
        for (i = k; i < k + chunk; i++) {
            buf[i] = 0;
        }
        for (i = k; i < k + chunk; i++) {
            size_t j = prng(log_n);
            buf[j] = 0;
        }
    }
}

/* Same as no_prefetch, but with a software prefetch before each scattered store. */
static void do_prefetch(volatile T *buf, int log_n, int chunk)
{
    size_t n = (size_t)1 << log_n;
    for (size_t i, k = 0; k < n; k = i) {
        for (i = k; i < k + chunk; i++) {
            buf[i] = 0;
        }
        for (i = k; i < k + chunk; i++) {
            size_t j = prng(log_n);
            __builtin_prefetch((T*)buf+j, 1);
            buf[j] = 0;
        }
    }
}

static fn *const fns[] = { no_prefetch, do_prefetch };

int main(int argc, char **argv)
{
    unsigned with_prefetch, chunk, log_sz, reps;
    if (argc < 2 || sscanf(argv[1], "%u", &with_prefetch) != 1)
        with_prefetch = 0;
    if (argc < 3 || sscanf(argv[2], "%u", &chunk) != 1)
        chunk = 1;
    if (argc < 4 || sscanf(argv[3], "%u", &log_sz) != 1)
        log_sz = 13;
    if (argc < 5 || sscanf(argv[4], "%u", &reps) != 1)
        reps = 10000;
    size_t sz = sizeof(T) << log_sz;
    void *m;

    if ((m = mmap(0, sz, PROT_READ | PROT_WRITE,
              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0)) == MAP_FAILED)
        return 1;

    for (; reps; reps--)
        fns[with_prefetch](m, log_sz, chunk);
}
amonakov

Maybe this is related to AVX-512 frequency throttling on Intel processors. These instructions generate a lot of heat, and under some circumstances the processor reduces its working frequency.

Here are some benchmarks for OpenSSL that show the effect. There is a rant from Linus Torvalds on this topic.

If AVX-512 instructions are generated by "-march=native", you may be suffering from this effect. Try disabling AVX-512 with:

gcc -mno-avx512f
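To rule this out on a given machine, one can check at runtime whether the CPU reports AVX-512 at all. A minimal sketch using the GCC/Clang x86-only builtin __builtin_cpu_supports (the helper name is mine):

```cpp
// Returns true when the CPU reports the AVX-512 Foundation feature bit.
// CPUs without AVX512F cannot execute any AVX-512 instruction, so
// AVX-512 frequency throttling cannot be the cause on such machines.
bool has_avx512f() {
    return __builtin_cpu_supports("avx512f");
}
```

On Broadwell-class CPUs like the Xeons in the question this returns false, in which case -mno-avx512f changes nothing.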
Miguel
  • Interesting theory, since GCC7.4 at least does indeed default to using 512-bit ZMM vectors when available (with hugely bloated loop peeling of setup/cleanup https://godbolt.org/z/Mr35rPeqq even for a copy, instead of just calling memcpy), unlike how later GCC default to `-mprefer-vector-width=256` for `-march=skylake-avx512`. But **not applicable here because none of those CPUs support AVX-512.** They're Broadwell at the latest, or Coffee Lake client, no Xeon Scalable. – Peter Cordes Apr 02 '22 at 09:08
  • For details on the CPU frequency effect, see [SIMD instructions lowering CPU frequency](https://stackoverflow.com/q/56852812) on SO. A version of the same thing happens with 256-bit "heavy" instructions. But pure loads/stores aren't "heavy", and I doubt scatter stores are, either. (Not that GCC7.4 is able to vectorize the inverse part anyway. https://godbolt.org/z/c6G5sE7G7. But GCC11.2 is, with AVX-512 scatters using YMM 256-bit register width: https://godbolt.org/z/z5n1aEq1f) – Peter Cordes Apr 02 '22 at 09:18
  • That also wouldn't really explain the *shape* of the weird curves. This also has a performance difference (about a factor of 10) much larger than the biggest frequency difference (like 1.6x with all cores busy on a Xeon Gold 5120 for example for 128-bit vs. heavy 512-bit. But only 1.375x for 256-heavy vs. 512-heavy, or 1.2x for 256-light (no throttle) vs. 512-light.) – Peter Cordes Apr 02 '22 at 10:20