
Please consider the following minimal example minimal.cpp (https://godbolt.org/z/x7dYes91M).

#include <immintrin.h>

#include <algorithm>
#include <ctime>
#include <iostream>
#include <numeric>
#include <vector>

#define NUMBER_OF_TUPLES 134'217'728UL

void transform(std::vector<int64_t>* input, std::vector<double>* output, size_t batch_size) {
  for (size_t startOfBatch = 0; startOfBatch < NUMBER_OF_TUPLES; startOfBatch += batch_size) {
    size_t endOfBatch = std::min(startOfBatch + batch_size, NUMBER_OF_TUPLES);

    for (size_t idx = startOfBatch; idx < endOfBatch;) {
      if (endOfBatch - idx >= 8) {
        auto _loaded = _mm512_loadu_epi64(&(*input)[idx]);
        auto _converted = _mm512_cvtepu64_pd(_loaded);

        _mm512_storeu_epi64(&(*output)[idx], _converted);
        idx += 8;
      } else {
        (*output)[idx] = static_cast<double>((*input)[idx]);
        idx++;
      }
    }

    asm volatile("" : : "r,m"(output->data()) : "memory");
  }
}

void do_benchmark(size_t batch_size) {
  std::vector<int64_t> input(NUMBER_OF_TUPLES);
  std::vector<double> output(NUMBER_OF_TUPLES);

  std::iota(input.begin(), input.end(), 0);

  auto t = std::clock();
  transform(&input, &output, batch_size);
  auto elapsed = std::clock() - t;

  std::cout << "Elapsed time for a batch size of " << batch_size << ": " << elapsed << std::endl;
}

int main() {
  do_benchmark(7UL);
  do_benchmark(8UL);
  do_benchmark(9UL);
}

It transforms the input array of int64_t into the output array of double in batches of a given batch_size. We have inserted the following AVX-512 intrinsics for the case where at least 8 tuples are left in the input, to process them all at once and thereby increase performance:

auto _loaded = _mm512_loadu_epi64(&(*input)[idx]);
auto _converted = _mm512_cvtepu64_pd(_loaded);
_mm512_storeu_epi64(&(*output)[idx], _converted);

Otherwise, we fall back to the scalar implementation.

To make sure that the compiler doesn't collapse the two loops, we use the asm volatile("" : : "r,m"(output->data()) : "memory") statement, which forces the output data to be flushed after each batch.

It is compiled and executed on an Intel(R) Xeon(R) Gold 5220R CPU using

clang++ -Wall -Wextra -march=cascadelake -mavx512f -mavx512cd -mavx512vl -mavx512dq -mavx512bw -mavx512vnni -O3 minimal.cpp -o minimal

Executing the code, however, results in the following surprising output

Elapsed time for a batch size of 7: 204007
Elapsed time for a batch size of 8: 237600
Elapsed time for a batch size of 9: 209838

It shows that, for some reason, the code is significantly slower with a batch_size of 8, while a batch_size of either 7 or 9 is significantly faster.

This is surprising to me, since a batch size of 8 should be the perfect configuration: it only ever uses the AVX-512 instructions and always processes exactly 64 bytes at a time. Why is this case so much slower, though?

Edit:

Added perf results for cache misses

Batch Size 7

 Performance counter stats for process id '653468':

     6,894,467,363      L1-dcache-loads                                               (44.43%)
     1,647,244,371      L1-dcache-load-misses     #   23.89% of all L1-dcache accesses  (44.43%)
     7,548,224,648      L1-dcache-stores                                              (44.43%)
         6,726,036      L2-loads                                                      (44.43%)
         3,766,847      L2-loads-misses           #   56.61% of all LL-cache accesses  (44.46%)
         6,171,407      L2-loads-stores                                               (44.45%)
         6,764,242      LLC-loads                                                     (44.46%)
         4,548,106      LLC-loads-misses          #   68.35% of all LL-cache accesses  (44.46%)
         6,954,088      LLC-loads-stores                                              (44.45%)

Batch Size 8

 Performance counter stats for process id '654880':

     1,009,889,247      L1-dcache-loads                                               (44.41%)
     1,413,152,123      L1-dcache-load-misses     #  139.93% of all L1-dcache accesses  (44.45%)
     1,528,453,525      L1-dcache-stores                                              (44.48%)
       158,053,929      L2-loads                                                      (44.51%)
       155,407,942      L2-loads-misses           #   98.18% of all LL-cache accesses  (44.50%)
       158,335,431      L2-loads-stores                                               (44.46%)
       158,349,901      LLC-loads                                                     (44.42%)
       155,902,630      LLC-loads-misses          #   98.49% of all LL-cache accesses  (44.39%)
       158,447,095      LLC-loads-stores                                              (44.39%)

      11.011153400 seconds time elapsed

Batch Size 9

 Performance counter stats for process id '656032':

     1,766,679,021      L1-dcache-loads                                               (44.38%)
     1,600,639,108      L1-dcache-load-misses     #   90.60% of all L1-dcache accesses  (44.42%)
     2,233,035,727      L1-dcache-stores                                              (44.46%)
       138,071,488      L2-loads                                                      (44.49%)
       136,132,162      L2-loads-misses           #   98.51% of all LL-cache accesses  (44.52%)
       138,020,805      L2-loads-stores                                               (44.49%)
       138,522,404      LLC-loads                                                     (44.45%)
       135,902,197      LLC-loads-misses          #   98.35% of all LL-cache accesses  (44.42%)
       138,122,462      LLC-loads-stores                                              (44.38%)
  • Could be a branch misprediction, if the compiler always predicts the `else` branch. For batches of 7 this would always be right, and for batches of 9 it would be right half the time, but for batches of 8 it's never right. Try moving the last `batch_size % 8` operations out into a separate `for` loop so you don't need the inner `if` on the hot path anymore. – Thomas Oct 14 '22 at 12:47
  • @Thomas For me, `perf stat` says 390k +-5k branch misses per execution for all three benchmarks, giving a misprediction rate of less than 0.08%. The `if` is compiled to a compare+jump, so hardware branch prediction handles these, which should work reliably if there's a predictable pattern, which is the case here. So I'd say branch misprediction is not an issue here. – He3lixxx Oct 14 '22 at 13:03
  • Maybe reduced AVX-512 max frequency? You're losing 10-15% which would probably be in the ballpark for at least some CPUs. – bg2b Oct 15 '22 at 10:51
  • @bg2b Yeah, I already checked that. While the clock frequency is higher when the batch size is 7 (around 2.9 GHz), it is 2.4 GHz for both a batch size of 8 and 9, yet 8 and 9 show different performance. – InvisibleShadowGhost Oct 15 '22 at 11:00
  • What's the relative performance if each test is a separate process, instead of one test with order 7, 8, 9? – bg2b Oct 15 '22 at 14:36
  • @bg2b This leads to nearly exactly the same results. – InvisibleShadowGhost Oct 15 '22 at 15:05

1 Answer


Update: testing (see comments) shows misalignment was not the explanation, and somehow aligning the arrays by 64 makes it slower. I wouldn't expect any 4k aliasing problem since we're loading and then storing, and large aligned allocations probably have the same alignment relative to a page boundary. i.e. are the same % 4096, probably 0. This is true even after simplifying the loops to not do so much branching with a short inner loop.


Your arrays are large and not aligned by 64, since you let std::vector<> allocate them. Using 64-byte vectors, every misaligned load will span a boundary between two 64-byte cache lines. (And you'll trip over the page-split at the end of every 4k page, although that's rare enough in sequential access to not explain this.) Unlike with 32-byte load/store where only every other vector will be a cache-line split.

(Glibc's malloc / new for large allocations typically keeps the first 16 bytes for bookkeeping, so the address it returns is 16 bytes past the start of a page, always misaligned by 32 and 64, always creating the worst case.)
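
If you want to check what alignment you're actually getting, a quick sketch (report_alignment is just a placeholder helper, not part of the question's code) is to print the data pointers modulo 64:

#include <cstdint>
#include <iostream>
#include <vector>

// Sketch: report how far a buffer is from 64-byte alignment.
// With glibc's allocator and large std::vector buffers, expect a non-zero
// offset (often 16, i.e. misaligned by both 32 and 64).
void report_alignment(const char* name, const void* p) {
  std::cout << name << " % 64 = "
            << (reinterpret_cast<std::uintptr_t>(p) % 64) << '\n';
}

// Usage inside do_benchmark(), after constructing the vectors:
//   report_alignment("input", input.data());
//   report_alignment("output", output.data());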

512-bit vectors (on Skylake/Cascade Lake at least) are known to slow down with misaligned 64-byte loads/stores (more than AVX1/2 code with misaligned 32-byte ops). Even when arrays are so large that you'd expect it to just bottleneck on DRAM bandwidth and have time to sort out any misalignment penalties inside the core while waiting for cache lines to arrive.

Single-core DRAM bandwidth on a big Xeon is pretty low vs. a "client" CPU, especially for Skylake-family. (The mesh interconnect was new in that generation, and it's lower than in Broadwell Xeon. Apparently Ice Lake Xeon made a big improvement to max per-core DRAM bandwidth.) So even scalar code is able to saturate memory bandwidth.

(Or perhaps batch=7 was auto-vectorizing with -mprefer-vector-width=256 after fully unrolling the inner loop? No, it wasn't even inlining your loop, and not unswitching that loop into while(full vector left) vector; / while(any left) scalar;, so you have pretty nasty asm that does a lot of branching for each vector and scalar.)
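
(For reference, here is a sketch of what that unswitched structure would look like at the source level, reusing the question's names; I've swapped in _mm512_loadu_si512 and _mm512_storeu_pd so the load/store intrinsics match the vector types:)

void transform(std::vector<int64_t>* input, std::vector<double>* output, size_t batch_size) {
  for (size_t startOfBatch = 0; startOfBatch < NUMBER_OF_TUPLES; startOfBatch += batch_size) {
    size_t endOfBatch = std::min(startOfBatch + batch_size, NUMBER_OF_TUPLES);
    size_t idx = startOfBatch;

    // Whole 512-bit vectors first: no per-element size check on the hot path.
    for (; idx + 8 <= endOfBatch; idx += 8) {
      __m512i loaded = _mm512_loadu_si512(&(*input)[idx]);
      __m512d converted = _mm512_cvtepu64_pd(loaded);
      _mm512_storeu_pd(&(*output)[idx], converted);
    }
    // Scalar tail for the remaining (batch_size % 8) elements of the batch.
    for (; idx < endOfBatch; ++idx) {
      (*output)[idx] = static_cast<double>((*input)[idx]);
    }

    asm volatile("" : : "r,m"(output->data()) : "memory");
  }
}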

But for some reason code that only ever uses 64-byte loads and stores can't max out one core's bandwidth. But your experiment shows that even a pattern of 1 vector + 1 scalar can help (batch=9), assuming that compiled to match the source.

I don't know why; maybe the load execution units run out of split buffers for handling loads that need data from two cache lines. (Perf event ld_blocks.no_sr). But the scalar loads don't need a split buffer entry because they're always naturally aligned (to 8 bytes). So they can execute if dispatched, maybe triggering fetch of cache lines sooner.

(HW prefetch doesn't work across 4k page boundaries where physical memory might be discontiguous; the L2 streamer only sees physical addresses. So a demand load into the next 4k page can get HW prefetch started early enough to max out DRAM bandwidth to L2, where maybe that wasn't happening if later split vector loads weren't happening. 4k boundaries apply even if using 2M transparent hugepages; the hardware prefetcher doesn't get told that the fetches are part of a contiguous hugepage.)

Batch=9 also makes one of every eight vectors aligned, which might help slightly.

These are wild guesses about microarchitectural causes, not backed up by any performance experiments to test these hypotheses.


Testing with aligned buffers

If you want to at least test whether misalignment is responsible for the whole thing, look into using a custom allocator for std::vector<int64_t, my_aligned_allocator> and/or std::vector<double, my_aligned_allocator>. (Modern approach to making std::vector allocate aligned memory). This is a good bet for production use, as it then works the same way as std::vector<int64_t>, although the 2nd template parameter makes it not type-compatible.
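
A minimal sketch of such an allocator (my_aligned_allocator is just a placeholder name; this needs C++17 for the aligned operator new/delete):

#include <cstddef>
#include <new>
#include <vector>

// Sketch: allocator that over-aligns every allocation to 64 bytes.
template <typename T>
struct my_aligned_allocator {
  using value_type = T;
  static constexpr std::align_val_t alignment{64};

  my_aligned_allocator() = default;
  template <typename U>
  my_aligned_allocator(const my_aligned_allocator<U>&) noexcept {}

  T* allocate(std::size_t n) {
    return static_cast<T*>(::operator new(n * sizeof(T), alignment));
  }
  void deallocate(T* p, std::size_t) noexcept {
    ::operator delete(p, alignment);
  }
};

template <typename T, typename U>
bool operator==(const my_aligned_allocator<T>&, const my_aligned_allocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const my_aligned_allocator<T>&, const my_aligned_allocator<U>&) { return false; }

// Usage: same element type, but a different vector type than plain std::vector<int64_t>.
// std::vector<int64_t, my_aligned_allocator<int64_t>> input(NUMBER_OF_TUPLES);
// std::vector<double, my_aligned_allocator<double>>   output(NUMBER_OF_TUPLES);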

For a quick experiment, make them std::vector<__m512i> and/or <__m512d> and change the loop code. (And compile with at least C++17 to make the standard library respect alignof(T).) (Useful to see whether source or destination misalignment is the critical factor, or both.) For batch=8 you can directly loop over the vectors. In the general case you'll need to static_cast<char*>(src->data()) and do the appropriate pointer math if you want to test this way. GNU C might define the behaviour of pointing a double* into a __m512d because it happens to be defined in terms of double, but there are examples of pointing an int* at a __m256i not working as hoped. For a performance experiment, you can just check the asm and see if it's sane.
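
For the batch=8 whole-vector case, that quick experiment could be as small as this (a sketch, not the question's code; transform_whole_vectors is a placeholder name, and it assumes NUMBER_OF_TUPLES is a multiple of 8):

#include <immintrin.h>
#include <cstddef>
#include <vector>

// Sketch: loop directly over whole 512-bit vectors, so every load and store is
// 64-byte aligned (given a C++17 standard library that honors alignof(__m512i)).
void transform_whole_vectors(const std::vector<__m512i>& in, std::vector<__m512d>& out) {
  for (std::size_t v = 0; v < in.size(); ++v) {
    out[v] = _mm512_cvtepu64_pd(in[v]);
  }
}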

(Also you'd want to check that the compiler unrolled that inner loop, not actually branching inside a loop.)

Or use aligned_alloc to get raw storage instead of std::vector. But then you'd need to write to both arrays yourself to avoid page faults being part of the timed region for the first test, like std::vector's constructor does. (Idiomatic way of performance evaluation?) (std::vector is annoying when you don't want to write memory before your SIMD loop, since using .emplace_back is a pain with SIMD intrinsics. Not to mention that it sucks at growing, unable to use realloc in most C++ implementations to sometimes avoid having to copy.)

Or instead of writing an init loop or memset, do a warm-up pass? Good idea anyway for AVX-512 to make sure the 512-bit execution units are warmed up, and the CPU is at a frequency where it's able to run 512-bit FP instructions at the lowish throughput needed. (SIMD instructions lowering CPU frequency)
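
(A sketch of what that might look like in the question's do_benchmark, just running the transform once untimed before the measured call:)

// Sketch: untimed warm-up pass before the measurement. It touches both arrays
// (so no page faults land in the timed region) and gives the CPU time to power
// up the 512-bit execution units and settle at its AVX-512 frequency.
transform(&input, &output, batch_size);   // warm-up run, result discarded

auto t = std::clock();
transform(&input, &output, batch_size);   // timed run
auto elapsed = std::clock() - t;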

(Maybe __attribute__((noinline,noipa)) on do_benchmark, although I don't think Clang knows GCC's noipa attribute = no inter-procedural analysis.)

Peter Cordes
  • @Peter Cordes Thanks for all the input, I went with C arrays, aligned allocs and c++20 now: https://godbolt.org/z/a7E8sdj9e. It turns out the performance gets significantly _worse_ in the `batch=8` case: ` Elapsed time for a batch size of 7: 204434 Elapsed time for a batch size of 8: 413352 Elapsed time for a batch size of 9: 213768 ` Isn't that totally against your theory now? (I confirmed that the loads and stores are aligned by using `_mm512_load_epi64` and `_mm512_store_epi64`, which works for 7 and 8, but then obviously fails for 9.) – InvisibleShadowGhost Oct 16 '22 at 07:07
  • @InvisibleShadowGhost: And now ... what? That Godbolt link crashes for me (Program returned: 132). If it worked for you, then it's probably because AWS instances sometimes run on AMD Zen2 or Zen3, which don't support AVX-512. Was 8 faster or at least equal with aligned 64-byte vectors? (Nvm, you did edit your comment with your results.) – Peter Cordes Oct 16 '22 at 07:10
  • Sorry, I accidentally hit `enter` too early, the comment is updated :) No, it also crashes in Godbolt for me but runs fine on my local Cascadelake instance. – InvisibleShadowGhost Oct 16 '22 at 07:12
  • @InvisibleShadowGhost: maybe try with the loop simplified to not do all that crazy branching for `if (endOfBatch - idx >= 8)` *inside* the inner loop. With `batch_size` as a runtime variable, it appears clang is running a lot of instructions other than vector loads/stores. I haven't tried to sort through all the code to look for loop-carried data dependencies or anything, but if it's still slow even in the best case with a normal vector loop that clang can unroll by 4, that would basically rule out that overhead as the problem, leaving only 64-byte vector memory access being slow. – Peter Cordes Oct 16 '22 at 07:19
  • @InvisibleShadowGhost: e.g. let it auto-vectorize with `-march=cascadelake -mprefer-vector-width=512` (which implies all those `-mavx512f` and so on options; that's part of the point of `-march` over `-mtune`). https://godbolt.org/z/ToazoboTc . That also runs the init code with 512-bit vectors, so that's also already warmed up, in case it matters. (If that version ends up fast, maybe manually write a simple loop with 512-bit intrinsics so we can get back to the same code-gen for init loops. But the clock transition would only take fractions of a millisec, and the whole test lasts longer.) – Peter Cordes Oct 16 '22 at 07:23
  • Alright, I went with https://godbolt.org/z/8croehx1s now, i.e. individual loops for each case, so no unnecessary branching should happen for the 7/8 case. However, the numbers don't improve: `Elapsed time for a batch size of 7: 197898 Elapsed time for a batch size of 8: 471224 Elapsed time for a batch size of 9: 212742` – InvisibleShadowGhost Oct 16 '22 at 07:27
  • BTW, I tried on Godbolt with `-march=znver3` without AVX-512 stuff, and it still won't run. So probably it's allocating too much memory, and/or running too long and timing out. So it's simply not testable on Godbolt even with a Cascade Lake instance, unless you reduce the array size and make multiple passes with a repeat loop. (Which could still be big enough to blow through L3) – Peter Cordes Oct 16 '22 at 07:27
  • @InvisibleShadowGhost: What does `grep . /sys/devices/system/cpu/cpufreq/policy[0-9]*/energy_performance_preference` show? If it's `balance_power` or even `balance_performance`, Skylake CPUs will clock down on memory-bound workloads, but that might make the mesh interconnect run slower, too, reducing single-core memory bandwidth. (By increasing latency in nanoseconds, so bandwidth = parallelism / latency drops.) [Slowing down CPU Frequency by imposing memory stress](//stackoverflow.com/q/63399456) has some test results from my SKL, and a shell command to raise it to `performance` – Peter Cordes Oct 16 '22 at 07:34
  • There seems to be no `energy_performance_preference` in `/sys/devices/system/cpu/cpufreq/policy[0-9]*/` for me, only `affected_cpus cpuinfo_max_freq cpuinfo_min_freq cpuinfo_transition_latency related_cpus scaling_available_governors scaling_cur_freq scaling_driver scaling_governor scaling_max_freq scaling_min_freq scaling_setspeed` – InvisibleShadowGhost Oct 16 '22 at 07:38
  • @InvisibleShadowGhost: The other possibility that comes to mind is that 64-byte stores of whole cache lines are triggering some special behaviour, like a no-RFO store protocol with higher latency to hand off the cache lines. (Like `rep stosb` uses, or a bit like `movnt` but not forcing eviction, see [Enhanced REP MOVSB for memcpy](https://stackoverflow.com/q/43343231)). Maybe test 2x 256-bit loads and shuffle together for a 512-bit store, vs. 512-bit load / convert then store the low/high 256-bit halves? I assume 256-bit vectors are fast? – Peter Cordes Oct 16 '22 at 07:38
  • @InvisibleShadowGhost: Oh, well if you can check on clock speed of the relevant core during execution of the test, do that. If it's clocking down a lot for batch=8, that could be it. IDK if Xeons are just different, or maybe your system isn't using Intel-pstate to let the hardware decide what clock speed to run. Or if a different kernel is different from my 5.19 Arch Linux. – Peter Cordes Oct 16 '22 at 07:40
  • The clock frequency is higher when the batch size is 7 (around 2.9 GHz); it is 2.4 GHz for both a batch size of 8 and 9. – InvisibleShadowGhost Oct 16 '22 at 07:42
  • Splitting the 512-bit instruction into two 256-bit instructions (https://godbolt.org/z/M1j1fshjM) makes the performance a bit better again, but still worse: `Elapsed time for a batch size of 7: 199152 Elapsed time for a batch size of 8: 250974 Elapsed time for a batch size of 9: 214520` – InvisibleShadowGhost Oct 16 '22 at 07:47
  • @InvisibleShadowGhost: I assume you're using different compiler options locally than you are on Godbolt? Your https://godbolt.org/z/M1j1fshjM link shows `transform_7` auto-vectorizing with 512-bit instructions since you used `-march=cascadelake -mprefer-vector-width=512 -O3`. But it ran fast. Or does it keep the nested loops and only actually run the scalar loop? Since it's not just cleanup, it jumps back to the top of an outer loop after leaving the scalar loop. – Peter Cordes Oct 16 '22 at 08:02
  • I don't think so, I compile locally with `clang++ -std=c++20 -march=cascadelake -mprefer-vector-width=512 -O3 minimal.cpp -o minimal` and clang version 14.0.0. – InvisibleShadowGhost Oct 16 '22 at 08:11
  • @InvisibleShadowGhost: Ok, so I guess those vector instructions aren't running after the integer code sorts out what to do each outer loop. Seems like a weird choice, adding extra overhead to the `transform_7` inner loop vs. `-fno-tree-vectorize`. (Except clang doesn't seem to respect `-fno-vectorize` or `-fno-tree-vectorize` for that loop! https://godbolt.org/z/c8zTa6ce8) – Peter Cordes Oct 16 '22 at 08:16
  • I have now basically reposted that question in the Intel forum (https://community.intel.com/t5/Processors/AVX-512-array-transformation-slower-when-transforming-in-batches/m-p/1425516#M59776), maybe they have an idea... – InvisibleShadowGhost Oct 27 '22 at 10:59