
I have a piece of code that is being run under a heavily contended lock, so it needs to be as fast as possible. The code is very simple - it's a basic multiply-add on a bunch of data which looks like this:

for( int i = 0; i < size; i++ )
{
    c[i] += (double)a[i] * (double)b[i];
}

Under -O3 with SSE enabled, the code is vectorized as I would expect. However, with AVX code generation turned on I get about a 10-15% slowdown instead of a speedup, and I can't figure out why.

Here's the benchmark code:

#include <chrono>
#include <cstdio>
#include <cstdlib>

int main()
{
    int size = 1 << 20;

    float *a = new float[size];
    float *b = new float[size];
    double *c = new double[size];

    for (int i = 0; i < size; i++)
    {
        a[i] = rand();
        b[i] = rand();
        c[i] = rand();
    }

    for (int j = 0; j < 10; j++)
    {
        auto begin = std::chrono::high_resolution_clock::now();

        for( int i = 0; i < size; i++ )
        {
            c[i] += (double)a[i] * (double)b[i];
        }

        auto end = std::chrono::high_resolution_clock::now();
        auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count();

        printf("%lluus\n", duration);
    }
}

Here's the generated assembly under SSE:

0x100007340 <+144>:  cvtps2pd (%r13,%rbx,4), %xmm0
0x100007346 <+150>:  cvtps2pd 0x8(%r13,%rbx,4), %xmm1
0x10000734c <+156>:  cvtps2pd (%r15,%rbx,4), %xmm2
0x100007351 <+161>:  mulpd  %xmm0, %xmm2
0x100007355 <+165>:  cvtps2pd 0x8(%r15,%rbx,4), %xmm0
0x10000735b <+171>:  mulpd  %xmm1, %xmm0
0x10000735f <+175>:  movupd (%r14,%rbx,8), %xmm1
0x100007365 <+181>:  addpd  %xmm2, %xmm1
0x100007369 <+185>:  movupd 0x10(%r14,%rbx,8), %xmm2
0x100007370 <+192>:  addpd  %xmm0, %xmm2
0x100007374 <+196>:  movupd %xmm1, (%r14,%rbx,8)
0x10000737a <+202>:  movupd %xmm2, 0x10(%r14,%rbx,8)
0x100007381 <+209>:  addq   $0x4, %rbx
0x100007385 <+213>:  cmpq   $0x100000, %rbx           ; imm = 0x100000 
0x10000738c <+220>:  jne    0x100007340               ; <+144> at main.cpp:26:20

Results from running SSE benchmark:

1411us
1246us
1243us
1267us
1242us
1237us
1246us
1242us
1250us
1229us

Generated assembly with AVX enabled:

0x1000070b0 <+144>:  vcvtps2pd (%r13,%rbx,4), %ymm0
0x1000070b7 <+151>:  vcvtps2pd 0x10(%r13,%rbx,4), %ymm1
0x1000070be <+158>:  vcvtps2pd 0x20(%r13,%rbx,4), %ymm2
0x1000070c5 <+165>:  vcvtps2pd 0x30(%r13,%rbx,4), %ymm3
0x1000070cc <+172>:  vcvtps2pd (%r15,%rbx,4), %ymm4
0x1000070d2 <+178>:  vmulpd %ymm4, %ymm0, %ymm0
0x1000070d6 <+182>:  vcvtps2pd 0x10(%r15,%rbx,4), %ymm4
0x1000070dd <+189>:  vmulpd %ymm4, %ymm1, %ymm1
0x1000070e1 <+193>:  vcvtps2pd 0x20(%r15,%rbx,4), %ymm4
0x1000070e8 <+200>:  vcvtps2pd 0x30(%r15,%rbx,4), %ymm5
0x1000070ef <+207>:  vmulpd %ymm4, %ymm2, %ymm2
0x1000070f3 <+211>:  vmulpd %ymm5, %ymm3, %ymm3
0x1000070f7 <+215>:  vaddpd (%r14,%rbx,8), %ymm0, %ymm0
0x1000070fd <+221>:  vaddpd 0x20(%r14,%rbx,8), %ymm1, %ymm1
0x100007104 <+228>:  vaddpd 0x40(%r14,%rbx,8), %ymm2, %ymm2
0x10000710b <+235>:  vaddpd 0x60(%r14,%rbx,8), %ymm3, %ymm3
0x100007112 <+242>:  vmovupd %ymm0, (%r14,%rbx,8)
0x100007118 <+248>:  vmovupd %ymm1, 0x20(%r14,%rbx,8)
0x10000711f <+255>:  vmovupd %ymm2, 0x40(%r14,%rbx,8)
0x100007126 <+262>:  vmovupd %ymm3, 0x60(%r14,%rbx,8)
0x10000712d <+269>:  addq   $0x10, %rbx
0x100007131 <+273>:  cmpq   $0x100000, %rbx           ; imm = 0x100000 
0x100007138 <+280>:  jne    0x1000070b0               ; <+144> at main.cpp:26:20

Results from running AVX benchmark:

1532us
1404us
1480us
1464us
1410us
1383us
1333us
1362us
1494us
1526us

Note that the AVX code being generated with twice as many instructions per iteration as the SSE code doesn't really matter - I tried a smaller unroll by hand (to match the SSE version) and AVX was still slower.

For context, I'm using macOS 11 and Xcode 12 on a Mac Pro 6.1 (trashcan) with an Intel Xeon E5-1650 v2 @ 3.50GHz.

2 Answers


Update: alignment didn't help much, or at all. There may also be another bottleneck, e.g. in the packed float->double conversion. Also, vcvtps2pd (%r13,%rbx,4), %ymm0 only has a 16-byte memory source, so the a/b loads are never 32 bytes wide; only the accesses to c (the loads folded into vaddpd, and the stores) are 32-byte. (I wrote the answer below before looking carefully enough at the code.)
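
In intrinsics terms, each 4-element step of the vectorized loop amounts to the following (a hand-written sketch for illustration, not the compiler's actual unrolled-by-4 output; it assumes size is a multiple of 4):

#include <immintrin.h>

// Sketch: same dataflow as the compiler's AVX loop, one 4-element step at a time.
static void madd_avx(const float *a, const float *b, double *c, int size)
{
    for (int i = 0; i < size; i += 4)   // assumes size % 4 == 0
    {
        __m256d va = _mm256_cvtps_pd(_mm_loadu_ps(a + i));  // 16-byte load, widened to 4 doubles
        __m256d vb = _mm256_cvtps_pd(_mm_loadu_ps(b + i));  // 16-byte load, widened to 4 doubles
        __m256d vc = _mm256_loadu_pd(c + i);                 // 32-byte load of c
        vc = _mm256_add_pd(vc, _mm256_mul_pd(va, vb));
        _mm256_storeu_pd(c + i, vc);                         // 32-byte store of c
    }
}

Per step that's two 16-byte loads of a and b, one 32-byte load of c, and one 32-byte store of c.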


That's an Ivy Bridge CPU. Is your data aligned by 32? If not, cache-line splits on 32-byte loads or stores are a known serious bottleneck for those old microarchitectures. Those early Intel AVX-supporting CPUs have full-width ALUs, but they run 32-byte loads and stores as 2 separate data cycles in the execution units from the same uop (footnote 1), making a cache-line split an extra-special (and extra-slow) case (https://www.realworldtech.com/sandy-bridge/7/). That's unlike Haswell (and Zen 2) and later, which have 32-byte data paths (footnote 2).

So slow that GCC's default -mtune=generic code generation will even split 256-bit AVX loads and stores that aren't known at compile time to be aligned. (This is overkill and hurts especially on newer CPUs, when the data actually is aligned but the compiler doesn't know it, or when the data is aligned in the common case but the function still needs to work on the occasional misaligned array; it's better to let the hardware deal with that special case than to run more instructions in the common case just to check for it.)

But you're using clang, which makes somewhat nice code here (unrolled by 4x) that would perform well with aligned arrays, or on a newer CPU like Haswell. Unfortunately it uses indexed addressing modes, defeating much of the purpose of unrolling (on Intel Sandy Bridge / Ivy Bridge especially) because the load and ALU uops un-laminate and go through the front-end separately; see Micro fusion and addressing modes. (Haswell can keep some of them micro-fused in the SSE case but not the AVX one, e.g. the stores.)

You can use aligned_alloc, or maybe do something with C++17 aligned new to get an aligned allocation that's compatible with delete.

Plain new may be giving you a pointer aligned by 16, but misaligned by 32. I don't know about macOS, but on Linux, glibc's allocator for large-ish allocations typically keeps 16 bytes for bookkeeping at the start of a page, so large allocations tend to be 16 bytes away from any alignment larger than 16.
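
For example, swapping the benchmark's new[] calls for something like the following (a sketch assuming C++17's std::aligned_alloc; the byte counts must be multiples of the alignment, which they are here, and the memory is released with free rather than delete[]):

#include <cstdlib>   // std::aligned_alloc, std::free (C++17)

// All three sizes are multiples of 32 for size = 1 << 20.
float  *a = static_cast<float  *>(std::aligned_alloc(32, size * sizeof(float)));
float  *b = static_cast<float  *>(std::aligned_alloc(32, size * sizeof(float)));
double *c = static_cast<double *>(std::aligned_alloc(32, size * sizeof(double)));

// ... run the benchmark loops as before ...

std::free(a);   // aligned_alloc memory is freed, not delete[]'d
std::free(b);
std::free(c);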


Footnote 2: That single uop that spends a 2nd cycle in a load port still only does address generation once. That allows another uop (e.g. a store-address uop) to use the AGU while the 2nd data cycle is happening, so it doesn't interfere with the address-handling part being fully pipelined.

SnB / IvB only have 2 AGU/load ports, so they can normally execute up to 2 memory operations per clock, at most one of which is a store. But with 32-byte loads and stores only needing an address every 2 data cycles (and store-data already being a separate uop, on a different port from store-address), that allows SnB / IvB to achieve 2 + 1 load+store per clock, sustained, for the special case of 32-byte loads and stores. (That uses most of the front-end bandwidth, so those loads typically need to be micro-fused as a memory source operand for another instruction.)
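
Concretely: over two cycles that's two 32-byte loads plus one 32-byte store, i.e. 48 bytes of L1d data moved per cycle sustained, versus at most 32 bytes per cycle when every access is 16 bytes wide and each one needs its own AGU slot.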

See also my answer on How can cache be that fast? on electronics.SE.

Footnote 1: Zen 1 (and Bulldozer-family) decode all 32-byte operations into 2 separate uops, so there's no special case. One half of the load can be split across a cache line, and that behaves exactly like a cache-line split on a 16-byte xmm load.

– Peter Cordes
  • Thanks for your answer! I've tried benchmark code with aligned_alloc(32, array_size) and didn't get any improvement, unfortunately. Just to clarify - did I understand correctly that I should've expected improvement or is my CPU just incapable of running this AVX code efficiently? – prazuber Dec 14 '20 at 23:39
  • @prazuber: I would have expected that to help, or at least make them both equal (if bottlenecked on memory bandwidth). Try a smaller total size (that fits in L3 cache, or even L2) to create a situation where AVX can significantly help. Although packed float->double conversion requires shuffling, that may be a bottleneck, too. – Peter Cordes Dec 14 '20 at 23:41
  • Amusingly, changing floats to doubles made it almost twice as slow for both SSE and AVX, with no improvement between the two. However, reducing array sizes actually helped! Having only 32768 elements in each array nets about 10% improvement with AVX, and with 16384 elements it's almost 20% faster. Does it mean I have to do manual prefetching to get this extra speed? – prazuber Dec 15 '20 at 00:32
  • @prazuber: What it means is that you should cache-block for problem sizes that fit in L2 cache, if you're doing something with computational intensity as low as this (see the sketch after this comment thread). Or just do more work per time you load the data. But yeah, manual prefetch (or asking the compiler to emit prefetch instructions) might recover some performance for the large array case in this microbenchmark; HW prefetch doesn't work perfectly (or at all) across 4k page boundaries, so OoO exec of demand loads may be part of what actually does happen at page boundaries. – Peter Cordes Dec 15 '20 at 00:41
  • @prazuber: Note that many-core Xeon CPUs have notoriously low single-threaded memory bandwidth, compared to same-generation quad-core "client" chips: [Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?](//stackoverflow.com/q/39260020). Only the aggregate bandwidth is good with a big Xeon, and it takes multiple threads to get even as much bandwidth as 1 thread on a desktop, let alone saturate the higher total bandwidth with more memory controllers. Memory performance tuning is hard... [What Every Prog.. Should Know About Mem.](//stackoverflow.com/q/8126311) – Peter Cordes Dec 15 '20 at 00:43
  • @prazuber: Oh, I just remembered, software prefetch would very likely *hurt* on your CPU. It's an Ivy Bridge, so it unfortunately has a performance bug: `prefetcht*` instructions have much worse throughput than they should. https://www.agner.org/optimize/blog/read.php?i=285&v=t. BTW, you'll probably see a mention of Ivy Bridge's "next page prefetcher" if you go digging into mem performance. AFAIK that's just a *TLB* speculative prefetch (page-walk), not data prefetching. (Contiguous virtual pages aren't necessarily physically contiguous, and HW prefetch only has access to physical addrs) – Peter Cordes Dec 15 '20 at 00:51
  • Thanks for valuable info Peter! I'll try some more experiments with prefetching but the root cause seems to be clear. I'll go ahead and accept this answer. – prazuber Dec 15 '20 at 00:59
  • @prazuber: I'd actually recommend that you *don't* accept this answer; it has some useful stuff (and thus deserves the upvotes it has) but it doesn't truly answer the question. There must be some microarchitectural reason why AVX is an actual slowdown, and 32-byte unaligned stores is apparently not it, given your results with `aligned_alloc`. The loads of 8 or 16 bytes should all be aligned, with `new` returning 16-byte aligned memory. (Which it should, given `alignof(max_align_t) == 16` in x86-64 System V. Unless it handles `new double[]` separately than e.g. `new long double[]`.) – Peter Cordes Dec 15 '20 at 01:04
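
To make the cache-blocking suggestion from the comments concrete, here is a hypothetical sketch. The second pass is a made-up placeholder for whatever other work the real program does with the same data; the point is to do all passes over one L2-sized chunk before moving on to the next:

// BLOCK chosen so one chunk of a, b and c together (~128 KiB) fits in a 256 KiB L2.
const int BLOCK = 8192;

for (int base = 0; base < size; base += BLOCK)
{
    int end = (base + BLOCK < size) ? base + BLOCK : size;

    // Pass 1: the original multiply-add on this chunk.
    for (int i = base; i < end; i++)
        c[i] += (double)a[i] * (double)b[i];

    // Pass 2: placeholder for additional work that reuses c while it's still hot in L2.
    for (int i = base; i < end; i++)
        c[i] *= 0.5;
}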

If I assume that the 16 MiB of data accessed effectively flushes the 12 MiB LLC, then essentially all of the traffic is to/from DRAM. Counting the reads of all three arrays and the writeback of c, the fastest SSE time corresponds to 20.48 GB/s of DRAM bandwidth, while the fastest AVX time corresponds to 18.88 GB/s. These are both similar to the ~19.5 GB/s best-case single-thread bandwidth (with the same R:W ratio) that I measured on a Xeon E5-2680 v2.
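
For reference, the arithmetic behind those numbers: each element moves 4 B (read a) + 4 B (read b) + 8 B (read c) + 8 B (write back c) = 24 bytes, so one pass over 2^20 elements moves 24 B x 2^20 ≈ 25.17 MB. 25.17 MB / 1229 us ≈ 20.48 GB/s for the best SSE run, and 25.17 MB / 1333 us ≈ 18.88 GB/s for the best AVX run. (The arrays themselves total 4 MiB + 4 MiB + 8 MiB = 16 MiB.)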

Two thoughts:

  1. It is generally very difficult to come to a satisfactory understanding of performance when the working set is so close to the LLC size. There are many, many reasons for this, including the interaction between the physical addresses used and the undocumented hash function that Intel uses to map addresses to LLC slices, as well as the rather bizarre dynamic LLC behavior documented at https://blog.stuffedcow.net/2013/01/ivb-cache-replacement/.
  2. For memory-bandwidth-limited code, both Sandy Bridge EP and Ivy Bridge EP delivered better sustained memory bandwidth using scalar code than 128-bit SSE vector code, which was in turn faster than 256-bit AVX vector code. In my observations, the effect is typically smaller than the difference you are seeing -- maybe 3-5% vs your ~13% median difference, but it may be larger at these sizes? I am not aware of any completely satisfactory explanation of this performance anomaly, but my interpretation is that "narrower" loads cause the L1 HW prefetcher to activate earlier, which gets the L2 HW prefetchers started earlier. Given that the transfer time for a 4KiB page is smaller than the latency for the initial load, the HW prefetchers are almost always in the early startup phase, where small changes in prefetch timing can be important....
– John D McCalpin
  • The OP may have enough ALU uops in the pipeline that it makes a difference to whatever causes that anomaly you mention in 2. Although I guess you see it with some actual math in your STREAM benchmark, but the OP's compiler defeats micro-fusion of the load+convert and the store, and there's a multiply uop in the middle. So it's more ALU work than a well-optimized triad, especially given the half-width loads to feed conversion. More uops for the ROB and RS means a shorter lookahead window into a new page. IDK why that would make the difference smaller, not larger, though. – Peter Cordes Dec 16 '20 at 01:02
  • TL:DR: as you say, there are many confounding factors in trying to predict / explain anything about this performance based on other measurements or known facts. – Peter Cordes Dec 16 '20 at 01:06