AVX2 simd performs relatively worse to scalar at higher optimization level

Question

I am learning and playing with SIMD functions and wrote a simple program, that compares number of vector addition instruction it can run in 1 second compared with normal scalar addition. I found that SIMD performs relatively better at lower optimization level and consistently much worse at higher optimization levels, and I want to know the reason I used both MSVC and gcc, it is the same story. The following result is from Ryzen 7 CPU. I also tested on a Intel platform, pretty much the same story too.

#include <iostream>
#include <numeric>
#include <chrono>
#include <iterator>
#include <thread>
#include <atomic>
#include <vector>
#include <immintrin.h>
int main()
{
    const auto threadLimit = std::thread::hardware_concurrency() - 1; //for running main() 
    for (auto i = 1; i <= threadLimit; ++i)
    {
        std::cerr << "Testing " << i << " threads: ";
        std::atomic<unsigned long long> sumScalar {};
        std::atomic<unsigned long long> loopScalar {};
        std::atomic<unsigned long long> sumSimd {};
        std::atomic<unsigned long long> loopSimd {};
        std::atomic_bool stopFlag{ false };
        std::vector<std::thread> threads;
        threads.reserve(i);
        {
            for (auto j = 0; j < i; ++j)
                threads.emplace_back([&]
                    {
                        uint32_t local{};
                        uint32_t loop{};
                        while (!stopFlag)
                        {
                            ++local;
                            ++loop;  //removed this(see EDIT)
                        }
                        sumScalar += local;
                        loopScalar += loop;
                    });
            std::this_thread::sleep_for(std::chrono::seconds{ 1 });
            stopFlag = true;
            for (auto& thread : threads)
                thread.join();
        }
        threads.clear();
        stopFlag = false;
        {
            for (auto j = 0; j < i; ++j)
                threads.emplace_back([&]
                    {
                        const auto oneVec = _mm256_set1_epi32(1);
                        auto local = _mm256_set1_epi32(0);
                        uint32_t inc{};
                        while (!stopFlag)
                        {
                            local = _mm256_add_epi32(oneVec, local);
                            ++inc; //removed this(see EDIT)
                        }
                        sumSimd += std::accumulate(reinterpret_cast<uint32_t*>(&local), reinterpret_cast<uint32_t*>(&local) + 8, uint64_t{});
                        loopSimd += inc;
                    });
            std::this_thread::sleep_for(std::chrono::seconds{ 1 });
            stopFlag = true;
            for (auto& thread : threads)
                thread.join();
        }
        std::cout << "Sum: "<<sumSimd <<" / "<<sumScalar <<"("<<100.0*sumSimd/sumScalar<<"%)\t"<<"Loop: "<<loopSimd<<" / "<<loopScalar<<"("<< 100.0*loopSimd/loopScalar<<"%)\n";
    // SIMD/Scalar, higher value means SIMD better
    }
}

With g++ -O0 -march=native -lpthread, I got:

Testing 1 threads: Sum: 1004405568 / 174344207(576.105%)        Loop: 125550696 / 174344207(72.0131%)
Testing 2 threads: Sum: 2001473960 / 348079929(575.004%)        Loop: 250184245 / 348079929(71.8755%)
Testing 3 threads: Sum: 2991335152 / 521830834(573.238%)        Loop: 373916894 / 521830834(71.6548%)
Testing 4 threads: Sum: 3892119680 / 693704725(561.063%)        Loop: 486514960 / 693704725(70.1329%)
Testing 5 threads: Sum: 4957263080 / 802362140(617.834%)        Loop: 619657885 / 802362140(77.2292%)
Testing 6 threads: Sum: 5417700112 / 953587414(568.139%)        Loop: 677212514 / 953587414(71.0174%)
Testing 7 threads: Sum: 6078496824 / 1067533241(569.396%)       Loop: 759812103 / 1067533241(71.1746%)
Testing 8 threads: Sum: 6679841000 / 1196224828(558.41%)        Loop: 834980125 / 1196224828(69.8013%)
Testing 9 threads: Sum: 7396623960 / 1308004474(565.489%)       Loop: 924577995 / 1308004474(70.6861%)
Testing 10 threads: Sum: 8158849904 / 1416026963(576.179%)      Loop: 1019856238 / 1416026963(72.0224%)
Testing 11 threads: Sum: 8868695984 / 1556964234(569.615%)      Loop: 1108586998 / 1556964234(71.2018%)
Testing 12 threads: Sum: 9441092968 / 1655554694(570.268%)      Loop: 1180136621 / 1655554694(71.2835%)
Testing 13 threads: Sum: 9530295080 / 1689916907(563.951%)      Loop: 1191286885 / 1689916907(70.4938%)
Testing 14 threads: Sum: 10444142536 / 1805583762(578.436%)     Loop: 1305517817 / 1805583762(72.3045%)
Testing 15 threads: Sum: 10834255144 / 1926575218(562.358%)     Loop: 1354281893 / 1926575218(70.2948%)

With g++ -O3 -march=native -lpthread, I got:

Testing 1 threads: Sum: 2933270968 / 3112671000(94.2365%)       Loop: 366658871 / 3112671000(11.7796%)
Testing 2 threads: Sum: 5839842040 / 6177278029(94.5375%)       Loop: 729980255 / 6177278029(11.8172%)
Testing 3 threads: Sum: 8775103584 / 9219587924(95.1789%)       Loop: 1096887948 / 9219587924(11.8974%)
Testing 4 threads: Sum: 11350253944 / 10210948580(111.158%)     Loop: 1418781743 / 10210948580(13.8947%)
Testing 5 threads: Sum: 14487451488 / 14623220822(99.0715%)     Loop: 1810931436 / 14623220822(12.3839%)
Testing 6 threads: Sum: 17141556576 / 14437058094(118.733%)     Loop: 2142694572 / 14437058094(14.8416%)
Testing 7 threads: Sum: 19883362288 / 18313186637(108.574%)     Loop: 2485420286 / 18313186637(13.5718%)
Testing 8 threads: Sum: 22574437968 / 17115166001(131.897%)     Loop: 2821804746 / 17115166001(16.4872%)
Testing 9 threads: Sum: 25356792368 / 18332200070(138.318%)     Loop: 3169599046 / 18332200070(17.2898%)
Testing 10 threads: Sum: 28079398984 / 20747150935(135.341%)    Loop: 3509924873 / 20747150935(16.9176%)
Testing 11 threads: Sum: 30783433560 / 21801526415(141.199%)    Loop: 3847929195 / 21801526415(17.6498%)
Testing 12 threads: Sum: 33420443880 / 22794998080(146.613%)    Loop: 4177555485 / 22794998080(18.3266%)
Testing 13 threads: Sum: 35989535640 / 23596768252(152.519%)    Loop: 4498691955 / 23596768252(19.0649%)
Testing 14 threads: Sum: 38647578408 / 23796083111(162.412%)    Loop: 4830947301 / 23796083111(20.3014%)
Testing 15 threads: Sum: 41148330392 / 24252804239(169.664%)    Loop: 5143541299 / 24252804239(21.208%)

EDIT: After removing the loop variable, leaving just local in both cases (see edit in code), still the same result.

EDIT2: The results above is using GCC 9.3 on Ubuntu. I switched to GCC 10.2 on Windows (mingw), and it shows nice scaling see below (result is the original code). Pretty much can conclude it's MSVC and GCC older version's problem?

Testing 1 threads: Sum: 23752640416 / 3153263747(753.272%)      Loop: 2969080052 / 3153263747(94.159%)
Testing 2 threads: Sum: 46533874656 / 6012052456(774.01%)       Loop: 5816734332 / 6012052456(96.7512%)
Testing 3 threads: Sum: 66076900784 / 9260324764(713.548%)      Loop: 8259612598 / 9260324764(89.1936%)
Testing 4 threads: Sum: 92216030528 / 12229625883(754.038%)     Loop: 11527003816 / 12229625883(94.2548%)
Testing 5 threads: Sum: 111822357864 / 14439219677(774.435%)    Loop: 13977794733 / 14439219677(96.8044%)
Testing 6 threads: Sum: 122858189272 / 17693796489(694.357%)    Loop: 15357273659 / 17693796489(86.7947%)
Testing 7 threads: Sum: 148478021656 / 19618236169(756.837%)    Loop: 18559752707 / 19618236169(94.6046%)
Testing 8 threads: Sum: 156931719736 / 19770409566(793.771%)    Loop: 19616464967 / 19770409566(99.2213%)
Testing 9 threads: Sum: 143331726552 / 20753115024(690.652%)    Loop: 17916465819 / 20753115024(86.3315%)
Testing 10 threads: Sum: 143541178880 / 20331801415(705.993%)   Loop: 17942647360 / 20331801415(88.2492%)
Testing 11 threads: Sum: 160425817888 / 22209102603(722.343%)   Loop: 20053227236 / 22209102603(90.2928%)
Testing 12 threads: Sum: 157095281392 / 23178532051(677.762%)   Loop: 19636910174 / 23178532051(84.7202%)
Testing 13 threads: Sum: 156015224880 / 23818567634(655.015%)   Loop: 19501903110 / 23818567634(81.8769%)
Testing 14 threads: Sum: 145464754912 / 23950304389(607.361%)   Loop: 18183094364 / 23950304389(75.9201%)
Testing 15 threads: Sum: 149279587872 / 23585183977(632.938%)   Loop: 18659948484 / 23585183977(79.1172%)

@AlexLarionov I think you made a mistake. Both scalar and SIMD improves at ``-O3`` compared to ``-O0``, but SIMD instructions runs relatively much slower (means scalar improves much more, while SIMD improves less) relative to scalar instruction at ``-O3``. — sz ppeter, Aug 11 '20 at 15:09
@harold Yes, see my edit after. Practically no difference compared to the original code that has a ``loop`` variable. — sz ppeter, Aug 11 '20 at 15:55
Have you profiled to check which instruction seems to be taking the most time? — Mgetz, Aug 11 '20 at 16:00
@PeterCordes when I godbolted it an hour or so ago it didn't actually. That's one of my concerns. It probably can't because of the atomic check between iterations of the scalar run. — Mgetz, Aug 11 '20 at 16:32
@Mgetz: I wondered about that, thanks for checking. Yeah, producing a count 8x the number of reads of `stopFlag` would effectively count as optimizing away reads; vectorizing is like unrolling and then rolling up into a vector. I thought that might explain a `-O2` vs. `-O3` difference, but the question actually tested `-O0` debug mode. That wasn't what I expected from "high*er*" optimization. You can say that `-O0` isn't truly "no optimization" because GCC always does stuff within expressions, and/or the phrase is meaningless, but `-O0` has different bottlenecks. — Peter Cordes, Aug 11 '20 at 16:43
Possibly strict-aliasing UB is a problem when casting `uint32_t&` onto `__m256i` (a GNU C native vector typedef based on `unsigned long long`)? Or maybe there's a hyperthreading / SMT difference due to `-O0` code having a latency bottleneck vs. `-O3` code having a throughput bottleneck. — Peter Cordes, Aug 11 '20 at 16:44
Oh weird, the loops in the lambda functions keep reloading the *address* of the atomic bool https://godbolt.org/z/6r4oMY. — Peter Cordes, Aug 11 '20 at 16:49
On my Intel Skylake, i7-6700k, I get 267% for every number of threads, with g++ 10.1 `-O3 -march=skylake`, Arch GNU/Linux, energy_performance_preference=balance_power (max clocks = 3.9GHz with any # of cores active). With `++loop` commented, I get the expected 800 +- 1 %, with scalar and vector loops having the same number of uops. (4, probably running at 1 iteration per cycle). With an extra `inc` in the SIMD loop, it becomes 5 uops, and probably suffers from some nasty front-end effect. **What exact CPU models did you test on?** Was it Zen1 where `vpaddd ymm` decodes to 2 uops? — Peter Cordes, Aug 11 '20 at 16:54
@PeterCordes Sorry I am not expert in assembly, but I appreciate it if you can take a look at ``-S`` assembly output with ``-O3 -march=native`` flag without the loop variable in both cases, from my machine [here](https://hastebin.com/rohesodami.pl). I am using a Ryzen 1700 CPU, with g++ 9.3. — sz ppeter, Aug 11 '20 at 17:06
@szppeter we're referring to this https://godbolt.org/z/b7xra8 in effect just for reference. — Mgetz, Aug 11 '20 at 17:10
And the intel (E5-2670V3) version [here](https://hastebin.com/etisiwugag.pl), it's GCC 4.85 with ``-O3 -march=native -lpthread`` flag — sz ppeter, Aug 11 '20 at 17:12
Oh weird, gcc9.3 is storing/reloading your vector inside the loop, but keeping the scalar in a register for `inc eax`! Reproducible on Godbolt https://godbolt.org/z/G73TEj. That's really surprising for `-O3`, and a missed optimization because GCC10 doesn't do that. And apparently the store-forwarding latency bottleneck is almost exactly 8x longer than the scalar loop's 1/clock speed. — Peter Cordes, Aug 11 '20 at 17:12
@PeterCordes for what it's worth the scalar version will always be faster even if you increase the amount of work to match. The reason being that the compiler can't reason about the final result in the vectorized version where as if you do `local2` etc to increase the work load then it just multiplies the result by 4 later. E.g. this is just bad comparison. — Mgetz, Aug 11 '20 at 17:18
@Mgetz: Turns out there was an interesting effect here, with the pointer-casting hsum leading to worse code in the preceding loop. Agreed the actual benchmark is not very useful; `++` vs `_mm256_add_epi32` depends on surrounding code and isn't something you can measure once and apply everywhere. But it did reveal an interesting GCC9 missed optimization. — Peter Cordes, Aug 11 '20 at 18:09

score 5 · Accepted Answer · answered Aug 11 '20 at 18:07

reinterpret_cast<uint32_t*>(&local) after the loop is getting GCC9 to store/reload local inside the loop, creating a store-forwarding bottleneck.

This is already fixed in GCC10; no need to file a missed-optimization bug. Don't cast pointers onto __m256i locals; it also violates strict-aliasing so it's Undefined Behaviour without -fno-strict-aliasing even though GCC often makes it work. (You can point __m256i* at any other type, but not vice versa.)

gcc9.3 (which you're using) is storing/reloading your vector inside the loop, but keeping the scalar in a register for inc eax!

The vector loop thus bottlenecks on the latency of vector store-forwarding plus vpaddd, and that happens to be just over 8x slower than the scalar loop. Their bottlenecks are unrelated, being close to 1x total speed is just coincidence.

(The scalar loop presumably runs at 1 cycle per iteration on Zen1 or Skylake, and 7 cycle store-forwarding plus 1 for vpaddd sounds about right).

It's indirectly caused by reinterpret_cast<uint32_t*>(&local), either because of GCC trying to be forgiving of the strict-aliasing undefined-behaviour violation, or just because you're taking a pointer to the local at all.

This is not normal or expected, but the combination of the atomic load inside the inner loop and maybe the lambda confuse GCC9 into making this mistake. (Note that GCC9 and 10 are reloading the address of stopFlag from the thread function arg inside the loop, even for scalar, so there's already some failure to keep things in registers.)

In normal use-cases, you'll be doing more SIMD work per check of a stop flag, and often you wouldn't be keeping vector state across iterations. And usually you'll have a non-atomic arg that tells you how much work to do, not a stop-flag you check inside the inner loop. So this missed-opt bug is rarely a problem. (Unless it happens even without an atomic flag?)

Reproducible on Godbolt, showing -DUB_TYPEPUN vs. -UUB_TYPEPUN for source where I used #ifdef to use your unsafe (and missed-opt-triggering) version vs. a safe one with manually-vectorized shuffles from Fastest method to calculate sum of all packed 32-bit integers using AVX512 or AVX2. (That manual hsum doesn't widen before adding so it could overflow and wrap. But that's not the point; using different manual shuffles, or _mm256_store_si256 to a separate array, would be possible to get the result you want without strict-aliasing undefined behaviour.)

The scalar loop is:

# g++9.3 -O3 -march=znver1
.L5:                                      # do{
        inc     eax                         # local++
.L3:
        mov     rdx, QWORD PTR [rdi+8]      # load the address of stopFlag from the lambda
        movzx   edx, BYTE PTR [rdx]         # zero-extend *&stopFlag into EDX
        test    dl, dl
        je      .L5                       # }while(stopFlag == 0)

The vector loop, with g++ 9.3, -O3 -march=znver1, using your reinterpret_cast (i.e. -DUB_TYPEPUN in my version of the source):

# g++9.3 -O3 -march=znver1  with your pointer-cast onto the vector

 # ... ymm1 = _mm256_set1_epi32(1)
.L10:                                               # do {
        vpaddd  ymm1, ymm0, YMMWORD PTR [rsp-32]       # memory-source add with set1(1)
        vmovdqa YMMWORD PTR [rsp-32], ymm1             # store back into stack memory
.L8:
        mov     rax, QWORD PTR [rdi+8]                  # load flag address
        movzx   eax, BYTE PTR [rax]                     # load stopFlag
        test    al, al
        je      .L10                                # }while(stopFlag == 0)

... auto-vectorized hsum, zero-extending elements to 64-bit for vpaddq

But with a safe __m256i horizontal sum that avoids a pointer onto local at all, local stays in a register.

#      ymm1 = _mm256_set1_epi32(1)
.L9:
        vpaddd  ymm0, ymm1, ymm0             # local += set1(1),  staying in a register, ymm0
.L8:
        mov     rax, QWORD PTR [rdi+8]       # same loop overhead, still 3 uops (with fusion of test/je)
        movzx   eax, BYTE PTR [rax]
        test    al, al
        je      .L9

... manually-vectorized 32-bit hsum

On my Intel Skylake, i7-6700k, I get the expected 800 +- 1% for every number of threads, with g++ 10.1 -O3 -march=skylake, Arch GNU/Linux, energy_performance_preference=balance_power (max clocks = 3.9GHz with any # of cores active).

Scalar and vector loops having the same number of uops and no different bottlenecks, so they run at identical cycles / iteration. (4, perhaps running at 1 iteration per cycle if it can keep those address -> value chains of stopflag loads in flight).

Zen1 could be different because vpaddd ymm is 2 uops. But its front-end is wide enough to probably still run that loop at 1 cycle per iteration so you might see 800% there, too.

With ++loop uncommented, I get ~267% "SIMD speed". With an extra inc in the SIMD loop, it becomes 5 uops, and probably suffers from some nasty front-end effect on Skylake.

-O0 benchmarking is meaningless in general, it has different bottlenecks (usually store/reload from keeping everything in memory), and SIMD intrinsics usually have a lot of extra overhead at -O0. Although in this case, even -O3 was bottlenecking on store/reload for the SIMD loop.

AVX2 simd performs relatively worse to scalar at higher optimization level

1 Answers1

Linked