I am learning and playing with SIMD functions and wrote a simple program, that compares number of vector addition instruction it can run in 1 second compared with normal scalar addition. I found that SIMD performs relatively better at lower optimization level and consistently much worse at higher optimization levels, and I want to know the reason I used both MSVC and gcc, it is the same story. The following result is from Ryzen 7 CPU. I also tested on a Intel platform, pretty much the same story too.
#include <iostream>
#include <numeric>
#include <chrono>
#include <iterator>
#include <thread>
#include <atomic>
#include <vector>
#include <immintrin.h>
int main()
{
const auto threadLimit = std::thread::hardware_concurrency() - 1; //for running main()
for (auto i = 1; i <= threadLimit; ++i)
{
std::cerr << "Testing " << i << " threads: ";
std::atomic<unsigned long long> sumScalar {};
std::atomic<unsigned long long> loopScalar {};
std::atomic<unsigned long long> sumSimd {};
std::atomic<unsigned long long> loopSimd {};
std::atomic_bool stopFlag{ false };
std::vector<std::thread> threads;
threads.reserve(i);
{
for (auto j = 0; j < i; ++j)
threads.emplace_back([&]
{
uint32_t local{};
uint32_t loop{};
while (!stopFlag)
{
++local;
++loop; //removed this(see EDIT)
}
sumScalar += local;
loopScalar += loop;
});
std::this_thread::sleep_for(std::chrono::seconds{ 1 });
stopFlag = true;
for (auto& thread : threads)
thread.join();
}
threads.clear();
stopFlag = false;
{
for (auto j = 0; j < i; ++j)
threads.emplace_back([&]
{
const auto oneVec = _mm256_set1_epi32(1);
auto local = _mm256_set1_epi32(0);
uint32_t inc{};
while (!stopFlag)
{
local = _mm256_add_epi32(oneVec, local);
++inc; //removed this(see EDIT)
}
sumSimd += std::accumulate(reinterpret_cast<uint32_t*>(&local), reinterpret_cast<uint32_t*>(&local) + 8, uint64_t{});
loopSimd += inc;
});
std::this_thread::sleep_for(std::chrono::seconds{ 1 });
stopFlag = true;
for (auto& thread : threads)
thread.join();
}
std::cout << "Sum: "<<sumSimd <<" / "<<sumScalar <<"("<<100.0*sumSimd/sumScalar<<"%)\t"<<"Loop: "<<loopSimd<<" / "<<loopScalar<<"("<< 100.0*loopSimd/loopScalar<<"%)\n";
// SIMD/Scalar, higher value means SIMD better
}
}
With g++ -O0 -march=native -lpthread
, I got:
Testing 1 threads: Sum: 1004405568 / 174344207(576.105%) Loop: 125550696 / 174344207(72.0131%)
Testing 2 threads: Sum: 2001473960 / 348079929(575.004%) Loop: 250184245 / 348079929(71.8755%)
Testing 3 threads: Sum: 2991335152 / 521830834(573.238%) Loop: 373916894 / 521830834(71.6548%)
Testing 4 threads: Sum: 3892119680 / 693704725(561.063%) Loop: 486514960 / 693704725(70.1329%)
Testing 5 threads: Sum: 4957263080 / 802362140(617.834%) Loop: 619657885 / 802362140(77.2292%)
Testing 6 threads: Sum: 5417700112 / 953587414(568.139%) Loop: 677212514 / 953587414(71.0174%)
Testing 7 threads: Sum: 6078496824 / 1067533241(569.396%) Loop: 759812103 / 1067533241(71.1746%)
Testing 8 threads: Sum: 6679841000 / 1196224828(558.41%) Loop: 834980125 / 1196224828(69.8013%)
Testing 9 threads: Sum: 7396623960 / 1308004474(565.489%) Loop: 924577995 / 1308004474(70.6861%)
Testing 10 threads: Sum: 8158849904 / 1416026963(576.179%) Loop: 1019856238 / 1416026963(72.0224%)
Testing 11 threads: Sum: 8868695984 / 1556964234(569.615%) Loop: 1108586998 / 1556964234(71.2018%)
Testing 12 threads: Sum: 9441092968 / 1655554694(570.268%) Loop: 1180136621 / 1655554694(71.2835%)
Testing 13 threads: Sum: 9530295080 / 1689916907(563.951%) Loop: 1191286885 / 1689916907(70.4938%)
Testing 14 threads: Sum: 10444142536 / 1805583762(578.436%) Loop: 1305517817 / 1805583762(72.3045%)
Testing 15 threads: Sum: 10834255144 / 1926575218(562.358%) Loop: 1354281893 / 1926575218(70.2948%)
With g++ -O3 -march=native -lpthread
, I got:
Testing 1 threads: Sum: 2933270968 / 3112671000(94.2365%) Loop: 366658871 / 3112671000(11.7796%)
Testing 2 threads: Sum: 5839842040 / 6177278029(94.5375%) Loop: 729980255 / 6177278029(11.8172%)
Testing 3 threads: Sum: 8775103584 / 9219587924(95.1789%) Loop: 1096887948 / 9219587924(11.8974%)
Testing 4 threads: Sum: 11350253944 / 10210948580(111.158%) Loop: 1418781743 / 10210948580(13.8947%)
Testing 5 threads: Sum: 14487451488 / 14623220822(99.0715%) Loop: 1810931436 / 14623220822(12.3839%)
Testing 6 threads: Sum: 17141556576 / 14437058094(118.733%) Loop: 2142694572 / 14437058094(14.8416%)
Testing 7 threads: Sum: 19883362288 / 18313186637(108.574%) Loop: 2485420286 / 18313186637(13.5718%)
Testing 8 threads: Sum: 22574437968 / 17115166001(131.897%) Loop: 2821804746 / 17115166001(16.4872%)
Testing 9 threads: Sum: 25356792368 / 18332200070(138.318%) Loop: 3169599046 / 18332200070(17.2898%)
Testing 10 threads: Sum: 28079398984 / 20747150935(135.341%) Loop: 3509924873 / 20747150935(16.9176%)
Testing 11 threads: Sum: 30783433560 / 21801526415(141.199%) Loop: 3847929195 / 21801526415(17.6498%)
Testing 12 threads: Sum: 33420443880 / 22794998080(146.613%) Loop: 4177555485 / 22794998080(18.3266%)
Testing 13 threads: Sum: 35989535640 / 23596768252(152.519%) Loop: 4498691955 / 23596768252(19.0649%)
Testing 14 threads: Sum: 38647578408 / 23796083111(162.412%) Loop: 4830947301 / 23796083111(20.3014%)
Testing 15 threads: Sum: 41148330392 / 24252804239(169.664%) Loop: 5143541299 / 24252804239(21.208%)
EDIT: After removing the loop
variable, leaving just local
in both cases (see edit in code), still the same result.
EDIT2: The results above is using GCC 9.3 on Ubuntu. I switched to GCC 10.2 on Windows (mingw), and it shows nice scaling see below (result is the original code). Pretty much can conclude it's MSVC and GCC older version's problem?
Testing 1 threads: Sum: 23752640416 / 3153263747(753.272%) Loop: 2969080052 / 3153263747(94.159%)
Testing 2 threads: Sum: 46533874656 / 6012052456(774.01%) Loop: 5816734332 / 6012052456(96.7512%)
Testing 3 threads: Sum: 66076900784 / 9260324764(713.548%) Loop: 8259612598 / 9260324764(89.1936%)
Testing 4 threads: Sum: 92216030528 / 12229625883(754.038%) Loop: 11527003816 / 12229625883(94.2548%)
Testing 5 threads: Sum: 111822357864 / 14439219677(774.435%) Loop: 13977794733 / 14439219677(96.8044%)
Testing 6 threads: Sum: 122858189272 / 17693796489(694.357%) Loop: 15357273659 / 17693796489(86.7947%)
Testing 7 threads: Sum: 148478021656 / 19618236169(756.837%) Loop: 18559752707 / 19618236169(94.6046%)
Testing 8 threads: Sum: 156931719736 / 19770409566(793.771%) Loop: 19616464967 / 19770409566(99.2213%)
Testing 9 threads: Sum: 143331726552 / 20753115024(690.652%) Loop: 17916465819 / 20753115024(86.3315%)
Testing 10 threads: Sum: 143541178880 / 20331801415(705.993%) Loop: 17942647360 / 20331801415(88.2492%)
Testing 11 threads: Sum: 160425817888 / 22209102603(722.343%) Loop: 20053227236 / 22209102603(90.2928%)
Testing 12 threads: Sum: 157095281392 / 23178532051(677.762%) Loop: 19636910174 / 23178532051(84.7202%)
Testing 13 threads: Sum: 156015224880 / 23818567634(655.015%) Loop: 19501903110 / 23818567634(81.8769%)
Testing 14 threads: Sum: 145464754912 / 23950304389(607.361%) Loop: 18183094364 / 23950304389(75.9201%)
Testing 15 threads: Sum: 149279587872 / 23585183977(632.938%) Loop: 18659948484 / 23585183977(79.1172%)