I want to parallelize the loop below:
for(int c = 0; c < n; ++c) {
    Work(someArray, c);
}
I've done it this way:
#include <future>   // std::async, std::future
#include <thread>
#include <vector>
auto iterationsPerCore = n/numCPU;
std::vector<std::future<void>> futures;
for(auto th = 0; th < numCPU; ++th) {
    for(auto n = th * iterationsPerCore; n < (th+1) * iterationsPerCore; ++n) {
        // deferred | async leaves the choice of launch policy to the implementation
        auto ftr = std::async( std::launch::deferred | std::launch::async,
            [n, iterationsPerCore, someArray]()
            {
                for(auto m = n; m < n + iterationsPerCore; ++m)
                    Work(someArray, m);
            }
        );
        futures.push_back(std::move(ftr));
    }
    // wait for every task launched so far before moving on to the next th
    for(auto& ftr : futures)
        ftr.wait();
}
// remaining iterations: n % numCPU
for(auto r = numCPU * iterationsPerCore; r < n; ++r)
    Work(someArray, r);
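The snippet does not show where numCPU comes from. A minimal sketch of one possible way to obtain it, assuming the standard std::thread::hardware_concurrency() is acceptable: it returns a hint for the number of concurrent threads the hardware supports (0 if unknown), which on Hyper-Threading parts usually counts logical processors rather than physical cores. detectLogicalCPUs and physicalCoreEstimate are hypothetical helper names, and threadsPerCore is an assumed caller-supplied value, since the standard library does not report it.

```cpp
#include <cassert>
#include <thread>

// Hypothetical helper: number of hardware threads reported by the standard
// library, falling back to 1 when the implementation returns 0 (unknown).
unsigned detectLogicalCPUs() {
    const unsigned hw = std::thread::hardware_concurrency();
    return hw != 0 ? hw : 1u;
}

// Hypothetical helper: rough physical-core count from a logical count.
// Assumption: the caller knows threadsPerCore (e.g. 2 with Hyper-Threading,
// 1 without); this value is not queried from the system here.
unsigned physicalCoreEstimate(unsigned logical, unsigned threadsPerCore) {
    assert(threadsPerCore > 0);
    const unsigned cores = logical / threadsPerCore;
    return cores != 0 ? cores : 1u;
}
```

With these, numCPU could be set to either detectLogicalCPUs() or physicalCoreEstimate(detectLogicalCPUs(), 2) to compare both thread counts on the same machine.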
The problem is that it runs only 50% faster on Intel CPUs, while on AMD it runs 300% faster. I ran it on three Intel CPUs (Nehalem 2-core + HT, Sandy Bridge 2-core + HT, Ivy Bridge 4-core + HT). The AMD processor is a Phenom II X2 with 4 cores unlocked. On the 2-core Intel processors it runs 50% faster with 4 threads. On the 4-core one it also runs 50% faster with 4 threads. I'm testing with VS2012 on Windows 7.
When I try it with 8 threads, it is 8x slower than the serial loop on Intel. I suppose this is caused by HT.
What do you think? What is the reason for this behavior? Could the code be incorrect?
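One way to sanity-check the loop bounds without any threading is to replay the same index arithmetic serially and count how often each index would be passed to Work. The sketch below does that under stated assumptions: countCalls and totalCalls are hypothetical helpers that are not part of the code above, Work is replaced by a counter increment, and the counter vector is deliberately made larger than n so that any index at or beyond n is recorded rather than written out of bounds.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: mirrors the loop structure of the parallel version,
// but increments a counter instead of calling Work, so the result shows
// how many times each index would be processed.
std::vector<int> countCalls(int n, int numCPU) {
    assert(n >= 0 && numCPU > 0);
    const int iterationsPerCore = n / numCPU;
    // Extra iterationsPerCore slots make indices >= n observable.
    std::vector<int> calls(static_cast<std::size_t>(n + iterationsPerCore), 0);
    for (int th = 0; th < numCPU; ++th) {
        for (int start = th * iterationsPerCore;
             start < (th + 1) * iterationsPerCore; ++start) {
            // Body of the lambda that is handed to std::async.
            for (int m = start; m < start + iterationsPerCore; ++m)
                ++calls[static_cast<std::size_t>(m)];
        }
    }
    // Remainder loop that runs after the tasks.
    for (int r = numCPU * iterationsPerCore; r < n; ++r)
        ++calls[static_cast<std::size_t>(r)];
    return calls;
}

// Hypothetical helper: total number of Work invocations.
int totalCalls(const std::vector<int>& calls) {
    return std::accumulate(calls.begin(), calls.end(), 0);
}
```

For example, totalCalls(countCalls(8, 4)) can be compared against n = 8. If every index were handled exactly once, the two would match, every entry below n would be 1, and no entry at position n or beyond would be non-zero.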