
I want to parallelize the code below:

for(int c=0; c<n; ++c) {
    Work(someArray, c);
}

I've done it this way:

#include <thread>
#include <vector>
#include <future>

auto iterationsPerCore = n / numCPU;
std::vector<std::future<void>> futures;

for(auto th = 0; th < numCPU; ++th) {
    // each task handles one contiguous chunk of iterationsPerCore iterations
    auto start = th * iterationsPerCore;
    auto ftr = std::async( std::launch::deferred | std::launch::async,
        [start, iterationsPerCore, someArray]()
        {
            for(auto m = start; m < start + iterationsPerCore; ++m)
                Work(someArray, m);
        }
    );
    futures.push_back(std::move(ftr));
}

for(auto& ftr : futures)
    ftr.wait();

// rest of the iterations: n % numCPU of them
for(auto r = numCPU * iterationsPerCore; r < n; ++r)
    Work(someArray, r);

The problem is that it runs only 50% faster on Intel CPUs, while on the AMD CPU it runs 300% faster. I ran it on three Intel CPUs (Nehalem 2-core + HT, Sandy Bridge 2-core + HT, Ivy Bridge 4-core + HT). The AMD processor is a Phenom II X2 with 4 cores unlocked. On the 2-core Intel processors it runs 50% faster with 4 threads; on the 4-core one it also runs only 50% faster with 4 threads. I'm testing with VS2012 on Windows 7.

When I try it with 8 threads, it is 8x slower than the serial loop on Intel. I suppose this is caused by HT.
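
For reference, numCPU in the code above could be taken from the hardware instead of being hard-coded. Note that std::thread::hardware_concurrency() counts logical processors, so with HT enabled it reports twice the number of physical cores; the fallback value below is an arbitrary choice:

#include <thread>

// Logical processor count; includes HT siblings and may return 0
// when the value cannot be determined.
unsigned numCPU = std::thread::hardware_concurrency();
if(numCPU == 0)
    numCPU = 2; // arbitrary fallback when detection fails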

What do you think about this? What's the reason for such behavior? Is my code perhaps incorrect?

Michal
  • Hyperthreading makes some workloads run slower when it is on. This may be one of them. Do a test with HT off so you restrict yourself to actual cores only. – Donnie Dec 12 '12 at 19:53
  • Which compiler did you use on which platform? – Stephan Dollberg Dec 12 '12 at 19:56
  • @Donnie On the 2-core Intel processor it runs 50% faster with 4 threads; on the 4-core one it also runs only 50% faster with 4 threads. When I try it with 8 threads, it is 8x slower than the serial loop. Tomorrow I'm going to switch HT off and check the results. Currently I have no access to a computer with an Intel CPU. – Michal Dec 12 '12 at 19:57
  • @bamboon VS2012, Windows 7. – Michal Dec 12 '12 at 19:58
  • Try with `std::launch::async` only, so you don't leave the decision to the runtime scheduler. Also, how large is your computation? – Stephan Dollberg Dec 12 '12 at 20:01
  • @bamboon I tried that, as well as raw std::thread, but performance went down ~5-10%. The size varies; n ranges from 80 to 10000. – Michal Dec 12 '12 at 20:06
  • What do you mean by different sizes, different from Intel to AMD? By size I meant more like how long does your benchmark take, is it microseconds or seconds? – Stephan Dollberg Dec 12 '12 at 20:10
  • @bamboon Different sizes of someArray. The full loop takes 1-10 seconds, depending on the processed array size. – Michal Dec 12 '12 at 20:12
  • Does the Phenom have a larger cache? – Collin Dec 12 '12 at 20:14
  • @Collin No, it doesn't. One of CPUs I test this code with is Intel i7-3770K. – Michal Dec 12 '12 at 20:22
  • Concrete information about the platform (which processors specifically were used, which compiler, how many work items you generate, and how long the computation takes on each platform) might help a lot in answering the question, so you really should edit that information into your question. – Grizzly Dec 12 '12 at 22:20
  • Have a read here: http://stackoverflow.com/questions/10939158/openmp-performance/11122648#comment17622454_11122648 – camelccc Nov 12 '13 at 23:40

3 Answers


I'd suspect false sharing. This is what happens when two variables share the same cache line. Effectively, all operations on them have to be very expensively synchronized even if they are not accessed concurrently, as the cache can only operate in terms of cache lines of a certain size, even if your operations are more fine-grained. I would suspect that the AMD hardware is simply more resilient or has a different hardware design to cope with this.

To test this, change the code so that each core only works on chunks that are multiples of 64 bytes. This should avoid any cache-line sharing, as these Intel CPUs have a 64-byte cache line.
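
A minimal sketch of such chunking, assuming someArray holds 4-byte ints (so 16 elements fill one 64-byte line), that the array itself starts on a 64-byte boundary, and that Work(someArray, m) touches only element m; the rounding logic is illustrative, not taken from the question:

#include <future>
#include <vector>

// 16 int elements fill one 64-byte cache line.
const int elemsPerLine = 64 / sizeof(int);

// Round each chunk up to a whole number of cache lines so that
// no line is ever written by two different threads.
int chunk = (n / numCPU + elemsPerLine - 1) / elemsPerLine * elemsPerLine;
if(chunk == 0)
    chunk = elemsPerLine;   // make sure the loop always advances

std::vector<std::future<void>> futures;
for(int begin = 0; begin < n; begin += chunk) {
    int end = (begin + chunk < n) ? begin + chunk : n;
    futures.push_back(std::async(std::launch::async,
        [someArray, begin, end]()
        {
            for(int m = begin; m < end; ++m)
                Work(someArray, m);
        }));
}
for(auto& ftr : futures)
    ftr.wait();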

Puppy
  • This was also what I was initially suspecting. But a second glance shows that both the Intel and the AMD machine have the same cache-line size. +1 anyways. – Mysticial Dec 12 '12 at 22:30
  • Not just cache _size_ but whether the L2 and L3 caches are shared between cores, and also possibly inclusive vs exclusive caches, which differ between AMD and Intel – Jonathan Wakely Dec 12 '12 at 22:40
  • Yeah, there's more going on in false sharing than just the cache line. – Puppy Dec 12 '12 at 23:39

I would say you need to change your compiler settings so that the compiled code minimizes the number of branches. The two CPU families have different look-ahead (branch prediction) designs. You need to match the compiler optimization settings to the target CPU, not to the CPU on which the code is compiled.
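
For illustration, with the VS2012 command-line compiler the relevant switches look like this (/favor is only accepted by the x64 toolset, main.cpp stands in for the real source file, and whether any of these actually helps here would have to be measured):

REM Tune code generation for the Intel machines (x64 toolset):
cl /O2 /arch:AVX /favor:INTEL64 main.cpp

REM Tune for the AMD machine instead:
cl /O2 /favor:AMD64 main.cpp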

Zagrev
  • How can I do it? Can you recommend me some resources about it? – Michal Dec 12 '12 at 21:15
  • First, I'd use GCC instead of VS2012. But here's a link to some documented options: http://msdn.microsoft.com/en-us/library/19z1t1wy.aspx You can check options like /MT /favor /Gr – Zagrev Dec 12 '12 at 21:20
  • How could bad sequential optimization harm parallel speedup? – usr Dec 12 '12 at 22:19
  • The worst thing would probably be the compiler's plan for look-ahead branch optimization. The compiler assumes the code will always proceed down a particular branch, but the parallel processes change the environment during execution, causing a different path to be taken and all the look-ahead work to be thrown away. This is actually the reason for the volatile keyword: it keeps the compiler from optimizing based on values calculated from that variable. – Zagrev Dec 14 '12 at 07:52

You should also be aware of the CPU cache. Here is a good article on this topic.

The short version: the hardware caches the data, but when all threads work on the same memory (someArray), the CPUs' caches have to synchronize with each other all the time, which can even make the code run slower than the single-threaded version.
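
A common way to reduce that synchronization is to give each thread its own padded piece of state, so writes from different threads never land in the same cache line. A minimal sketch (the PerThread struct is illustrative, not part of the question's code; 64 bytes is the cache-line size of the CPUs discussed above):

// One slot per thread, padded to a full 64-byte cache line so that
// two threads never write into the same line.
struct PerThread {
    long long localResult;              // data this thread writes
    char pad[64 - sizeof(long long)];   // pad the struct to 64 bytes
};

PerThread slots[8];                     // one slot per thread, indexed by thread number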

Kocka