
Consider this recursive multi-threaded program:

#include <iostream>
#include <thread>

#define NUMTHREADS 4
using namespace std;

int g[NUMTHREADS];

thread t[NUMTHREADS];

void task1(int x)
{
    if(x+1<NUMTHREADS)
        t[x] = thread(task1, x+1);
    for(int i=0;i<100000000;i++)
        g[x]++;
    if(x+1<NUMTHREADS)
        t[x].join();
}

int main()
{
    task1(0);
    for(int i=0;i<NUMTHREADS;i++)
        cout<<g[i]<<" ";
}

I expected the thread overhead to be insignificant, but in fact the runtime of the program increases linearly with the number of threads.

Here are some timings on my 6-core CPU:

NUMTHREADS = 1:

$ time ./a
100000000
real    0m0.330s
user    0m0.312s
sys     0m0.015s

NUMTHREADS = 2:

$ time ./a
100000000 100000000
real    0m0.742s
user    0m1.404s
sys     0m0.015s

NUMTHREADS = 3:

$ time ./a
100000000 100000000 100000000
real    0m1.038s
user    0m2.792s
sys     0m0.000s

NUMTHREADS = 4:

$ time ./a
100000000 100000000 100000000 100000000
real    0m1.511s
user    0m5.616s
sys     0m0.015s

Any idea why this might be?

user816318
  • Your assumption is wrong, the overhead IS significant. Comparing performance without optimizations is almost meaningless, and just about any basic optimizer will replace that trivial loop by a single instruction. – DanielKO Sep 06 '13 at 05:39
  • Why is comparing performance without optimizations meaningless? I didn't use `-O2`, and the timings reflect that the loop is still being executed. With `-O2` it takes around 0.023s to execute with 4 threads, so creating and joining the 4 threads does not account for 1.5 seconds of overhead. – user816318 Sep 06 '13 at 05:47
  • That `-O2` changes a lot of things, not just the loop. You could also be having a [false sharing](http://en.wikipedia.org/wiki/False_sharing) issue, where each CPU is updating distinct positions that all fall into the same cache line; so the more cores you put to work, the slower it gets. Try spreading the elements of `g` so they are at least 64 bytes apart, see if there's any change. And even when you deal with possible cache problems, you will still be bound to RAM bandwidth (your code in particular will probably be memory I/O bound). – DanielKO Sep 06 '13 at 05:58
  • You were right, it is false sharing. Is there a way to avoid this other than manually padding the code with dummy variables? – user816318 Sep 06 '13 at 06:12
  • Sweet, I changed the loop so it works with a variable on the stack and it's fixed. – user816318 Sep 06 '13 at 06:22

4 Answers


The execution of your threaded program is being serialised because of an extreme case of false sharing when accessing the elements of g. Here is a modified version of your program that avoids false sharing and runs in the same amount of time regardless of the number of threads, as long as each thread can be assigned to a different CPU core:

#include <iostream>
#include <thread>

#define NUMTHREADS 4
using namespace std;

int g[NUMTHREADS*16];

thread t[NUMTHREADS];

void task1(int x)
{
    if(x+1<NUMTHREADS)
        t[x] = thread(task1, x+1);
    for(int i=0;i<100000000;i++)
        g[x*16]++;
    if(x+1<NUMTHREADS)
        t[x].join();
}

int main()
{
    task1(0);
    for(int i=0;i<NUMTHREADS;i++)
        cout<<g[i*16]<<" ";
}

The run times with 1 and with 4 threads:

$ time ./a.out
100000000
./a.out  0.45s user 0.01s system 98% cpu 0.466 total
                                         ^^^^^^^^^^^
$ time ./a.out
100000000 100000000 100000000 100000000
./a.out  1.52s user 0.01s system 329% cpu 0.462 total
                                          ^^^^^^^^^^^

And here is a short explanation of what happens. Modern x86 CPUs access main memory in blocks of 64 bytes, called cache lines (unless non-temporal store or load instructions are used, but that is not the case here). A single cache line of that size can accommodate up to 16 elements of an int array:

|    single cache line      |  another cache line
|------+------+-----+-------|-------+-------+------
| g[0] | g[1] | ... | g[15] | g[16] | g[17] | ...
+------+------+-----+-------+-------+-------+------
    ^      ^
    |      |
    |      +------ thread 1 updates this element
    |
    +------------- thread 0 updates this element

x86 is a cache-coherent architecture, which means that when a cache line is modified by one core, the other cores are informed that their copy of the same cache line is no longer valid and has to be reloaded from an upper level of the memory hierarchy, e.g. the shared L3 cache or the main memory. Since both the shared last-level cache and the main memory are much slower than the private caches of each core, this leads to much slower execution.

The modified version multiplies the index into g by 16:

|     one cache line        |  another cache line
|------+------+-----+-------|-------+-------+------
| g[0] | g[1] | ... | g[15] | g[16] | g[17] | ...
+------+------+-----+-------+-------+-------+------
    ^                           ^
    |                           |
    |                           +------ thread 1 updates this element
    |
    +------------- thread 0 updates this element

It is now clear that no two threads share the same cache line and hence the cache coherency protocol is not involved in the process.

The same effect is achieved when using stack variables. Thread stacks are usually large (at least several KiB) and aligned on a memory page boundary, hence stack variables in different threads never share the same cache line. The compiler also further optimises access to stack variables.
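
For illustration, a minimal sketch of that stack-variable approach, as mentioned in the comments under the question (only the loop changes; the rest of the program stays as in the question):

void task1(int x)
{
    if(x+1<NUMTHREADS)
        t[x] = thread(task1, x+1);
    int local = 0;                  // private to this thread's stack, so no false sharing
    for(int i=0;i<100000000;i++)
        local++;
    g[x] = local;                   // a single shared write at the very end
    if(x+1<NUMTHREADS)
        t[x].join();
}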

See this answer for a somewhat more thorough explanation and for another way to prevent false sharing. Although it is about OpenMP, the concepts apply to your case too.
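
A related way to get the same separation without the manual *16 index arithmetic is to over-align each counter with C++11 alignas. A minimal sketch, assuming a 64-byte cache line (the PaddedCounter name is made up for illustration):

struct alignas(64) PaddedCounter    // the struct is padded out to 64 bytes, so each
{                                   // array element sits in its own cache line
    int value;
};

PaddedCounter g[NUMTHREADS];        // replaces the plain int array

// inside task1:  g[x].value++;
// in main:       cout << g[i].value << " ";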

Hristo Iliev

Threads improve performance when they run on separate cores, in parallel with each other. So bind each thread to a different core by setting its thread affinity. It is probable that your threads are all running on a single core.

So if threads 1, 2, 3 and 4 are allocated to different cores 1, 2, 3 and 4 (don't use 0), then all of them will increment their index simultaneously. Look at /proc/cpuinfo to see the cores of your processor, and use thread->setAffinity(core_number); to set the core for a thread.
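
(For completeness: std::thread has no setAffinity member, as the comment below points out. On Linux with libstdc++ one would go through native_handle() and the GNU extension pthread_setaffinity_np; a hedged sketch of such a helper, for illustration only, where the pin_to_core name is made up:)

#include <pthread.h>   // pthread_setaffinity_np is a GNU extension; g++ on Linux
#include <sched.h>     // defines _GNU_SOURCE by default, which makes it visible
#include <thread>

// Pin a std::thread to a single CPU core (no error handling, sketch only).
void pin_to_core(std::thread &th, int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(th.native_handle(), sizeof(set), &set);
}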

rahul.deshmukhpatil
  • Manually setting affinity is not the solution here, and should be avoided except in special cases. It definitely shouldn't be done blindly in the hope that it will change things; the scheduler will probably do a better job than you will. `std::thread` has no `setAffinity` member anyway. – Jonathan Wakely Sep 06 '13 at 09:04

You should:

  1. compile your code with at least -O2 optimization.

  2. declare the variable g as volatile, otherwise it may be optimized out during compilation.

On my 2-core machine, NUMTHREADS=1 and NUMTHREADS=2 take almost the same amount of time.
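
For reference, the suggested change amounts to something like the following sketch (whether volatile is the right tool for a benchmark is debated in the comments below):

volatile int g[NUMTHREADS];   // volatile: the increments must actually be performed,
                              // so -O2 cannot remove the loop

// compile with optimizations and thread support, e.g.
//     g++ -O2 -std=c++11 -pthread program.cpp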

richselian
  • Declaring g as volatile will cause the program to take more time, as the optimization will be removed. Anyway, he is not modifying the same memory from two different threads. – rahul.deshmukhpatil Sep 06 '13 at 05:25
  • Using -O2, the loop will be optimized down to just one instruction, so the program will run instantly for any small number of threads. – user816318 Sep 06 '13 at 05:33
  • @rahul.deshmukhpatil `volatile` is commonly used for benchmarks; a practical program can hardly have its variable optimized out, but for a benchmark you should also make sure the variable is not optimized out. – richselian Sep 09 '13 at 05:00
  • @user816318, `volatile` prevents it from being optimized out. – richselian Sep 09 '13 at 05:01
  • Please refer to http://en.wikipedia.org/wiki/Volatile_variable#Optimization_comparison_in_C, which says that the optimization will be removed, so the code may take more time to execute with volatile variables. – rahul.deshmukhpatil Sep 11 '13 at 08:41

I believe there is quite a lot of overhead in creating and joining threads like that (see this question C++11: std::thread pooled?). Check out something like OpenMP if you want to parallelize for efficiency.
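
For comparison, an OpenMP version of the same workload might look roughly like this (a sketch, compiled with -fopenmp; the per-thread local counter also avoids the false sharing discussed above):

#include <iostream>
#include <omp.h>

#define NUMTHREADS 4

int g[NUMTHREADS];

int main()
{
    #pragma omp parallel num_threads(NUMTHREADS)
    {
        int local = 0;                      // each thread counts on its own stack
        for(int i=0;i<100000000;i++)
            local++;
        g[omp_get_thread_num()] = local;    // one write back per thread
    }
    for(int i=0;i<NUMTHREADS;i++)
        std::cout<<g[i]<<" ";
}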

DrBards
  • I believe it isn't the thread overhead, because I tested the program with different numbers of loop iterations: using 1 billion instead of 100 million, each of the timings is roughly multiplied by 10. – user816318 Sep 06 '13 at 05:37