The execution of your threaded program is being serialised because of an extreme case of false sharing while accessing the elements of g. Here is a modified version of your program that avoids the false sharing and runs in (roughly) the same amount of time regardless of the number of threads, as long as each thread can be assigned to a different CPU core:
#include <iostream>
#include <thread>

#define NUMTHREADS 4

using namespace std;

// Pad the counters so that each thread's element lives in its own
// 64-byte cache line (16 ints of 4 bytes each).
int g[NUMTHREADS*16];
thread t[NUMTHREADS];

void task1(int x)
{
    // Spawn the next thread before doing the work, then join it afterwards.
    if (x+1 < NUMTHREADS)
        t[x] = thread(task1, x+1);
    for (int i = 0; i < 100000000; i++)
        g[x*16]++;
    if (x+1 < NUMTHREADS)
        t[x].join();
}

int main()
{
    task1(0);
    for (int i = 0; i < NUMTHREADS; i++)
        cout << g[i*16] << " ";
}
The run times with 1 and with 4 threads:
$ time ./a.out
100000000
./a.out 0.45s user 0.01s system 98% cpu 0.466 total
                                        ^^^^^^^^^^^
$ time ./a.out
100000000 100000000 100000000 100000000
./a.out 1.52s user 0.01s system 329% cpu 0.462 total
                                         ^^^^^^^^^^^
And here is a short explanation of what happens. Modern x86 CPUs access main memory in blocks of 64 bytes, called cache lines (unless non-temporal store or load instructions are used, which is not the case here). A single cache line of that size can hold up to 16 elements of an int array:
|     single cache line     | another cache line
|------+------+-----+-------|-------+-------+------
| g[0] | g[1] | ... | g[15] | g[16] | g[17] | ...
+------+------+-----+-------+-------+-------+------
   ^      ^
   |      |
   |      +------ thread 1 updates this element
   |
   +------------- thread 0 updates this element
x86 is a cache-coherent architecture, which means that when a cache line is modified by one core, the other cores are informed that their copy of that cache line is no longer valid and has to be reloaded from a higher level of the memory hierarchy, e.g. the shared L3 cache or the main memory. Since both the shared last-level cache and the main memory are much slower than the private caches of each core, this constant invalidation and reloading makes the execution much slower.
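As a side note, the 64-byte figure can be verified on the machine itself. Here is a minimal sketch, assuming a Linux/glibc environment (sysconf with _SC_LEVEL1_DCACHE_LINESIZE) and optionally C++17 for std::hardware_destructive_interference_size:

#include <iostream>
#include <new>        // std::hardware_destructive_interference_size (C++17)
#include <unistd.h>   // sysconf() - Linux/glibc specific

int main()
{
    // Cache line size as reported by the OS (may be -1 if unknown).
    long os_line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    std::cout << "L1D cache line (sysconf): " << os_line << " bytes\n";

#ifdef __cpp_lib_hardware_interference_size
    // Portable C++17 hint for how much padding avoids false sharing.
    std::cout << "hardware_destructive_interference_size: "
              << std::hardware_destructive_interference_size << " bytes\n";
#endif
}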
The modified version multiplies the index into g by 16:
|      one cache line       | another cache line
|------+------+-----+-------|-------+-------+------
| g[0] | g[1] | ... | g[15] | g[16] | g[17] | ...
+------+------+-----+-------+-------+-------+------
   ^                            ^
   |                            |
   |                            +------ thread 1 updates this element
   |
   +------------- thread 0 updates this element
It is now clear that no two threads update elements in the same cache line, and hence the cache coherency protocol does not have to constantly invalidate and reload the lines during the hot loop.
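The same padding can also be expressed more explicitly by aligning each counter to its own cache line with alignas. The sketch below assumes 64-byte lines and uses an illustrative padded_counter struct that is not part of the original program:

#include <iostream>
#include <thread>

#define NUMTHREADS 4

// Each counter occupies its own 64-byte cache line, so no two
// threads ever write to the same line.
struct alignas(64) padded_counter {
    int value;
};

padded_counter g[NUMTHREADS];
std::thread t[NUMTHREADS];

void task1(int x)
{
    if (x+1 < NUMTHREADS)
        t[x] = std::thread(task1, x+1);
    for (int i = 0; i < 100000000; i++)
        g[x].value++;
    if (x+1 < NUMTHREADS)
        t[x].join();
}

int main()
{
    task1(0);
    for (int i = 0; i < NUMTHREADS; i++)
        std::cout << g[i].value << " ";
}

With the alignment in place, sizeof(padded_counter) becomes 64, so consecutive array elements are guaranteed to land in different cache lines without any manual index arithmetic.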
The same effect is achieved when using stack variables. Thread stacks are usually large (at least several KiB) and aligned on a memory-page boundary, hence stack variables belonging to different threads never share a cache line. In addition, the compiler can optimise access to stack variables even further, for example by keeping them in registers.
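A minimal sketch of that stack-variable variant (not taken from the question): each thread increments a local counter and writes it back to g only once at the end, so the hot loop never touches shared memory.

#include <iostream>
#include <thread>

#define NUMTHREADS 4

int g[NUMTHREADS];
std::thread t[NUMTHREADS];

void task1(int x)
{
    if (x+1 < NUMTHREADS)
        t[x] = std::thread(task1, x+1);
    // The hot loop updates a stack variable; each thread's stack lives in
    // its own pages, so no cache line is shared between the threads.
    int local = 0;
    for (int i = 0; i < 100000000; i++)
        local++;
    g[x] = local;   // single write to shared memory per thread
    if (x+1 < NUMTHREADS)
        t[x].join();
}

int main()
{
    task1(0);
    for (int i = 0; i < NUMTHREADS; i++)
        std::cout << g[i] << " ";
}

Keep in mind that an optimising compiler may collapse such a pure counting loop into a single assignment, so timing comparisons like the ones above are only meaningful with optimisations disabled.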
See this answer for a somewhat more thorough explanation and for another way to prevent false sharing. Although it is about OpenMP, the concepts apply to your case too.