
I ran into a very strange problem with the C++ multi-threaded program below.

#include <iostream>
#include <thread>
#include <ctime>    // for clock() and time_t
using namespace std;
int* counter = new int[1024];
void updateCounter(int position)
{
    for (int j = 0; j < 100000000; j++)
    {
        counter[position] = counter[position] + 8;
    }
}
int main() {
    time_t begin, end;
    begin = clock();
    thread t1(updateCounter, 1);
    thread t2(updateCounter, 2);
    thread t3(updateCounter, 3);
    thread t4(updateCounter, 4);
    t1.join();
    t2.join();
    t3.join();
    t4.join();
    end = clock();
    cout<<end-begin<<endl;  //1833
    begin = clock();
    thread t5(updateCounter, 16);
    thread t6(updateCounter, 32);
    thread t7(updateCounter, 48);
    thread t8(updateCounter, 64);
    t5.join();
    t6.join();
    t7.join();
    t8.join();
    end = clock();
    cout<<end-begin<<endl;   //358

}

The first code block takes about 1833 clock ticks, but the second one, which is almost the same as the first, takes only about 358. Can anyone explain why? Thank you!

  • On an unrelated note, why `int* counter = new int[1024];` instead of just plain `int counter[1024];`? Especially considering that your dynamic allocation will leave the memory *uninitialized* and with *indeterminate* values, leading to *undefined behavior* when you use these values. Defining a plain array in global scope will make sure the data is initialized to zero. – Some programmer dude Mar 13 '22 at 10:18
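A minimal sketch of that suggestion, leaving the rest of the program unchanged:

// A plain array with static storage duration is zero-initialized,
// and there is nothing to delete[] later.
int counter[1024];

void updateCounter(int position)
{
    for (int j = 0; j < 100000000; j++)
        counter[position] += 8;
}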

1 Answer


Writing to nearby variables from multiple threads is slow due to "false sharing", which is described here: What is "false sharing"? How to reproduce / avoid it?

Your offsets of 16/32/48/64 are 64 bytes apart because the int values are (on most common platforms) 4 bytes each. And 64 bytes is a common cache line size, so this puts each target value on its own cache line.
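If you want logically adjacent counters without false sharing, one common option is to pad or align each slot to the cache-line size. Here is a minimal sketch, assuming a 64-byte cache line (where your standard library provides it, std::hardware_destructive_interference_size from <new> can replace the hard-coded 64):

#include <iostream>
#include <thread>
#include <ctime>

// Each slot is aligned (and therefore padded) to 64 bytes, so threads
// writing to neighbouring slots never touch the same cache line.
struct alignas(64) PaddedCounter {
    int value = 0;
};

PaddedCounter counters[8];   // static storage: zero-initialized

void updateCounter(int position)
{
    for (int j = 0; j < 100000000; j++)
        counters[position].value += 8;
}

int main()
{
    std::clock_t begin = std::clock();
    std::thread t1(updateCounter, 1);
    std::thread t2(updateCounter, 2);
    std::thread t3(updateCounter, 3);
    std::thread t4(updateCounter, 4);
    t1.join(); t2.join(); t3.join(); t4.join();
    std::cout << std::clock() - begin << std::endl;   // adjacent slots, but no false sharing
}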

The performance difference is not nearly as large if you compile with optimization enabled, which you should of course always do when measuring performance. But there is still a difference, and it may get worse the more threads you have.
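For example, with GCC or Clang the build command would look something like this (the file name main.cpp is only a placeholder):

g++ -O2 -pthread main.cpp -o bench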

Finally, your benchmark is unfair because you always run the "slow" code first. That means the code and data are "cold" for the first experiment and "hot" for the second. This is a common benchmarking mistake, and depending on your system it may even be the dominant factor in the difference you're seeing.
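A rough sketch of a fairer version of the same measurement, with an untimed warm-up pass before either timed run (the timeRun helper is not part of your original code, just an illustration; running the two experiments in both orders would also work):

#include <iostream>
#include <thread>
#include <ctime>

int* counter = new int[1024]();   // the () value-initializes the ints to zero

void updateCounter(int position)
{
    for (int j = 0; j < 100000000; j++)
        counter[position] += 8;
}

// Runs four threads on the given positions and returns the elapsed clock ticks.
std::clock_t timeRun(int a, int b, int c, int d)
{
    std::clock_t begin = std::clock();
    std::thread t1(updateCounter, a), t2(updateCounter, b),
                t3(updateCounter, c), t4(updateCounter, d);
    t1.join(); t2.join(); t3.join(); t4.join();
    return std::clock() - begin;
}

int main()
{
    timeRun(1, 2, 3, 4);                                // untimed warm-up: code and data get "hot"
    std::cout << timeRun(1, 2, 3, 4) << std::endl;      // adjacent ints: false sharing
    std::cout << timeRun(16, 32, 48, 64) << std::endl;  // 64 bytes apart: no false sharing
}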

John Zwinck