
I am learning multi-threaded programming. I wrote a simple program, but its performance decreases as the number of threads increases. I want to start measuring only after all threads have finished `thread_local_init()`, so I use an atomic flag to synchronize.

But the value of `elapsed.count()` increases as the number of threads increases.

Here is my code:

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <thread>

#define THREADS 64 //32 //16 //8 //2
#define OPS (100*1024*1024)

std::atomic<int> flag(0);
thread_local int init[1024]; // thread_local is the standard spelling of GCC's __thread
uint64_t* ptr_array;
std::chrono::time_point<std::chrono::system_clock> start;

void thread_local_init(){
    for(int i=0;i<1024;i++) init[i]=i;
}

void do_alloc(int id){
    thread_local_init();
    flag++;
    uint64_t each_ops = OPS / THREADS;
    uint64_t w_start = each_ops * id;
    uint64_t w_end = each_ops * (id+1);
    while(flag!=THREADS){
        // spin until every thread has finished its init
    }
    // Let only one thread record the start time: concurrent writes to a
    // non-atomic time_point from all threads would be a data race.
    if(id == 0)
        start = std::chrono::system_clock::now();
    for(uint64_t i=w_start;i<w_end;i++){
        ptr_array[i] = i;
    }
}

int main(int argc, char** argv){
    std::thread threads[THREADS];
    // sizeof(ptr_array) is the size of the pointer, not of the element;
    // sizeof(*ptr_array), i.e. sizeof(uint64_t), is what was meant.
    ptr_array = (uint64_t*)malloc(sizeof(*ptr_array)*OPS);
    for (int i = 0; i < THREADS; i++) {
        threads[i] = std::thread(do_alloc,i);
    }
    for (auto& t: threads) {
        t.join();
    }
    auto end = std::chrono::system_clock::now();
    auto elapsed = end - start;
    std::cout << elapsed.count() << '\n';
    free(ptr_array);
}
```
tuffy chow
  • In fact, there is an initialization before the loop, so I don't want to include it in the measurement. My machine has 72 hardware threads. – tuffy chow Feb 02 '21 at 12:22
    Did you set _one_ clock before starting the threads and compare with the time after the threads are done, without having any synchronization in the threads? Unless the threads are short-lived, I don't see why that would increase the time. In those cases I usually get up to X times _shorter_ execution time, where X is the number of available hardware threads. – Ted Lyngmo Feb 02 '21 at 15:33
    I made [this test program](https://godbolt.org/z/f373xd) to run on my machine. It runs for a long time (the first test may run for 30-60 seconds), the second should run in half that time, and then it'll flatten out until there is no improvement in creating more threads. In the summary that is printed when the program ends, it's clear that on my old 6-core (12 hyperthreads) Xeon, using half of the reported hardware threads gave the best throughput - that is, 1 thread per core. – Ted Lyngmo Feb 02 '21 at 16:19
    Here's an [updated version](https://godbolt.org/z/h4KerK) of the test program. I noticed that `clang++` was actually able to figure out the result of those loops and made them into constants - so here I've replaced it with randomized calculations. For me, this improves with every thread I add - until I reach _hardware_threads - 1_. – Ted Lyngmo Feb 02 '21 at 17:15

1 Answer


> Why does the performance of my program decrease as the number of threads increases?

For hardware reasons, probably. I assume you run some Linux on an x86-64 processor.

Processors have a hardware-defined number of cores. Depending on the price of your processor, it could have from two to 64 cores. On Linux, you can use the `lscpu` command, or `cat /proc/cpuinfo`, to find out. See proc(5).

Your threads are all CPU-intensive, so you won't gain much by running more threads than the number of cores you have.

Also, your threads are all accessing common RAM, so CPU caches and memory bandwidth matter: every thread streams writes into the same large array, and once the memory bus is saturated, additional threads only add contention. Read about cache-coherency protocols.

Basile Starynkevitch