Time increasing with increased number of threads

Question

I was trying to calculate the speedup and optimal number of threads for matrix addition but the parallel execution always takes more time than sequential and increases keeps on increasing till about 8 threads and then becomes kind of constant. Can anyone help me figure out why?

The sequential code:

#include <stdlib.h>    
#include <stdio.h>     
#include <time.h>  
 
 
int main (int argc, char *argv[])  
{ 
 
    int ARRAY_SIZE;     
    int n = 10000;                  
    int n_per_thread;                     
    int i,j;       
    int *a[n];
    int *b[n]; 
    int *c[n];  
    for (i=0; i<n; i++){
         a[i] = (int *)malloc(n * sizeof(int)); 
         b[i] = (int *)malloc(n * sizeof(int));
         c[i] = (int *)malloc(n * sizeof(int)); 
    }     
    for(i=0; i<n; i++) {
        for(j=0;j<n;j++){ 
        a[i][j] = 1;
      } 
    } 
    for(i=0; i<n; i++) {
        for(j=0;j<n;j++){ 
        b[i][j] = 1;
      } 
    }    
    clock_t t;  
    t = clock();  
    for(i=0; i<n; i++) {
        for(j=0;j<n;j++){ 
        c[i][j] = a[i][j]+b[i][j];
      } 
    }
    t = clock() - t;  
    double time_taken = ((double)t)/CLOCKS_PER_SEC; 
    printf("Time taken by sequential for matrix size %d: ",n); 
    printf("%f%s\n",time_taken," seconds"); 
    return 0; 
}

The parallel code:

#include <stdlib.h>    
#include <stdio.h>     
#include <omp.h>       
#include <time.h>  
 
#define NUM_THREADS 10  
 
int main (int argc, char *argv[])  
{ 
 
    int ARRAY_SIZE; 
    int n = 10000;           
    int n_per_thread;                    
    int total_threads = NUM_THREADS;       
    int i,j;       
    int *a[n];
    int *b[n]; 
    int *c[n];  
    for (i=0; i<n; i++){
         a[i] = (int *)malloc(n * sizeof(int)); 
         b[i] = (int *)malloc(n * sizeof(int));
         c[i] = (int *)malloc(n * sizeof(int)); 
    }     
    for(i=0; i<n; i++) {
        for(j=0;j<n;j++){ 
        a[i][j] = 1;
      } 
    } 
    for(i=0; i<n; i++) {
        for(j=0;j<n;j++){ 
        b[i][j] = 1;
    }
    }          
    omp_set_num_threads(total_threads); 
    n_per_thread = n/total_threads; 
    clock_t t;  
    t = clock();  
    #pragma omp parallel for shared(a, b, c) private(i) schedule(static, n_per_thread) 
        for(i=0; i<n; i++) {
            for(j=0;j<n;j++){ 
                c[i][j] = a[i][j]+b[i][j];
      } 
    }
    t = clock() - t;  
    double time_taken = ((double)t)/CLOCKS_PER_SEC; 
    printf("Time taken by parallel for vector size %d: ",n); 
    printf("%f%s\n",time_taken," seconds"); 
    return 0; 
}

@user253751 Thank ou for the reply. I will use omp_get_wtime(). — minnieme, Jul 24 '20 at 15:59
Assumptions about threads and parallel operations may not be altogether right. There is a brief discussion these topics [here](https://stackoverflow.com/questions/19324306/running-two-threads-at-the-same-time/19324717#19324717). — ryyker, Jul 24 '20 at 16:07
Do you have compiler optimizations active? Is this a debug or release build? — Andreas Wenzel, Jul 24 '20 at 16:33
In your program, you are using 10 threads of execution. However, how many physical and logical CPU cores does your computer have? — Andreas Wenzel, Jul 24 '20 at 16:59
And do you realize that, since you do nothing with your arrays after you do the calculations, the compiler is free to elide your entire calculation loop that you're trying to time? — Andrew Henle, Jul 24 '20 at 22:25

Andreas Wenzel · Answer 1 · 2020-07-25T19:18:06.437

The reason why multithreading won't help much is probably because you are bottlenecked by memory bandwidth, not CPU speed.

Modern desktop computers have a memory bandwidth of about 20 GB/s. Assuming sizeof(int) == 4, that means 5 billion ints can be transferred per second in memory. Since every one of your arithmetic operations reads 2 ints and writes 1, that means a modern desktop computer has sufficient memory bandwidth for 1,7 bilion of these arithmetic operations per second. Therefore, a modern desktop computer has enough memory bandwidth to run your program with n = 40000 in one second.

Unless you are using a NUMA architecture, using multi-threading will only increase the potential computational speed, but it will not increase your memory bandwidth.

The single-threaded version of your loop code probably already gets vectorized by the compiler, which means it uses SIMD instructions. At least that is what my compiler does when I compile your single-threaded code with compiler optimizations active. That way, the code of the single-threaded version is already parallelized to a large extent.

Further parallelizing the code is of course possible by using multi-threading. But your calculations are rather simple, because you perform only one addition on every piece of data. Therefore, it is likely that your vectorized single-threaded version is bottlenecked not so much by the computational power of your CPU, but rather by your memory bandwidth. See the following StackOverflow question for further information on how vectorization of a loop will sometimes not give you a performance benefit, if it is bottlenecked by memory bandwidth:

Why vectorizing the loop does not have performance improvement

If your calculations are so simple that even the single-threaded (vectorized) version of your code gets bottlenecked by memory bandwidth and not CPU speed, then you will not profit much from adding even more computational power by introducing multi-threading.

Time increasing with increased number of threads

1 Answers1