Why is multithreading slower than sequential programming in my case?

Question

I'm new to multithreading and try to learn it through a simple program, which adds 1 to n and return the sum. In the sequential case, the main call the sumFrom1 function twice for n = 1e5 and 2e5; in the multithreaded cases, two threads are created using pthread_create and two sums are calculated in separate thread. The multithreadting version is much slower than the sequential version (see results below). I run this on a 12-CPU platform and there are no communication between threads.

Multithreaded:

Thread 1 returns: 0 
Thread 2 returns: 0 
sum of 1..10000: 50005000
sum of 1..20000: 200010000
time: 156 seconds

Sequential:

sum of 1..10000: 50005000
sum of 1..20000: 200010000
time: 56 seconds

When I add -O2 in compilation, the time of multithreaded version (9s) is less than that of sequential version (11s), but not much as I expect. I can always have the -O2 flag on but I'm curious about the low speed of multithreading in the unoptimized case. Should it be slower than sequential version? If not, what can I do to make it faster?

The code:

#include <stdio.h>
#include <pthread.h>
#include <time.h>

typedef struct my_struct
{
  int n;                                                                                                                                                              
  int sum;                                                                                                                                                            
}my_struct_t;                                                                                                                                                         

void *sumFrom1(void* sit)                                                                                                                                              
{                                                                                                                                                                     
  my_struct_t* local_sit = (my_struct_t*) sit;                                                                                                                          
  int i;                                                                                                                                                              
  int nsim = 500000;  // Loops for consuming time                                                                                                                                                
  int j;                                                                                                                                                              

  for(j = 0; j < nsim; j++)                                                                                                                                           
  {                                                                                                                                                                   
    local_sit->sum = 0;                                                                                                                                                
    for(i = 0; i <= local_sit->n; i++)                                                                                                                                 
      local_sit->sum += i;                                                                                                                                             
  }    
}

int main(int argc, char *argv[])                                                                                                                                      
{                                                                                                                                                                     
  pthread_t    thread1;                                                                                                                                               
  pthread_t    thread2;                                                                                                                                               
  my_struct_t  si1;                                                                                                                                                   
  my_struct_t  si2;                                                                                                                                                   
  int          iret1;                                                                                                                                                 
  int          iret2;                                                                                                                                                 
  time_t       t1;                                                                                                                                                    
  time_t       t2;                                                                                                                                                    


  si1.n = 10000;                                                                                                                                                      
  si2.n = 20000;                                                                                                                                                      

  if(argc == 2 && atoi(argv[1]) == 1) // Use "./prog 1" to test the time of multithreaded version                                                                                                                                
  {                                                                                                                                                                   
    t1 = time(0);                                                                                                                                                     
    iret1 = pthread_create(&thread1, NULL, sumFrom1, (void*)&si1);      
    iret2 = pthread_create(&thread2, NULL, sumFrom1, (void*)&si2);                                                                                                     
    pthread_join(thread1, NULL);                                                                                                                                      
    pthread_join(thread2, NULL);                                                                                                                                      
    t2 = time(0);                                                                                                                                                     

    printf("Thread 1 returns: %d\n",iret1);                                                                                                                           
    printf("Thread 2 returns: %d\n",iret2);                                                                                                                           
    printf("sum of 1..%d: %d\n", si1.n, si1.sum);                                                                                                                     
    printf("sum of 1..%d: %d\n", si2.n, si2.sum);                                                                                                                     
    printf("time: %d seconds", t2 - t1);                                                                                                                              

  }                                                                                                                                                                   
  else     // Use "./prog" to test the time of sequential version                                                                                                                                                           
  {                                                                                                                                                                   
    t1 = time(0);                                                                                                                                                     
    sumFrom1((void*)&si1);                                                                                                                                            
    sumFrom1((void*)&si2);                                                                                                                                            
    t2 = time(0);                                                                                                                                                     

    printf("sum of 1..%d: %d\n", si1.n, si1.sum);                                                                                                                     
    printf("sum of 1..%d: %d\n", si2.n, si2.sum);                                                                                                                     
    printf("time: %d seconds", t2 - t1); 
  }                                                                                             
  return 0;                                                                                         
}

UPDATE1:

After a little googling on "false sharing" (Thanks, @Martin James!), I think it is the main cause. There are (at least) two ways to fix it:

The first way is inserting a buffer zone between the two structs (Thanks, @dasblinkenlight):

my_struct_t  si1;
char         memHolder[4096];
my_struct_t  si2;

Without -O2, the time consuming decreases from ~156s to ~38s.

The second way is avoiding frequently updating sit->sum, which can be realized using a temp variable in sumFrom1 (as @Jens Gustedt replied):

for(int sum = 0, j = 0; j < nsim; j++)              
{
  sum = 0;
  for(i = 0; i <= local_sit->n; i++)
    sum += i;
}
local_sit->sum = sum;

Without -O2, the time consuming decreases from ~156s to ~35s or ~109s (It has two peaks! I don't know why.). With -O2, the time consuming stays ~8s.

In such tests, we need to average out the results. How many times did you run the tests with -O2 optimization? And if you've ran multiple times, whats the avg times? — Pavan Manjunath, Apr 11 '12 at 09:23
@PavanManjunath Thanks for the advice. I ran 10 times with -O2. the avg times are 7.9s for the multithreaded version and 11.7s for the sequential one. The fluctation is small. — cogitovita, Apr 11 '12 at 09:37
see http://stackoverflow.com/questions/3000456/multi-threaded-random-r-is-slower-than-single-threaded-version/3000585#3000585 though I don't see that 7.9s is 'much slower than' 11.7s — Pete Kirkham, Apr 11 '12 at 09:51
@PeteKirkham Without -O2, multithreading (156s) is 'much slower than' than single-threading (56s). — cogitovita, Apr 11 '12 at 09:58
@cogitovita try inserting a buffer zone between the two structs to see if false sharing has a role. — Sergey Kalinichenko, Apr 11 '12 at 10:06
I usually put a char[128] dummy buffer at the front of my structs. Also, you might need to actually do something with that buffer to prevent the compiler optimizing it away - initialize the first and last elements to $FF or something. — Martin James, Apr 11 '12 at 10:16

score 3 · Accepted Answer · answered Apr 11 '12 at 10:26

By modifying your code to

typedef struct my_struct
{
  size_t n;
  size_t sum;
}my_struct_t;

void *sumFrom1(void* sit)
{
  my_struct_t* local_sit = sit;
  size_t nsim = 500000;  // Loops for consuming time
  size_t n = local_sit->n;
  size_t sum = 0;
  for(size_t j = 0; j < nsim; j++)
  {
    for(size_t i = 0; i <= n; i++)
      sum += i;
  }
  local_sit->sum = sum;
  return 0;
}

the phenomenon disappears. The problems you had:

using int as a datatype is completely wrong for such a test. Your figures where such that the sum overflowed. Overflow of signed types is undefined behavior. You are lucky that it didn't eat your lunch.
having bounds and summation variables with indirection buys you additional loads and stores, that in case of -O0 are really done as such, with all the implications of false sharing and stuff like that.

Your code also observed other errors:

a missing include for atoi
superflouous cast to and from void*
printing of time_t as int

Please compile your code with -Wall before posting.

Using `size_t sum = 0;`results in a significant preformance boost. Then adding this `size_t n = local_sit->n;` slows things down again. Any ideas why? (all compiled with -O0) — alk, Apr 11 '12 at 11:04
No, not really. Discussing non-optimized code to that detail makes not much sense, I think. If you want to know what is really going on, a first step would be to look into the assembler that is generated with `-S`. — Jens Gustedt, Apr 11 '12 at 11:18

Why is multithreading slower than sequential programming in my case?

1 Answers1