C multithread performance issue

Question

I am writing a multi-threaded program to traverse an n x n matrix, where the elements in the main diagonal are processed in a parallel manner, as shown in the code below:

int main(int argc, char * argv[] )
{   
  /* VARIABLES INITIALIZATION HERE */

  gettimeofday(&start_t, NULL); //start timing
  for (int slice = 0; slice < 2 * n - 1; ++slice)
  {  
    z = slice < n ? 0 : slice - n + 1;
    int L = 0;
    pthread_t threads[slice-z-z+1];
    struct thread_data td[slice-z-z+1];

    for (int j=z; j<=slice-z; ++j)
    {
      td[L].index= L;
      printf("create:%d\n", L );
      pthread_create(&threads[L],NULL,mult_thread,(void *)&td[L]);
      L++;
    }

    for (int j=0; j<L; j++) 
    {
      pthread_join(threads[j],NULL);
    }
  }     

  gettimeofday(&end_t, NULL); 
  printf("Total time taken by CPU: %ld \n", ( (end_t.tv_sec - start_t.tv_sec)*1000000 + end_t.tv_usec - start_t.tv_usec));

  return (0);
}

void *mult_thread(void *t)
{      
  struct thread_data *my_data= (struct thread_data*) t;

  /* SOME ADDITIONAL CODE LINES HERE */ 

  printf("ThreadFunction:%d\n", (*my_data).index );

  return (NULL);
}

The problem is that this multithreaded implementation gave me a very bad performance compared with the serial (naive) implementation.

Are there some adjustments that could be done to improve the performance of the multithreaded version ??

Have you attempted to identify those parts of the code that might benefit from multithreaded execution? Adding more threads to a program doesn't automatically make it faster. — Robert Harvey, Apr 19 '15 at 22:34
is the task done by each thread large enough to overcome the overhead? Also it might be an idea to use a maximum amount of threads because your system can handle only 4, 8, 16 etc threads — Simon, Apr 19 '15 at 22:39
I could imagine that the single-threaded version profits from caching in a way the multi-threaded one can't because memory accesses come in randomly. Have a look at [this question](http://stackoverflow.com/questions/4802565/multiple-threads-and-cpu-cache). Maybe try storing the matrix as diagonals? — dddsnn, Apr 19 '15 at 22:54
Create a new thread also have some cost . if the cost in "SOME ADDITIONAL CODE LINES HERE" is not much higher than the cost in thread creating , it may even make the performance worse.You may create some thread first (2 * cpu thread , in my experience),and write you task to a queue or some other structure. — Lumen, Apr 19 '15 at 23:29
@Simon It is not that larg, but I thought that assigning the task that should be done by each cell to a separate thread could raises the performance. However, how can I use the maximum amount of threads as you mentioned? — MROF, Apr 19 '15 at 23:59
You are missing the point that @RobertHarvey and the others are making. Using multithreading does not automatically increase performance. And in some cases it will even cause performance degradation. Show us the "ADDITIONAL CODE LINES". Without that no one can tell you whether there is something you are doing wrong that results in no performance gain or whether the problem you are trying to solve does not benefit from multithreading. — kaylum, Apr 20 '15 at 00:06
The system "can handle" large amounts of threads, but they won't make it faster. A good start for tests is the number of cores in your CPU(s), because with a grain of salt two threads on the same core won't run any faster combined than a single thread alone; but creating the threads needs quite a lot of time, so the overall result will be *slower*. Creating a thread only pays if there is enough work to do for it in any case. — Peter - Reinstate Monica, Apr 20 '15 at 00:46
the parameters to main(), argc and argv[] are not used. This results in the compiler raising two warnings. Perhaps you meant to use: 'int main(void)' The compiler should always have all warnings enabled. Then those warnings need to be corrected in gcc, that means using (at least) '-Wall -Wextra -pedantic' — user3629249, Apr 20 '15 at 02:14
threads should be exited with 'thread_exit()' not return (NULL) — user3629249, Apr 20 '15 at 02:16
@PeterSchneider A CPU core with hyperthreading can run a program with two threads and a lot of memory access a lot faster then with one thread, there is almost no cost switching from one to another thread but there are a lot of times one thread is waiting for data — Simon, Apr 20 '15 at 23:22
@user3629249 From http://pubs.opengroup.org/onlinepubs/007908775/xsh/pthread_exit.html: "An implicit call to pthread_exit() is made when a thread other than the thread in which main() was first invoked returns from the start routine that was used to create it. The function's return value serves as the thread's exit status." I think a return is fine, even if NULL is a bit off, but fine as well after some coercing.... — Peter - Reinstate Monica, Apr 21 '15 at 04:58
@Simon Oh, a lot? My experience never was more than a grain of salt ;-). What was your best improvement? — Peter - Reinstate Monica, Apr 21 '15 at 04:59

Lumen · Answer 1 · 2015-04-23T13:43:47.123

a thread pool may make it better.

define a new struct type as follow.

typedef struct {
    struct thread_data * data;
    int status; // 0: ready 
                // 1: adding data 
                // 2: data handling, 3: done
    int next_free;
} thread_node;

init :

size_t thread_size = 8;
thread_node * nodes = (thread_node *)malloc(thread_size * sizeof(thread_node));
for(int i = 0 ; i < thread_size - 1 ; i++ ) {
    nodes[i].next_free = i + 1;
    nodes[i].status = 0 ; 
}
nodes[thread_size - 1].next_free = -1;
int current_free_node = 0 ;
pthread_mutex_t mutex;

get thread :

int alloc() {
    pthread_mutex_lock(&mutex);
    int rt = current_free_node;
    if(current_free_node != -1) {
        current_free_node = nodes[current_free_node].next_free;
        nodes[rt].status = 1;
    }
    pthread_mutex_unlock(&mutex);
    return rt;
}

return thread :

void back(int idx) {
    pthread_mutex_lock(&mutex);
    nodes[idx].next_free = current_free_node;
    current_free_node = idx;
    nodes[idx].status = 0;
    pthread_mutex_unlock(&mutex);
}

create the threads first, and use alloc() to try to get a idle thread, update the pointer.

don't use join to judge the status.
modify your mult_thread as a loop and after the job finished , just change your status to 3
for each loop in the thread , you may give it more work

I wish it will give you some help.

------------ UPDATED Apr. 23, 2015 -------------------

here is a example.

compile & run with command $ g++ thread_pool.cc -o tp -pthread --std=c++

yu:thread_pool yu$ g++ tp.cc -o tp  -pthread --std=c++11 && ./tp
1227135.147 1227176.546 1227217.944 1227259.340...
time cost 1 : 1068.339091 ms
1227135.147 1227176.546 1227217.944 1227259.340...
time cost 2 : 548.221607 ms

you may also remove timer and it can also compiled as a std c99 file.

In current , the thread size has been limited to 2. You may also adjust the parameter thread_size, and recompile & run again. More threads may give your some more advantage(in my pc, if I change the thread size to 4, the task will finish in 280ms), while too much thread number may not help you too much if you have no enough cpu thread.

As i am new in pthread programming, a faced some difficulties to understand some parts of your suggested updates: - Are you suggesting to use a fixed number of threads, which is 8 ? - How to call the functions alloc and back ? I would appreciate if you can explain the questions above — MROF, Apr 21 '15 at 11:56
@MROF may I know the final result ? if it can answer your question, could you please click the 'accept' button? — Lumen, Apr 24 '15 at 10:13

C multithread performance issue

1 Answers1