
I wrote a C++ function that is part of a bigger project. This function is called a lot. To improve performance, we decided to split it into four parts, with two parts running in parallel at a time. The complete program takes one input, and one input only, then runs a simulation and passes a variable of length 2000 to the function in question.

This function operates on that variable (at most 20,096 max operations and 150,000 additions, no multiplications). Those operations are done by func1 and func2 in parallel, twice, so on each call each function does a quarter of those operations. Both functions share the same input in memory (double Signal of size 700, read-only; double A, B, C, H, each of size 5600, read and write) and the same output (double L of size 700).

No mutexes are necessary, because func1 reads and writes one half of A, B, C and H and writes into its half of L, while func2 does the same with the other halves. However, there are instances where both functions, i.e. both threads, read Signal at the same time. On the second call, the threads do almost the same operations.
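A minimal sketch of the partitioning described above (array names and sizes are taken from the description; the loop body is only a placeholder, not the real algorithm):

#include <pthread.h>

double Signal[700];                        // shared, read-only
double A[5600], B[5600], C[5600], H[5600]; // each thread touches only its own half
double L[700];                             // each thread writes only its own half

struct Range { int begin, end; };          // index range owned by one thread

void *func_half(void *arg){
    Range *r = (Range *) arg;
    for (int i = r->begin; i < r->end; ++i){
        // placeholder work: only additions and max, like the real function
        A[i] = A[i] + B[i];
        H[i] = (C[i] > H[i]) ? C[i] : H[i];
    }
    // L is split between the threads in the same disjoint way, so no locking is needed
    return NULL;
}

Here func1 would correspond to running func_half over indices [0, 2800) and func2 over [2800, 5600).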

The problem is that the threaded program runs a bit slower than the serial program. When I time each function alone, each run takes a quarter of the original function's time, which makes sense, since func1 and func2 are each called twice. I use clock_t clock() for timing (on Windows this measures wall-clock time, not CPU time as specified in the standard), and the results were consistent with other timing tools such as Windows QueryPerformanceCounter.
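For reference, a minimal wall-clock timing sketch with QueryPerformanceCounter (Windows-only; the commented placeholder stands for whatever code is being measured):

#include <windows.h>
#include <iostream>

int main(){
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);      // ticks per second
    QueryPerformanceCounter(&t0);
    // ... work to be timed ...
    QueryPerformanceCounter(&t1);
    std::cout << "elapsed = "
              << double(t1.QuadPart - t0.QuadPart) / freq.QuadPart
              << " s" << std::endl;
}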

I timed everything and tried everything I came across. I used the optimization options -O3, -O2 and -Ofast. I also created separate memory for each thread (even for the read-only arrays) and then copied the results back.

I have two theories in mind: 1) the overhead of pthreads takes as much time as the functions themselves; 2) main() is sleeping while waiting for pthread_join.

I am more convinced by theory 2, because the only place time could be lost is somewhere in pthread_join.

I wrote this sample code to simulate the problem. Please note that the loop positions are essential to the algorithm I am implementing, so rearranging the operations to use fewer loops will not work.

Note that if you increase the size of the data (j<10000 and j<5000) and decrease the count range correspondingly, the threaded program begins to perform better.

This runs in 1.3 seconds.

#include <math.h>
#include <pthread.h>
#include <iostream>
#include <time.h>
using namespace std;

int main(){
    int i,m,j,k;

    clock_t time_time;
    time_time=clock();

    // 50,000 * 10,000 iterations of trivial work, all in one thread
    for (int count =0 ; count<50000;++count){
        for (j=0;j<10000;j++){
            m=j;
            k=j+1;
            i=m*j;
        }
    }
    cout<<"time spent = "<< double(clock()-time_time)/CLOCKS_PER_SEC<<endl;
}

This runs in 5 seconds on the same processor.

#include <math.h>
#include <pthread.h>
#include <iostream>
#include <time.h>

using namespace std;

void test (int i);

void *thread_func(void *arg){
    int idxThread = *((int *) arg);   // thread index (unused in this sample)
    test (1);
    return NULL;
}

void test (int i){
    // trivial placeholder work standing in for func1/func2
    int j,k,m;
    int q=0,w=1,e=2,r=3,t=4;
    int a=1,s=1,d=1,f=3,g=3;
    for (j=0;j<5000;j++){
        m=j;
        k=j+1;
        i=m*j;
    }
}

int main(){
    int numThreads=2;

    clock_t time_time;
    pthread_t threads[numThreads];
    unsigned int threadIDs[numThreads];
    time_time =clock();

    for (int count =0 ; count<50000;++count){
        // create numThreads threads, then immediately join them, on every iteration
        for (unsigned int id = 0; id < numThreads; ++id)
        {
            threadIDs[id]=id;
            pthread_create(&(threads[id]), NULL, thread_func, (void *) &(threadIDs[id]));
        }
        for (unsigned int id = 0; id < numThreads; ++id)
        {
            pthread_join(threads[id], NULL);
        }
    }
    cout<<"time spent = "<< double(clock()-time_time)/CLOCKS_PER_SEC<<endl;
}

EDIT: The 50000 calls to the thread function are just to illustrate the problem. In my code there are only 2 calls, for func1 and func2, twice, i.e. 4 thread creations and joins, which seem to take about 2 milliseconds.

OS: Windows, MinGW32, pthreads, C++. CPU: i7, RAM: 8 GB

makefile: 
CC = g++ -O3 -I............ -Wformat -c 
LINK = g++ -Wl,--stack,8388608 -o
LINKFLAGS = -lpthread
user304584
  • Well your test code does nothing from the compiler's point of view, since you call your test function with a constant argument, so there are chances all your computation code gets optimized away. What you measure, more or less, is the time needed to create and destroy a thread, which seems to be reasonable (around 50us). Creating threads is costly. It's not a good idea to spend your time creating and destroying them if you want fast code. – kuroi neko Sep 07 '15 at 00:46
  • If you look at the loop, you're creating and joining 50000 * numThreads threads serially, that's going to be quite expensive. – melak47 Sep 07 '15 at 00:47
  • I understand that there is an overhead associated with creating and destroying threads, but I don't think that is the problem. I think it's something else. 50000 is just an example to illustrate the issue. – user304584 Sep 07 '15 at 00:51
  • You are starting a thread to run just one iteration of `test()`. That's pretty wasteful, since your text description sounded like you were actually trying to make each of the two threads process their half of the inputs. – melak47 Sep 07 '15 at 00:54
  • Here's an example of what I thought you wanted to do (hope you don't mind the C++11) http://coliru.stacked-crooked.com/a/e4f8bfd9e2daa832 – melak47 Sep 07 '15 at 01:08
  • I cannot split the arrays into chunks. In my code I only call the threads 4 times. I think the operations I am implementing are not significantly larger than the overhead of creating a thread, or that main is sleeping, which causes the delay. Sorry, my explanations are bad, but thanks. – user304584 Sep 07 '15 at 01:22
  • Take a look at my modified example and the times it produces - 2.28 seconds for 100000 runs of `test`. So each run of test averages about 22 *micro*seconds. Now let's measure how long it takes to [start and join a thread](http://coliru.stacked-crooked.com/a/bc015b8e79c66c10) - about 88 microseconds. – melak47 Sep 07 '15 at 01:31
  • If your code here does not represent your actual problem well, we can't really help you with that... – melak47 Sep 07 '15 at 01:36
  • I was doing exactly what you did there, and I reached the same conclusion: the pthread_create/join takes exactly as much time as the complete serial code, which tells me that threads are not the way to go. Thanks for your help. – user304584 Sep 07 '15 at 01:39
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/88965/discussion-between-melak47-and-user304584). – melak47 Sep 07 '15 at 01:44

2 Answers


As @melak47 illustrated in the comments, the overhead needed to create and join the threads takes more time than the code executed inside each thread.
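A minimal sketch of that measurement, under the same setup as the question: the thread body does nothing at all, so essentially everything measured is pthread_create/pthread_join overhead.

#include <pthread.h>
#include <iostream>
#include <time.h>
using namespace std;

void *empty_func(void *){ return NULL; }   // no work at all

int main(){
    clock_t t0 = clock();
    for (int count = 0; count < 50000; ++count){
        pthread_t th;
        pthread_create(&th, NULL, empty_func, NULL);
        pthread_join(th, NULL);
    }
    // whatever is printed here is pure thread create/join overhead
    cout << "overhead = " << double(clock()-t0)/CLOCKS_PER_SEC << " s" << endl;
}

This matches the conclusion reached in the comments: the create/join overhead alone is comparable to the whole serial loop, so the threaded version cannot win as written.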

user304584
  1. Don't create and join threads. Keep a pool of threads running and assign them tasks as needed (a minimal sketch follows this list).

  2. Don't wait for tasks to complete unless you have no choice. Instead, have the completion of the task trigger work to be done without waiting.
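A minimal sketch of point 1, assuming pthreads as used in the question: one worker thread is created once and reused, and main hands it a task each iteration through a condition variable instead of creating and joining a thread every time. All names here (worker, do_task, the flag variables) are illustrative, not taken from the original code.

#include <pthread.h>

pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  cv  = PTHREAD_COND_INITIALIZER;
bool has_work = false;
bool work_done = false;
bool shutting_down = false;

void do_task(){                    // placeholder for func1/func2
    volatile int x = 0;
    for (int j = 0; j < 5000; ++j) x += j;
}

void *worker(void *){
    for (;;){
        pthread_mutex_lock(&mtx);
        while (!has_work && !shutting_down)
            pthread_cond_wait(&cv, &mtx);
        if (shutting_down){ pthread_mutex_unlock(&mtx); return NULL; }
        has_work = false;
        pthread_mutex_unlock(&mtx);

        do_task();                 // run the task outside the lock

        pthread_mutex_lock(&mtx);
        work_done = true;
        pthread_cond_broadcast(&cv);   // tell main this task is finished
        pthread_mutex_unlock(&mtx);
    }
}

int main(){
    pthread_t th;
    pthread_create(&th, NULL, worker, NULL);   // created once, reused for every task

    for (int count = 0; count < 50000; ++count){
        pthread_mutex_lock(&mtx);
        has_work = true;
        work_done = false;
        pthread_cond_broadcast(&cv);           // hand the worker its half
        pthread_mutex_unlock(&mtx);

        do_task();                             // main does the other half in parallel

        pthread_mutex_lock(&mtx);
        while (!work_done)                     // then wait for the worker to finish
            pthread_cond_wait(&cv, &mtx);
        pthread_mutex_unlock(&mtx);
    }

    pthread_mutex_lock(&mtx);
    shutting_down = true;                      // shut the pool down at the end
    pthread_cond_broadcast(&cv);
    pthread_mutex_unlock(&mtx);
    pthread_join(th, NULL);
}

The worker thread is created and joined exactly once; each iteration costs only a mutex/condition-variable round trip instead of a full pthread_create/pthread_join pair.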

David Schwartz