
This question is similar to this one about threads, but different.

I have a program that splits a time-consuming calculation into different jobs that can be run in parallel. Specifically, there is step A and then step B, which repeat multiple times, so the program runs ABABAB... etc.

So I have one thread that controls things, and its job is to tell N other threads to start step A and then to start step B. It is important to start the next step only after all threads of the previous step have finished.

So pseudo code might be something like

    for (n = 0; n < N; n++)
        flag[n] = 1;            /* tell thread n to start step A */

    do {                        /* wait until every thread has finished A */
        sum = 0;
        for (n = 0; n < N; n++)
            sum += flag[n];
    } while (sum != 0);
    /* threads set their flag to 0 when their calculations are complete */

    /* ... similar code to start and monitor step B ... */

So my question is: is there a better way to send messages between threads? (I guess the answer is yes.) Are any of the better ways faster, or at least as fast? (I guess the answer may be no, but I'm really interested to hear any thoughts about this.)

Oh, and I should have said: the aim is to have each thread locked onto a single core for speed.

Edit

Many thanks for all the comments and for the suggestion to look at C pthread synchronize function; that question is very relevant, but it does not answer the question about speed. In his answer, @WillisHershey makes the point that all the threads spin constantly to check the status of variables in common memory, and the comments mention the potential for race conditions and the difficulty of debugging.

Despite the potential drawbacks of the method I have tried to outline in the pseudo code, I am interested to know whether the suggested 'barrier' method in pthread.h will be as fast as, faster than, or slower than the method in the pseudo code.

Maybe I could have been clearer in the original question above: steps A and B will need to be repeated 1e6 to 1e9 times, or as many times as possible in a reasonable time, so I am very wary of any potential hold-up in synchronising the threads. Does this make sense?

tom
  • Look into thread barriers, looks like that might be the right tool – Mat Jul 13 '22 at 15:15
  • Better than what? Your code does not show how you synchronize the threads. But in general, it looks like you are looking for a *barrier* https://stackoverflow.com/questions/12907479/c-pthread-synchronize-function – Eugene Sh. Jul 13 '22 at 15:16
  • Thread barriers, signals, and/or semaphores would all work for this kind of synchronization if implemented carefully, but **definitely** don't implement this by storing synchronization data in plain common memory. That's a recipe for race conditions and busy spinlocks, which is not what you want – Willis Hershey Jul 13 '22 at 15:26
  • You can wait for each thread in turn to finish, using a condition variable per thread. All sensible methods are usually equally fast. – user253751 Jul 13 '22 at 15:41
  • @EugeneSh. - the code is meant to show a 'master' thread that starts N other threads by setting a variable to 1 - the other threads see their variable change, do their calculations and then set the variable to 0 - the master thread monitors the variables and waits until they are all zero before kicking off the next cycle of calculations. ... does that make sense? – tom Jul 13 '22 at 23:59
  • In response to edit: the fastest way to synchronize threads is with library functions because otherwise you’re just going to be wasting clock cycles checking for a signal over and over again. Bypassing the OS for concurrency is never a faster solution. I invite you to do an experiment and see for yourself – Willis Hershey Jul 14 '22 at 00:00
  • By “huge waste of CPU resources” in my answer I mean waste of clock-cycles which means slower execution of the code – Willis Hershey Jul 14 '22 at 00:02
  • Ah yes, @WillisHershey - but each thread should be running on its own core... I agree if all threads were waiting in turn for the CPU that there would be no advantage and things would slow down... the aim is to run the calculation completely in parallel with say 10 or 20 cores each dedicated to a single thread,... does that make sense? – tom Jul 14 '22 at 00:05
  • @WillisHershey - yes, may try the experiment... – tom Jul 14 '22 at 00:06
  • @tom, without use of proper synchronization objects, the threads *might not* observe the variable change. And if you have them busy-waiting to try to observe such a change, then you are wasting a lot of CPU. That could easily slow all the threads still trying to make progress to the synchronization point. First make it **right**. Only then consider making it faster. – John Bollinger Jul 14 '22 at 00:08
  • The problem with that logic is that when a thread is finished, its core is going to be completely useless until the others have finished. Don’t try to outsmart the OS scheduling algorithm, it is tuned for precision on your machine – Willis Hershey Jul 14 '22 at 00:08
  • Also, if your thread activity is so fine-grained that the IPC costs are non-negligible, then that granularity is the problem to solve. – John Bollinger Jul 14 '22 at 00:15
  • @WillisHershey yes, the issue is that the calculation of each step, e.g. A , needs to be completed before the next step, e.g. B, can be started. I agree that it could be that cores will be idle waiting for other threads to finish... ... and so one part of the code at the beginning makes a plan to divide up all the calculations for step A into approximately equal chunks of work to be spread over the number of cores that are available. Yes cores could be 'idle' waiting for the next instruction to start calculation. + many thanks for your helpful comments – tom Jul 14 '22 at 00:29
  • @JohnBollinger - thanks for your comments. yes IPC costs are an issue. and that is why I am trying to spread the calculation over multiple cores,... ..., I hope that having threads on different cores that having one thread checking a memory locations would not slow down the progress of other cores, but I may be wrong about that... ... I am concerned by your point about a possible delay between one thread setting a particular value in memory and another thread seeing the updated value. That could cause all sorts of issues - do you have any further information about that? – tom Jul 14 '22 at 00:38
  • At the risk of sounding snarky... the fastest communication is no communication at all; just each thread doing its own separate thing as fast as it can. If you can architect your program so that threads communicate as little as possible, you can maximize speed without having to worry so much about making communications 100% optimally efficient. – Jeremy Friesner Jul 14 '22 at 01:29
  • @JeremyFriesner, I like your idea, sadly after step A the results from all the different threads need to be collated together in step B so synchronisation is unavoidable and the threads cannot just run independently – tom Jul 14 '22 at 02:23

2 Answers


The exact synchronization tools available to you will depend almost entirely on the version (and/or flavor) of C you're using and on your OS: whether your compiler supports C11 <threads.h>, whether you're on a POSIX-based system, which would give you access to <pthread.h> and <semaphore.h>, etc. The overall behavior of the synchronization tools won't vary much from system to system, but the exact initialization and call syntax will.

The problem with solving this the way your pseudo code suggests (without a synchronization library) is that the control thread ends up spinning in a loop, checking every single worker thread's progress until they're all complete, while every worker that has finished the current task spins in a loop, checking over and over again whether the control thread has given it the go-ahead to continue. That is a huge waste of CPU resources, and it can be a nightmare to debug for complicated race-condition and atomic-access reasons.

The proper way to synchronize threads efficiently is to let the operating system do the scheduling. This means you're looking for a way to tell each thread to go to sleep when it reaches the point between steps A and B, then have the OS wake them all up when all N worker threads have finished A, so they can do B, then all go to sleep again, and so on.

This is an example of a synchronized threading system using C11 <threads.h> that creates N threads that each call doA(), then sleep until the last one finishes, then each call doB() and synchronize again then start over. This is effectively a manual implementation of a pthread barrier since they aren't available in the <threads.h> library.

#define N 20 //Or however many threads you need
#define NUM_LOOPS 4 //However many times AB must be done

//One instance of this struct will sit on the stack of main()
//and every worker thread will have a reference to it
struct worker{
  int count;        //Number of threads finished with the current step
  unsigned phase;   //Bumped each time the barrier releases, so sleeping
                    //threads can tell a real release from a spurious wakeup
  mtx_t mutex;      //Make accessing "count" and "phase" thread-safe
  cnd_t condition;  //Synchronize wakeup and sleep
};

//Block until all N threads have reached this point
static void barrier_wait(struct worker *shared){
  mtx_lock(&shared->mutex);
  if(++shared->count == N){             //I am the last to arrive
    shared->count = 0;                  //Reset count for the next step
    shared->phase++;                    //Mark this barrier as released
    cnd_broadcast(&shared->condition);  //Wake up the sleeping worker threads
  }
  else{
    unsigned my_phase = shared->phase;
    while(shared->phase == my_phase)    //Loop guards against spurious wakeups
      cnd_wait(&shared->condition, &shared->mutex); //Unlocks the mutex, sleeps,
                                                    //then re-acquires it on wakeup
  }
  mtx_unlock(&shared->mutex);
}

int worker_thread(void *pt){
  struct worker *shared = pt; //pt is actually a pointer to synchronization data in main()

  for(int c = 0;c < NUM_LOOPS; ++c){
    doA();
    barrier_wait(shared); //Sleep until the last thread finishes A
    doB();
    barrier_wait(shared); //Sleep until the last thread finishes B
  }
  return 0;
}

int main(){
  thrd_t threads[N];
  struct worker shared;
  shared.count = 0;
  shared.phase = 0;
  mtx_init(&shared.mutex, mtx_plain); //To allow for thread-safe access to the counters
  cnd_init(&shared.condition);        //To synchronize sleep and wake-up

  //Create all N threads
  for(int c = 0;c < N; ++c)
    thrd_create(&threads[c], worker_thread, &shared); //Pass reference to synchronization data

  //Reap all N threads, sleeping while waiting for them to terminate
  for(int c = 0;c < N; ++c)
    thrd_join(threads[c],NULL);

  //Cleanup
  mtx_destroy(&shared.mutex);
  cnd_destroy(&shared.condition);
}
Willis Hershey
  • Many thanks for your answer and for the helpful comment above. I'm not 100% sure of the version of C on the machine that I will run on, but the OS is Linux and `pthread.h` is the library that I use to start the 'worker threads' that do the calculations. – tom Jul 14 '22 at 00:00
  • With `pthread.h` you’ll have an equivalent to every function I used, as well as a `pthread_barrier_t` type and related functions that will work better than the pseudocode logic you provided – Willis Hershey Jul 14 '22 at 00:05

So my question is... ... Is there a better way to send messages between threads (I guess the answer is yes)...

The Swiss army knife of thread synchronization is the condition variable, which works in conjunction with a mutex and some shared data (such as your flag variables). You could build a reasonable solution with that, and that would probably be the right choice if indeed you want to have a separate manager thread controlling when the others step from phase to phase.

For your particular problem, however, it sounds like a barrier -- or probably two -- would afford a simpler solution. That could also remove the need for a separate manager thread.

Are any of the better ways faster or at least as fast? (I guess the answer may be no.. but really interested to hear any thoughts about this)

[...]

.... Despite the potential drawbacks of the method I have tried to outline in the pseudo code I am interested to know if the suggested 'barrier' method in pthread.h will be as fast, faster or slower than the method in the pseudo code.

Inasmuch as the pseudocode appears to demonstrate two or more threads accessing a shared variable without synchronization, with some of the accesses being writes, the speed question is moot. The behavior of a program in which such a pattern of activity occurs is undefined.

In practice, that might very well manifest as one thread not observing some or all of the other's writes in a timely manner, or perhaps ever. Naturally, that would be devastating to your scheme. There is no point in debating the performance of code that is semantically wrong.

It's unclear whether it makes sense to worry about the IPC speed. If your steps are so fine-grained that the synchronization costs constitute a significant fraction of the overall time, then you need to make each thread do more work between needing to synchronize. If not, then you are wasting your time by worrying about it.

John Bollinger
  • Many thanks for your answer. I can see completely that **undefined** behaviour would be devastating. I think I need to read up about synchronisation and, if necessary, ask another question about how to ensure that all the memory write changes of threads are completed and visible to all the other threads. – tom Jul 14 '22 at 00:47