I think the other answers lack a concrete example that runs different functions in separate threads, passes parameters to them, and includes some benchmarks:
// NB: gcc -O3 pthread.c -lpthread && time ./a.out
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <string.h>
#include <stdbool.h> // bool, true, false (C99) instead of hand-rolled macros
typedef struct my_ptr {
    long n; // running sum, accumulated across all calls
    long i; // current input value
} t_my_ptr;
void *sum_primes(void *ptr) {
    t_my_ptr *my_ptr = ptr;
    if (my_ptr->n < 0) // handle misuse of the function
        return (void *)-1;
    // Sieve flags for 0..i; this variable-length array lives on the thread's stack.
    bool isPrime[my_ptr->i + 1];
    memset(isPrime, true, sizeof isPrime);
    // 2 is the only even prime. Note that this tests the running sum n, not the limit i.
    if (my_ptr->n >= 2) {
        my_ptr->n += 2;
    }
    for (long i = 3; i <= my_ptr->i; i += 2) { // from here on, only odd numbers can be prime
        if (isPrime[i]) {
            my_ptr->n += i;
        }
        // Sieve of Eratosthenes: cross off the odd multiples of i, starting at i * i.
        for (long j = i * i; j <= my_ptr->i; j += i * 2)
            isPrime[j] = false;
    }
    //printf("%s: %ld\n", __func__, my_ptr->n); // a) enable both 'a' and 'b' to see that the two functions run asynchronously.
    return NULL; // a function declared to return void * must return a value
}
void *sum_square(void *ptr) {
    t_my_ptr *my_ptr = ptr;
    my_ptr->n += (my_ptr->i * my_ptr->i) >> 3;
    //printf("%s: %ld\n", __func__, my_ptr->n); // b) enable both 'a' and 'b' to see that the two functions run asynchronously.
    return NULL;
}
void *sum_add_modulo_three(void *ptr) {
    t_my_ptr *my_ptr = ptr;
    my_ptr->n += my_ptr->i % 3;
    return NULL;
}
void *sum_add_modulo_thirteen(void *ptr) {
    t_my_ptr *my_ptr = ptr;
    my_ptr->n += my_ptr->i % 13;
    return NULL;
}
void *sum_add_twice(void *ptr) {
    t_my_ptr *my_ptr = ptr;
    my_ptr->n += my_ptr->i + my_ptr->i;
    return NULL;
}
void *sum_times_five(void *ptr) {
    t_my_ptr *my_ptr = ptr;
    my_ptr->n += my_ptr->i * 5;
    return NULL;
}
void *sum_times_thirteen(void *ptr) {
    t_my_ptr *my_ptr = ptr;
    my_ptr->n += my_ptr->i * 13;
    return NULL;
}
void *sum_times_seventeen(void *ptr) {
    t_my_ptr *my_ptr = ptr;
    my_ptr->n += my_ptr->i * 17;
    return NULL;
}
#define THREADS_NB 8
int main(void)
{
    pthread_t thread[THREADS_NB];
    void *(*fptr[THREADS_NB])(void *ptr) = {
        sum_primes, sum_square, sum_add_modulo_three, sum_add_modulo_thirteen,
        sum_add_twice, sum_times_five, sum_times_thirteen, sum_times_seventeen
    };
    t_my_ptr arg[THREADS_NB];
    memset(arg, 0, sizeof(arg));
    long iret[THREADS_NB]; // optional: return codes of pthread_create (see NB2)
    for (volatile long i = 0; i < 100000; i++) {
        for (int j = 0; j < THREADS_NB; j++) {
            arg[j].i = i;
            //fptr[j](&arg[j]); // benchmark without pthread: call the function directly
            // fptr[j] already has the type pthread_create expects, so no cast is needed.
            pthread_create(&thread[j], NULL, fptr[j], &arg[j]); // https://man7.org/linux/man-pages/man3/pthread_create.3.html
        }
        // Wait until the threads are complete before main continues. Otherwise
        // main would overwrite arg[j].i while threads are still using it, and
        // it could reach the exit, which terminates the process and all
        // threads before they have completed.
        for (int j = 0; j < THREADS_NB; j++)
            pthread_join(thread[j], NULL);
        //printf("pthread_create for thread 0 returned: %ld\n", iret[0]); // if we care about the return code
    }
    for (int j = 0; j < THREADS_NB; j++)
        printf("Function %d: %ld\n", j, arg[j].n);
    return 0;
}
Output:
Function 0: 15616893616113
Function 1: 41666041650000
Function 2: 99999
Function 3: 599982
Function 4: 9999900000
Function 5: 24999750000
Function 6: 64999350000
Function 7: 84999150000
Conclusion (using 8 threads)
- Without pthread but with optimization flag -O3: 9.2 s
- With pthread and no optimization flag: 31.4 s
- With pthread and optimization flag -O3: 17.8 s
- With pthread and optimization flag -O3 and without pthread_join: 2.0 s. However, the output is wrong: main overwrites arg[j].i and starts the next batch while earlier threads are still running, so several threads read and update the same struct at the same time.
Why would multithreading be slower? It's very simple: creating a thread is costly in terms of CPU cycles, so each thread has to do enough work to amortize that cost. This first benchmark is biased against threads because most of the functions are trivial to compute.
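One way to amortize the creation cost is to create a thread once and let it loop over the whole range itself, instead of creating a new thread for every value of i. Here is a minimal sketch of that idea, not a benchmarked replacement for the program above. The names range_job, times_five_range and run_times_five are hypothetical, and the arithmetic is the same as in sum_times_five.

```c
#include <pthread.h>

/* Hypothetical per-thread job: the range to cover and the running total. */
typedef struct range_job {
    long limit; /* iterate i over [0, limit) */
    long total; /* written only by the owning thread */
} range_job;

/* Same arithmetic as sum_times_five, but the loop lives inside the thread. */
static void *times_five_range(void *ptr) {
    range_job *job = ptr;
    for (long i = 0; i < job->limit; i++)
        job->total += i * 5;
    return NULL;
}

/* Start one thread, wait for it, and return its total (-1 on error). */
long run_times_five(long limit) {
    range_job job = { .limit = limit, .total = 0 };
    pthread_t tid;
    if (pthread_create(&tid, NULL, times_five_range, &job) != 0)
        return -1;
    if (pthread_join(tid, NULL) != 0)
        return -1;
    return job.total;
}
```

Calling run_times_five(100000) pays for one pthread_create and one pthread_join instead of 100000 of each, and it should give the same total as "Function 5" in the output above.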
Conclusion (using 8 threads), after replacing the body of every function with that of sum_primes (to benchmark the benefit with heavier computation)
- Without pthread but with auto-vectorization (-O3): 1 min 14.4 s
- With pthread but without optimization flags: 2 min 18.6 s
- With pthread and with auto-vectorization (-O3): 54.7 s
- With pthread, auto-vectorization and without pthread_join: 2.8 s. However, the output is wrong: main overwrites arg[j].i and starts the next batch while earlier threads are still running, so several threads read and update the same struct at the same time.
Output:
Function 0: 15616893616113
Function 1: 15616893616113
Function 2: 15616893616113
Function 3: 15616893616113
Function 4: 15616893616113
Function 5: 15616893616113
Function 6: 15616893616113
Function 7: 15616893616113
This is a bit more representative of the real power of multithreading!
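About the wrong results when pthread_join is skipped: they come from several threads touching the same struct without any synchronization. If several threads really have to update one shared accumulator, a mutex is one common way to keep the result correct. The sketch below assumes that setup; shared_sum, worker_arg, add_twice_locked and sum_add_twice_shared are hypothetical names, not part of the program above.

```c
#include <pthread.h>

#define WORKERS 8 /* hypothetical thread count, mirrors THREADS_NB */

/* Hypothetical shared state: one accumulator guarded by one mutex. */
typedef struct shared_sum {
    pthread_mutex_t lock;
    long n;
} shared_sum;

/* Each thread gets its own copy of the input, so main never overwrites it. */
typedef struct worker_arg {
    shared_sum *shared;
    long i;
} worker_arg;

static void *add_twice_locked(void *ptr) {
    worker_arg *arg = ptr;
    long delta = arg->i + arg->i;           /* compute outside the lock */
    pthread_mutex_lock(&arg->shared->lock);
    arg->shared->n += delta;                /* only the update is serialized */
    pthread_mutex_unlock(&arg->shared->lock);
    return NULL;
}

/* Adds 2 * j for j = 0..WORKERS-1 from WORKERS threads; returns -1 on error. */
long sum_add_twice_shared(void) {
    shared_sum shared = { .n = 0 };
    pthread_t tid[WORKERS];
    worker_arg args[WORKERS];
    int started = 0;

    pthread_mutex_init(&shared.lock, NULL);
    for (int j = 0; j < WORKERS; j++) {
        args[j].shared = &shared;
        args[j].i = j;
        if (pthread_create(&tid[j], NULL, add_twice_locked, &args[j]) != 0)
            break;
        started++;
    }
    for (int j = 0; j < started; j++) /* join before reading shared.n */
        pthread_join(tid[j], NULL);
    pthread_mutex_destroy(&shared.lock);
    return (started == WORKERS) ? shared.n : -1;
}
```

The join is still there: the mutex only protects the concurrent updates, it does not tell main when the threads are done.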
Final words
Hence, unless each thread runs a fairly heavy computation, multithreading will probably not be worth it because of the cost of creating the threads and joining them. Skipping pthread_join is only an option if you do not need the results of the threads. But again, benchmark it!
Note that compiling with -O3, which enables auto-vectorization among other optimizations, gave a clear speedup in both pthread benchmarks above. Unlike threads, SIMD instructions have no startup cost at run time.
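For reference, this is the kind of loop a compiler can typically auto-vectorize: independent iterations over arrays that do not overlap. It is only a sketch. times_five_array is a hypothetical helper, and whether the loop is actually vectorized depends on the compiler version and the target CPU.

```c
#include <stddef.h>

/* Hypothetical helper: out[k] = in[k] * 5 for every element.
 * restrict promises the compiler that out and in do not overlap. */
void times_five_array(long *restrict out, const long *restrict in, size_t n) {
    for (size_t k = 0; k < n; k++)
        out[k] = in[k] * 5;
}
```

With GCC you can add -fopt-info-vec to the -O3 command line to get a report of the loops it vectorized.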
NB2: You can write iret[j] = pthread_create(...) to store the return code of each pthread_create call; it is 0 on success.
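To make the difference between the return code and the thread's own result concrete, here is a small sketch that checks both pthread_create and pthread_join and reads the value returned by the thread function through the second argument of pthread_join. square_worker and run_square are hypothetical names, and the result is assumed to fit in an intptr_t.

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical worker: hands its result back as the thread's return value. */
static void *square_worker(void *ptr) {
    long i = *(long *)ptr;
    return (void *)(intptr_t)(i * i);
}

/* Returns 0 on success and stores the thread's result in *result.
 * On failure, returns the error code from pthread_create or pthread_join. */
int run_square(long i, long *result) {
    pthread_t tid;
    void *ret = NULL;
    int rc = pthread_create(&tid, NULL, square_worker, &i);
    if (rc != 0)
        return rc;
    rc = pthread_join(tid, &ret); /* ret receives what square_worker returned */
    if (rc != 0)
        return rc;
    *result = (long)(intptr_t)ret;
    return 0;
}
```

For anything larger than a pointer-sized integer, writing the result into the argument struct (as the program above does with my_ptr->n) is the simpler option.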