Why doesn't multithreading improve performance in this program for finding primes?

Question

Running time is about the same, regardless of the number of threads. I am having trouble figuring out why. I know that the threads are running in parallel as they are supposed to, but I don't have even a good guess as to why there would be no performance improvement. (approx. 21 seconds to find all primes less than 8 million, for both single and multiple threads) What is going on here?

typedef struct prime_finder_vars {
    long from;
    long to;
    int idx;
} PrimeFinderVars;

int is_prime(long num) {
    int limit = round(sqrt(num));
    for (long i = 2; i <= limit; i++) {
        if (num % i == 0)
            return FALSE;
    }
    return TRUE;
}

void *prime_finder(void *pf) {

    PrimeFinderVars *pf_vars = (PrimeFinderVars *) pf;

    long next_cand = pf_vars->from;
    while (next_cand < pf_vars->to) {
        if (is_prime(next_cand)) {
            ++counts[pf_vars->idx];
        }
        next_cand += 2;
    }
    return pf;
}


int main(void) {

    struct timespec start;
    struct timespec end;
    double start_sec, end_sec, elapsed_sec;
    int sum = 0;

    clock_gettime(CLOCK_REALTIME, &start);

    pthread_t threads[NUM_THREADS];
    PrimeFinderVars vars[NUM_THREADS];

    int slice_size = SEARCH_RANGE / NUM_THREADS;

    for (int i = 0; i < NUM_THREADS; i++) {

        vars[i].from = i * slice_size + 1;
        vars[i].to = (i + 1) * slice_size;
        vars[i].idx = i;

        pthread_create(&threads[i], NULL, prime_finder, &vars[i]);

    }

    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_join(threads[i], NULL);
        sum += counts[i];
    }

    clock_gettime(CLOCK_REALTIME, &end);

    start_sec = start.tv_sec + start.tv_nsec / NANO_PER_SEC;
    end_sec = end.tv_sec + end.tv_nsec / NANO_PER_SEC;
    elapsed_sec = end_sec - start_sec;
}

You don't show how `NANO_PER_SEC` is defined, but unless it is something like `1E9` or `1000000000.0` (a `double` value), the expression `start.tv_nsec / NANO_PER_SEC` will always be zero because the `tv_nsec` value will always be smaller than `NANO_PER_SEC`. Similarly with `end.tv_nsec`. You don't show your `isprime()` function, either. There are faster and slower variants possible. — Jonathan Leffler, Oct 16 '20 at 02:39
If you're really seeing *no* performance difference then you're probably not getting *bona fide* parallel execution. That are any number of possible reasons for that, but one of the more likely possibilities is that you may need to tell your compiler that you really want it. It's not uncommon for compilers to require special options for compiling parallel programs. With GCC on most platforms, for example, you need to specify the `-pthread` option at both compile and link time. — John Bollinger, Oct 16 '20 at 18:20

score 4 · Answer 1 · answered Oct 16 '20 at 11:00

4

Workload is not well balanced between threads. Each thread has to check about the same number of candidates, but for threads with higher indexes it takes more time to check each candidate, than for threads with lower indexes.

I would rewrite the main loop like this:

for (long candidate = pf_vars->idx;
     candidate < SEARCH_RANGE;
     candidate += NUM_THREADS) {
    if (is_prime (candidate)) {
        ++counts [pf_vars->idx];
    }
}

NUM_THREADS has to be prime itself for this to work efficiently.

Also, I doubt your code produces correct results, as in case pf_vars->from is even, prime_finder will check only even candidates which doesn't make much sense.

Also, threads run in parallel only when they run on different cores. If the number of thread is much more than the number of cores, then performance will degrade as switching a core between several threads also takes some time.

answered Oct 16 '20 at 11:00

Mikhail Vladimirov

13,572
1
38
40

Also `pthread_create` and `pthread_join` are very expensive functions. You could probably check quite a few of the smaller primes in the time it takes to do one `pthread_create`. – JeremyP Oct 16 '20 at 11:11
1

@JeremyP These functions are called only `NUM_THREADS` times each, so probably don't consume significant time in total. – Mikhail Vladimirov Oct 16 '20 at 11:12
They involve system calls and probably memory read/write barriers so you can almost certainly do a lot of work in the time it takes to return. Also, you don't know how big `NUM_THREADS` is. The question says "if there is a large number of threads". – JeremyP Oct 16 '20 at 11:16
You're correct that it doesn't work properly if `pf_vars->from` is even; I have just been choosing values so that the slice size for the threads is an even factor of the search range. There is also not a ton of overhead associated with the threads; the time difference between running with 2 threads and with 1024 is only about one second for a search range of 8 million candidates. Could the problem just be that the threads are all running on the same core, so they're not really running in parallel? – amiller3513 Oct 16 '20 at 11:47
How many cores do you nave? Common practice is to set the number of thread to double the number of cores. – Mikhail Vladimirov Oct 16 '20 at 12:01
4, but the problem is that there is virtually no difference in performance between running with 1 thread or 8. – amiller3513 Oct 16 '20 at 15:05
@MikhailVladimirov I was wrong about the cost of the thread functions, at least with < 50 threads. See my tests below. – JeremyP Oct 16 '20 at 15:54

JeremyP · Accepted Answer · 2020-10-17T15:16:09.957

This is an interesting question. Everything Mikhail Vladimirov says is true but I decided to do some testing on my laptop to see what I got. My laptop is a modern MacBook pro with an eight core i9. I'm not sure if it is hyper threaded or not, but here are my results:

I tested with the number of threads varying between 1 and 50 and a search range of 10,000,000.

With one thread it takes nearly eleven seconds but this drops rapidly to around 1.5 seconds with 16 threads, and it doesn't get any better after that.

My conclusion is

My comment on Mikhail's answer about the cost of the thread functions is wrong, at least on my platform. I see no increased overhead with more threads
There's something wrong with your thread library.

I think you probably need to satisfy yourself that the threads really are running in parallel on separate cores. One explanation of your results could be that they are all competing for the same CPU.

Just for fun I decided to try profiling the program.

Each step represents another core going to 100%. I'm not sure why the prt with three threads doesn't go to 300%, but you can see with four threads that it goes up to 400% straight away but comes down in steps of 100%. This is the effect of you splitting the task into equal ranges and the threads dealing with lower numbers finishing sooner.

The first 16 data points

Threads Time
1   11.893418
2   7.352520
3   5.117278
4   4.062026
5   3.511605
6   2.892274
7   2.401555
8   2.172573
9   1.910534
10  1.864023
11  1.860944
12  1.369277
13  1.628883
14  1.196646
15  1.626215
16  1.548878

The code I used to produce the test results (slightly modified from yours).

#include <stdio.h>
#include <pthread.h>
#include <math.h>
#include <stdbool.h>

#define SEARCH_RANGE    10000000
#define NANO_PER_SEC    1000000000

typedef struct prime_finder_vars {
    long from;
    long to;
    int* count;
} PrimeFinderVars;

int is_prime(long num) {
    int limit = round(sqrt(num));
    for (long i = 2; i <= limit; i++) {
        if (num % i == 0)
            return false;
    }
    return true;
}


void *prime_finder(void *pf)
{

    PrimeFinderVars *pf_vars = (PrimeFinderVars *) pf;

    long next_cand = pf_vars->from;
    while (next_cand < pf_vars->to)
    {
        if (is_prime(next_cand))
        {
            (*pf_vars->count)++ ;
        }
        next_cand += 2;
    }
    return pf;
}


void trial(int numThreads)
{
    struct timespec start;
    struct timespec end;
    double start_sec, end_sec, elapsed_sec;
    int sum = 0;

    clock_gettime(CLOCK_REALTIME, &start);

    int counts[numThreads];
    pthread_t threads[numThreads];
    PrimeFinderVars vars[numThreads];

    int slice_size = SEARCH_RANGE / numThreads;

    for (int i = 0; i < numThreads; i++)
    {
        counts[i] = 0;
        vars[i].from = i * slice_size + 1;
        vars[i].to = (i + 1) * slice_size;
        vars[i].count = &counts[i];

        pthread_create(&threads[i], NULL, prime_finder, &vars[i]);

    }

    for (int i = 0; i < numThreads; i++)
    {
        pthread_join(threads[i], NULL);
        sum += counts[i];
    }

    clock_gettime(CLOCK_REALTIME, &end);

    start_sec = (double)start.tv_sec + (double)start.tv_nsec / NANO_PER_SEC;
    end_sec = (double)end.tv_sec + (double)end.tv_nsec / NANO_PER_SEC;
    elapsed_sec = end_sec - start_sec;
    printf("%d\t%f\n", numThreads, elapsed_sec);
}

int main()
{
    printf("Threads\tTime\n");
    for (int threads = 1 ; threads <= 50 ; ++threads)
    {
        trial(threads);
    }
}

I pursued this a bit further over the last day or two. Firstly, I was intrigued about why there seemed to be a double line of timings: After about 12 threads, a run would take either 1.5 seconds or 1 second. I theorised above it was because of the bug Mikhail mentioned so I plotted the actual answer given for each number of threads and found that, while the answer was usually around 664,579 it would often be around half that and unsurprisingly, when the answer was half the real answer, that corresponded to the lower of the two timing lines.

So I fixed that bug and the double line effect disappeared. However, I was still getting more than one different answer depending on the number of threads.

The reason for this is that there are two more bugs.

the original algorithm fails to test the top number in each range.
The size of the ranges is calculated by doing an integer division of the search range by the number of threads. Unless there is no remainder, numbers at the top of the search range will not be checked.

I fixed both bugs and did a third run. This didn't affect the timings appreciably, but I got the same answer for each number of threads used.

For comparison, I wrote a sieve of Eratosthenes and timed that too. Using it and a single thread took only 0.2 seconds - about seven times faster than the fastest number of cores.

I've published a spreadsheet of the results and there's a git repo of the code.

Interesting! On my machine, I have the same results as before -- multithreading does not increase performance. I will try to figure out what's going on there. This is a great answer. — amiller3513, Oct 16 '20 at 16:15
@amiller3513 One other point, you see how there seem to be two lines on the graph with the result alternating between them. I think that is an artefact of the bug that Mikhail mentioned — JeremyP, Oct 16 '20 at 16:18
Interestingly, this is among the rare cases where splitting the work more ways than there are cores might be helpful, as it can somewhat mitigate the effect of some threads having more work to do than others. — John Bollinger, Oct 16 '20 at 16:28
On the other hand, prime finding is a poor example case in general, because it's hard to write an efficient parallel algorithm for it. Reasonably good (single-threaded) sieve-based algorithms are easy to write, and they *vastly* outperform the OP's approach -- by about two orders of magnitude in my quick and dirty test over the OP's prime-finding range *vs* the best my 12-core workstation could do (maxing out at around 16 threads). — John Bollinger, Oct 16 '20 at 16:35
@JohnBollinger The time complexity for the sieve of Eratosthenes is *O(n log log n)* whereas the time complexity for this algorithm is *O(n^1.5)* I think (for each n, you iterate up to the square root of n). Furthermore, the true Sieve only requires you to do additions and comparisons. This algorithms requires a division for each test, which is expensive in itself. — JeremyP, Oct 17 '20 at 15:23
I believe, that raising the number of threads for beyond the number of cores still shows some benefits only because this unequal workload problem, so after a number of low-index threads already finished, there are still enough high-index threads to fully utilize all the cores. — Mikhail Vladimirov, Oct 18 '20 at 19:11
@MikhailVladimirov +1. That's a good point. Maybe I'll try modifying the algorithm in line with your answer to see what effect that has. — JeremyP, Oct 18 '20 at 20:21

Jonathan Leffler · Answer 3 · 2020-10-16T21:48:12.713

After you've identified 2 and 3 as prime, all remaining primes take the form 6N±1, for N >= 1. To balance the workload across K threads, you should have each of the K threads stepping through its own sequence of values for N: thread T working on 1 + T + K * X, where for each thread, X sequences from 0 upwards. If you have 8 threads, it means thread 0 works on N₀ = { 1, 9, 17, … }, etc. It still means thread K-1 does more work than thread 0 because it is tackling bigger numbers, but the discrepancy is much less than when you slice the ranges horizontally. This means you need to provide each thread with a starting number, S, and the total number of threads, K, and the thread will then set a counter x to values 0, 1, ... and check for primes 6(S + xK)±1.

Now, with that as a decent basis for creating a multi-threaded prime finder. This is closely based on your code. It does use some code that is available in my SOQ (Stack Overflow Questions) repository on GitHub as files timer.c, timer.h, stderr.c and stderr.h in the src/libsoq sub-directory. It uses function isprime() renamed from IsPrime3B() found in the file isprime.c in the src/Primes sub-directory of my SOQ repository. This program is prime-thread.c from the same src/Primes directory.

/* SO 6438-1942 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include "stderr.h"
#include "timer.h"

#define NANO_PER_SEC 1.0E9
enum { NUM_THREADS = 8 };
enum { MAX_NUMBER  = 10000000 };

static size_t counts[NUM_THREADS];

static const unsigned int small_primes[] =
{
     5,  7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47,
    53, 59, 61, 67, 71, 73, 79, 83, 87, 89, 91, 97
};
enum { NUM_SMALL_PRIMES = sizeof(small_primes) / sizeof(small_primes[0]) };

/* IsPrime3B() from isprime.c - renamed is_prime() */
static int is_prime(unsigned number)
{
    if (number <= 1)
        return 0;
    if (number == 2 || number == 3)
        return 1;
    if (number % 2 == 0 || number % 3 == 0)
        return 0;
    for (unsigned i = 0; i < NUM_SMALL_PRIMES; i++)
    {
        if (number == small_primes[i])
            return 1;
        if (number % small_primes[i] == 0)
            return 0;
    }
    /* After 97, the next prime numbers are 101, 103, 107, 109 */
    /*
    ** It would be feasible to start this loop from:
    ** i = (((small_primes[NUM_SMALL_PRIMES - 1] + 1) / 6) + 1) * 6
    */
    for (unsigned i = 102; (i - 1) <= number / (i - 1); i += 6)
    {
        if (number % (i - 1) == 0 || number % (i + 1) == 0)
            return 0;
    }
    return 1;
}

typedef struct prime_finder_vars
{
    unsigned from;
    unsigned to;
    unsigned increment;   /* Number of threads */
    unsigned idx;
} PrimeFinderVars;

static void *prime_finder(void *pf)
{
    PrimeFinderVars *pf_vars = (PrimeFinderVars *) pf;
    printf("Thread %u: from = %u, to = %u, inc = %u\n",
           pf_vars->idx, pf_vars->from, pf_vars->to, pf_vars->increment);

    unsigned next = pf_vars->from;
    while (next < pf_vars->to) {
        unsigned six_n = 6 * next;
        if (is_prime(six_n - 1))
            ++counts[pf_vars->idx];
        if (is_prime(six_n + 1))
            ++counts[pf_vars->idx];
        next += pf_vars->increment;
    }
    printf("Thread %u: done\n", pf_vars->idx);
    return pf;
}

int main(int argc, char **argv)
{
    err_setarg0(argv[0]);
    if (argc != 1)
        err_usage("");
    struct timespec start;
    struct timespec end;
    double start_sec, end_sec, elapsed_sec;
    int sum = 0;
    Clock clk;

    clk_init(&clk);

    clk_start(&clk);
    clock_gettime(CLOCK_REALTIME, &start);

    pthread_t threads[NUM_THREADS];
    PrimeFinderVars vars[NUM_THREADS];

    int max_n = (MAX_NUMBER + 5) / 6;

    for (int i = 0; i < NUM_THREADS; i++)
    {
        vars[i].from = i + 1;
        vars[i].to = max_n;
        vars[i].idx = i;
        vars[i].increment = NUM_THREADS;

        int rc;
        if ((rc = pthread_create(&threads[i], NULL, prime_finder, &vars[i])) != 0)
            err_syserr("failed to create thread %d: ", i);
    }

    for (int i = 0; i < NUM_THREADS; i++)
    {
        pthread_join(threads[i], NULL);
        sum += counts[i];
    }

    clock_gettime(CLOCK_REALTIME, &end);
    clk_stop(&clk);

    start_sec = start.tv_sec + start.tv_nsec / NANO_PER_SEC;
    end_sec = end.tv_sec + end.tv_nsec / NANO_PER_SEC;
    elapsed_sec = end_sec - start_sec;
    printf("Time 1: %.6f\n", elapsed_sec);
    char buffer[32];
    printf("Time 2: %s\n", clk_elapsed_us(&clk, buffer, sizeof(buffer)));

    /* Because 2 and 3 are primes but are not analyzed */
    size_t t_count = 2;
    for (int i = 0; i < NUM_THREADS; i++)
    {
        t_count += counts[i];
        printf("%d: %7zu primes found\n", i, counts[i]);
    }
    printf("Total primes found up to %d = %zu\n", MAX_NUMBER, t_count);

    return 0;
}

Example output:

$ timecmd -u -- prime-thread
2020-10-16 12:15:05.101785 [PID 75174] prime-thread
Thread 0: from = 1, to = 1666667, inc = 8
Thread 7: from = 8, to = 1666667, inc = 8
Thread 2: from = 3, to = 1666667, inc = 8
Thread 3: from = 4, to = 1666667, inc = 8
Thread 5: from = 6, to = 1666667, inc = 8
Thread 4: from = 5, to = 1666667, inc = 8
Thread 6: from = 7, to = 1666667, inc = 8
Thread 1: from = 2, to = 1666667, inc = 8
Thread 0: done
Thread 6: done
Thread 4: done
Thread 7: done
Thread 3: done
Thread 5: done
Thread 2: done
Thread 1: done
Time 1: 0.231135
Time 2: 0.231135
0:   83090 primes found
1:   83176 primes found
2:   83023 primes found
3:   82996 primes found
4:   83060 primes found
5:   82995 primes found
6:   83179 primes found
7:   83058 primes found
Total primes found up to 10000000 = 664579
2020-10-16 12:15:05.341489 [PID 75174; status 0x0000]  -  0.239704s
$

There are indeed 664,579 primes less than 10,000,000.

Note that the timecmd program counts the entire running time (start-up and printing) of the prime-thread, whereas the internal timing only counts the thread creation, running, and termination time. That accounts for the 8 ms timing difference. (It's a home-brew program that I use for timing commands. It's loosely similar to the system-provided time command — but significantly different too.)

Given a list of the primes up to 10,000,000, it would be feasible to calculate how many primes each thread should have found. Given that the totals are correct, it is unlikely that there's a problem there, though.

Timing

Note that the question says it took 21 seconds to count the number of primes up to 8,000,000. This code took 0.231 seconds to count the number of primes up to 10,000,000. That suggests that the isprime() function in use is not as good as the one I used.

Indeed, the code shown is:

int is_prime(long num) {
    int limit = round(sqrt(num));
    for (long i = 2; i <= limit; i++) {
        if (num % i == 0)
            return FALSE;
    }
    return TRUE;
}

This does far more work than necessary. There's only one even prime number, 2. This code checks 4, 6, 8, … which are trivially non-prime. That's twice as much work as necessary. Checking for 2 and then only checking odd numbers would be a significant improvement. Checking 2, 3, and then numbers which match 6N±1 gives another improvement.

Even so, checking one third as much data would only improve things by a factor of 3. It is likely that the unbalanced workload is a bigger factor. With 8 threads (0..7), thread 7 working on the range 7,000,000..8,000,000 and it has a lot more computation to do than thread 0 working on the range 0..1,000,000, even though there are fewer primes for it to count.

The question doesn't show a complete MCVE (Minimal, Complete, Verifiable Example — or MRE or whatever name SO now uses) or an SSCCE (Short, Self-Contained, Correct Example). It doesn't show how many threads are in use.

I have not, but perhaps should, parameterize prime-thread.c to take a variable number of threads and a variable range for analysis (and remove the thread debugging printing) and see how much it changes behaviour. On my machine, it is unlikely that more threads will improve things; it may be that fewer threads would be better.

I have a program primes which prints the primes in a given range using a Sieve of Eratosthenes. It prints all 664,579 primes up to 10 million in about 0.135 seconds when the output goes to (SSD) file or /dev/null. That is significantly faster than prime-thread manages to count the primes. Quite a lot of the benefit there is from the better algorithm. This isprime() function does a lot of computation for each candidate number.

Two lessons to draw from this:

Algorithms matter.
Threads aren't a panacea for speeding things up.

Thank you! This is fascinating and informative. I'm aware that my method is quite naive -- this project was mainly an exercise in using multiple threads, as I'm quite new to C. I found the insight here valuable, though. — amiller3513, Oct 16 '20 at 21:45

Why doesn't multithreading improve performance in this program for finding primes?

3 Answers3

Timing