OpenMP Parallel Computing Prime Numbers

Question

The following is simple code for computing prime numbers:

#define END   100000000

// Only pass odd values to this function
int is_prime(uint32_t v)
{
    uint32_t end = sqrt(v);
    for (uint32_t i = 3; i <= end; i += 2) {
        if ((v % i) == 0) {
            return 0;
        }
    }
    return 1;
}

int main(int argc, char **argv)
{
    // We'll grab 2 as it's the only even prime
    int prime_count = 1;

    uint32_t bracket = 10;

    #pragma omp parallel for num_threads(4)
    for (uint32_t i = 3; i < END; i += 2) {
        if (i > bracket) {
            printf("%12d\t%12d\n", bracket, prime_count);
            bracket *= 10;
        }
        if (is_prime(i)) {
            prime_count++;
        }
    }
    printf("%12d\t%12d\n", bracket, prime_count);
    return 0;
}

Our task is to use OpenMP pragmas to speed up the computation, since looping through 100,000,000 integers will take a long time. I tried putting the #pragma statement where I currently have it in my pasted code, and when I run the code I get the following output:

        10              4
        10              1
        10              4
        10              4
       100             25
      1000             25
     10000             25
    100000             25
  10000000            262
1410065408        5737822

Clearly seeing this as incorrect, I tried putting a #pragma above the for loop in the is_prime() function, but I got compiler error saying

error: "invalid branch to/from OpenMP structured block".

I realize this error is coming from the thread potentially entering the if statement and returning 0 in the middle of the parallel operations. The thread gets terminated prematurely and the entire operation comes apart. I feel like its most appropriate to put the #pragma inside the function call, as it can cut the run time down by v/4 for each call to the function. However, I do not know the proper syntax to do this. If someone can please help me out I'd greatly appreciate it. Thanks

score 2 · Answer 1 · answered Dec 13 '17 at 16:35

user3666197 gives you an example of how to do this by hand. That is fine for understanding, but you don't need to do it yourself - instead you should use the abstraction OpenMP provides - which is a reduction. A reduction will give each thread a private variable and after the parallel computation, those are added together to the original variable. A correct example without the intermediate output is simply:

int prime_count = 1;

#pragma omp parallel for reduction(+:prime_count)
for (uint32_t i = 3; i < END; i += 2) {
    if (is_prime(i)) {
        prime_count++;
    }
}
printf("%12d\t%12d\n", 0, prime_count);

To reproduce the intermediate output in parallel. I would rather have all threads work on a bracket and then one reports the aggregate result. This is still easily achievable with a reduction and gives you a correct summarized information of how many primes are in each chunk:

int prime_count = 1;

#pragma omp parallel
for (uint32_t bracket = 10; bracket <= END; bracket *= 10)
{
    uint32_t start = bracket / 10;
    if (start == 1) start = 3;
    else start++;
    #pragma omp for reduction(+:prime_count)
    for (uint32_t i = start; i < bracket; i += 2) {
        if (is_prime(i)) {
            prime_count++;
        }
    }
    #pragma omp single
    printf("%12d\t%12d %p\n", bracket, prime_count);
}
printf("%12d\t%12d\n", 0, prime_count);

I split the for and parallel because it can be costly to reopen parallel regions again and again.

score 1 · Accepted Answer · answered Dec 12 '17 at 08:45

Do not seek a change in syntax,
there is a need to change the concept of thinking - it happens in parallel

Given num_threads( 4 ) was instructed above, the forthcoming operations happen one besides other three. This means if one thread goes into prime_count++ and takes the "current" value of prime_count, so as to add +1, as the instruction commands, the other threads had no way in your code to wait, before the first taker finishes the ++ and stores the "new" value ( that ought be used by other takers ). Without any coordination tools, the uncontrolled additions look fine just on the first look, but once you realise, that you have many times a wrong "initial" value ( which can and in many cases was in the meantime ++-ed by any of the other threads ), so the final result is just a wreck havoc.

Explicit coordination and/or sharing of variable is possible, but is expensive in time.

[END] PRIME COUNT was omp.FOUND ==    664579 <= 10000000
[END] PRIME COUNT was omp.FOUND ==   1270607 <= 20000000
[END] PRIME COUNT was omp.FOUND ==   1857859 <= 30000000

INF(t:  0)|||: _id
|||(t:  0)|||:          10                1
|||(t:  0)|||:         100                7
|||(t:  0)|||:        1000               44
|||(t:  0)|||:       10000              311
INF(t:  3)|||: _id             
|||(t:  3)|||:          10                0
|||(t:  3)|||:         100                5
|||(t:  3)|||:        1000               37
|||(t:  3)|||:       10000              295
INF(t:  2)|||: _id             
|||(t:  2)|||:          10                1
|||(t:  2)|||:         100                6
|||(t:  2)|||:        1000               43
|||(t:  2)|||:       10000              308
INF(t:  1)|||: _id             
|||(t:  1)|||:          10                1
|||(t:  1)|||:         100                6
|||(t:  1)|||:        1000               43
|||(t:  1)|||:       10000              314
|||(t:  0)|||:      100000             2409
|||(t:  1)|||:      100000             2399
|||(t:  3)|||:      100000             2384
|||(t:  2)|||:      100000             2399
|||(t:  2)|||:     1000000            19669
|||(t:  3)|||:     1000000            19552
|||(t:  1)|||:     1000000            19623
|||(t:  0)|||:     1000000            19653
|||(t:  2)|||:    10000000           166237
|||(t:  0)|||:    10000000           166161
|||(t:  1)|||:    10000000           166204
|||(t:  3)|||:    10000000           165976
:--(t:  0):--:   100000000           464549
:--(t:  1):--:   100000000           464491
:--(t:  2):--:   100000000           464530
:--(t:  3):--:   100000000           464288
[END] PRIME COUNT was omp.FOUND ==  1857859 <= 30000000

After generating correct results, the performance matters:

Better move would be to principally avoid colliding reads/writes at all levels ( and best avoid mis-alignments and cache-inefficiencies ):

_{Hope you may also enjoy an extended syntax-decoration, borrowed from a true-[PARALLEL] language occam-pi, so as to emphasise the thread-aligned code-execution )}

#include<stdio.h>
#include<math.h>
#include<omp.h>

#define END   10000000

int is_prime( long long v )  // Only pass odd values to this function
{             long long end = sqrt( (double) v );
    for (     long long i  = 3;
                        i <= end;
                        i += 2 ) {
        if (      ( v % i ) == 0 ) return 0;
    }
    return 1;
}

int main( int argc, char **argv )
{
#define                                    OMP_NUM_OF_THREADS 4   // CONFIG:
                                                                  //   [SEQ]
    __uint64_t global_per_thread_bracket[  OMP_NUM_OF_THREADS];   //   :--
    __uint64_t global_per_thread_sub_count[OMP_NUM_OF_THREADS];   //   :--
    #pragma omp parallel num_threads(      OMP_NUM_OF_THREADS )   //   :-- [PAR]x4--------------------------||||-4-threads
    {                                                             //         {|...}                         ||||-4-threads
           int        thread_local_ID      = omp_get_thread_num();//         {|...}     // SET: _id         ||||-4-threads
           int        thread_local_COUNT   =  0;                  //         {|...}     // SET: SAFE ZERO   ||||-4-threads
           __uint64_t thread_local_BRACKET = 10;                  //         {|...}     // SET: SAFE   10   ||||-4-threads
                                                                  //         {|...}     //                  ||||-4-threads
           printf( "INF(t:%3d)|||: _id\n",                       thread_local_ID );     // PRINT:?(ORD)     ||||-4-threads
                                                                  //         {|...}     //                  ||||-4-threads
           for ( long long   i  = ( 3 +                    ( 2 * thread_local_ID ) );   // CALC:---------// ||||-4-threads
                             i <  END;                            //         {|...}     //               // ||||-4-threads
                             i += ( 2 * OMP_NUM_OF_THREADS )      //         {|...}     //               // ||||-4-threads
                             )                                    //         {|...}     //               // ||||-4-threads
           {        if (     i >  thread_local_BRACKET )          //         {|...}     // IF:local      // ||||-4-threads
                    {  printf( "|||(t:%3d)|||:%12d\t%12d\n",     thread_local_ID,       // PRINT:?(ORD)  // ||||-4-threads
                                (int)thread_local_BRACKET,        //         {|...}     //    local      // ||||-4-threads
                                (int)thread_local_COUNT           //         {|...}     //    local      // ||||-4-threads
                                );                                //         {|...}     //               // ||||-4-threads
                       thread_local_BRACKET *= (__uint64_t)10;    //         {|...}     // UPDATE:*=10   // ||||-4-threads
                    }                                             //         {|...}     //               // ||||-4-threads
                    if ( is_prime(i) ) thread_local_COUNT++;      //         {|...}     // IF:local++    // ||||-4-threads
           }  // ------------------------------------------------------------{|...}----------------------// ||||-4-threads
           global_per_thread_sub_count[                          thread_local_ID] = thread_local_COUNT;  // ||||-4-threads
           global_per_thread_bracket[                            thread_local_ID] = thread_local_BRACKET;// ||||-4-threads
    } // ---------------------------------------------------------//_________{|___}_________________________||||-4-threads
    long long prime_count = 1;                                    //   :--  SET:=1 as 2 is prime
    for ( int thread_OrdNo  = 0;                                  //   :--  CONSOLIDATE: -----//
              thread_OrdNo <  OMP_NUM_OF_THREADS;                 //   :--                    //
              thread_OrdNo++                                      //   :--                    //
              )                                                   //   :--                    //
    {   prime_count+= global_per_thread_sub_count[thread_OrdNo];  //   :--  UPDATE: +=        //
        printf( ":--(t:%3d):--:%12d\t%12d\n",(int)thread_OrdNo,   //   :--  PRINT:            //
                 (int)global_per_thread_bracket[  thread_OrdNo],  //   :--                    //
                 (int)global_per_thread_sub_count[thread_OrdNo]   //   :--                    //
                 );                                               //   :--                    //
    } // --------------------------------------------------------------:-- -------------------//
    printf( "[END] PRIME COUNT was omp.FOUND == %9lld <= %lld\n", //   :--
             prime_count,                                         //   :--
             END                                                  //   :--
             );                                                   //   :--
}

Feel free to further experiment with the code Live / on-line.

Thank you for the very detailed response! This cleared up a lot! — danielschnoll, Dec 12 '17 at 19:32
I noticed your code runs much slower than a fixed original. Interestingly the performance of `is_prime` seems to depend strongly on the datatype: `int64_t << uint64_t << uint32_t`. — Zulan, Dec 14 '17 at 07:47

score 1 · Answer 3 · 2022-01-03T21:14:23.730

This method is 5 times faster than the corrected OP. It first makes an array of primes up to the square root, single threaded. Then the multiple threads only read from that array.

/* Prime numbers by trial division, with parallel 2nd step */
/* Trick is to make sure the SQR array has top prime > SQR to make the for loop work until the end / the last p[i]
   Lowest MAX is 4 */
/* Timing: 0.13s for up to 10M (4 cores, turbo boost 3GHz) */

/* single threaded, array filling step only: timing:
./a.out 1000000000000
top: 78497 1000003 1000006000009
-> 0.04s for 78K primes up to 1M, ready for 1T (would take hours/years) 
-> does not matter overall
*/

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <omp.h>

/* Fill p[] with primes up to 'max', and one more */
int
primearray(int max, int p[]) {

    p[0] = 3;
    int top = 0, n, i;

    for (n = 5;; n += 2)
        for (i = 0;; i++)

            if (n % p[i] == 0)
                break;
            else
                if (p[i]*p[i] > n) {

                    p[++top] = n;
                    if (n > max)
                        return top;          // return WITH last n
                    break;
                }
}
/* Use primes in p[] to check up to 'max' */
/* "5" is p[1] is the 3rd prime, so start count from 2 */
int
primecount_omp(long max, int p[]) {

    int n, i, sum = 2;

    #pragma omp parallel for private(i) schedule(guided) reduction(+:sum)

    for (n = 5; n <= max; n += 2)
        for (i = 0;; i++)

            if (n % p[i] == 0)
                break;
            else
                if (p[i]*p[i] > n) {
                    sum++;
                    break;
                }
    return sum;
}
/* Call primearray() with SQR and then primecount_omp() with MAX */
/* PNT (x/logx) to find approx. size of array p[] 'psize'. Needs some safety margin. */
int
main(int argc, char** argv) {

    if (argc < 2) {
        printf("No limit given\n");
        return 1;
    }
    long MAX = strtoul(argv[1], 0, 10);
    int  SQR = sqrt(MAX);

    int   psize = 50 + 1.2 * SQR / log(SQR);
    int      *p = malloc(psize * sizeof*p);

    int top = primearray(SQR, p);

    printf("psize: %d top: %d %d %ld\n", psize, top, p[top], (long)p[top]*p[top]);

    int sum = primecount_omp(MAX, p);

    printf("Count: %d (%.2f%%)\n", sum, (double) 100*sum / MAX) ;
    double lnx = MAX / log(MAX);
    printf("LnX:   %.2f (%f)\n",  lnx, sum / lnx);

    return 0;
}

The two functions are similar, but not identical. The first one has:

   p[++top] = n;
   if (n > max)
       return top;
   break;

This is more complicated than the 2nd, parallel one:

   sum++;
   break;

And while sum++ can be done with a reduction(+:sum) clause, the ++top would have to be coordinated between the threads.

This is the fastest simple trial division method, I believe. It has an array, two steps and OMP, but still is simple.

The output is:

psize: 520 top: 445 3163 10004569
Count: 664579 (6.65%)
LnX:   620420.69 (1.071175)

meaning 520 integers were allocated, 445 were used: 3163 is the largest prime needed, with a square above max of 10M.

The exact (allocated) array sixe is not so important, just has to be big enough. A few thousand will get you very far.

I like this because it does the recursive idea:

To find the primes up to N, find first the primes up to SQRT(N)

with only one "recursion" (formally w/o) and offering opportunity to easily parallelize the second, big step.

If it is done in one step, then either you need an additional test or it continues to fill the array to the end.

If it is done really recursively, you sqrt your way down to 3 or 2, and then use short arrays to make longer ones, all in the same way.

Mathieu · Answer 4 · 2017-12-14T07:43:09.223

All the threads could pass in the if block at the same time and modify bracket inconsistency

The pragma "#pragma omp critical" can prevent this:

#define END   100000000

// Only pass odd values to this function
int is_prime(uint32_t v)
{
    uint32_t end = sqrt(v);
    for (uint32_t i = 3; i <= end; i += 2) {
        if ((v % i) == 0) {
            return 0;
        }
    }
    return 1;
}

int main(int argc, char **argv)
{
    // We'll grab 2 as it's the only even prime
    int prime_count = 1;

    uint32_t bracket = 10;

    #pragma omp parallel for num_threads(4)
    for (uint32_t i = 3; i < END; i += 2) {
        #pragma omp critical
        {
            if (i > bracket) {
                printf("%12d\t%12d\n", bracket, prime_count);
                bracket *= 10;
            }
        }
        if (is_prime(i)) {
            prime_count++;
        }
    }
    printf("%12d\t%12d\n", bracket, prime_count);
    return 0;
}

The code above won't totally answer your question: I think you wanted to know how much primes were under 10, 100, 1000, ...

But the openmp way to handle for is the following: it cuts the loop in 4:

the first from 0 to N/4,
the second from N/4 to N/2,
the first from N/2 to 3*N/4,
the second from 3*N/4 to N.

So as to know, how the primes are distributed, you should have some different approach:

int main(int argc, char **argv)
{

    /* number_of_prime[0] is the number of primes between   0 and 10 
       number_of_prime[1] is the number of primes between 10 and 100 
       ...  */
    int number_of_prime[8] = {1, 0};

    uint32_t bracket = 10;

    #pragma omp parallel for num_threads(4)
    for (uint32_t i = 3; i < END; i += 2) {            
        if (is_prime(i)) {
            /* here, i is a prime */
            if (i > 10000000)
            {
                #pragma omp critical
                number_of_prime[7]++;
            }
            else if (i > 1000000)
            {
                #pragma omp critical
                number_of_prime[6]++;
            }
            else if (i > 100000)
            {
                #pragma omp critical
                number_of_prime[5]++;
            }
            else if (i > 10000)
            {
                #pragma omp critical
                number_of_prime[4]++;
            }
            else if (i > 1000)
            {
                #pragma omp critical
                number_of_prime[3]++;
            }
            else if (i > 100)
            {
                #pragma omp critical
                number_of_prime[2]++;
            }
            else if (i > 10)
            {
                #pragma omp critical
                number_of_prime[1]++;
            }
            else 
            {
                #pragma omp critical
                number_of_prime[0]++;
            }                            
        }            
    }

    /* add some printf hero to display results */

    return 0;
}

Edit:

As Zulan said, I've should have use at least atomic instead of critical

`prime_count` in your example is still incremented incorrectly (race condition). Using `omp critical` to increment a counter is a performance. Use a `reduction`, or at least a `atomic` instead. — Zulan, Dec 13 '17 at 16:29
1. Use a reduction for the count, 2. the bracket update may be wrong, but is at least inelegant: put the bracket loop outside the parallel loop over the factors. — Victor Eijkhout, Dec 16 '21 at 14:17

OpenMP Parallel Computing Prime Numbers

4 Answers4

Do not seek a change in syntax, there is a need to change the concept of thinking - it happens in parallel

After generating correct results, the performance matters:

Do not seek a change in syntax,
there is a need to change the concept of thinking - it happens in parallel