
I've tried this problem from Project Euler where I need to calculate the sum of all primes below two million.

This is the solution I've come up with -

#include <stdio.h>

int main() {
    long sum = 5;  // Already counting 2 and 3 in my sum.
    int i = 5;     // Checking from 5 
    int count =  0;
    while (i <= 2000000) {
        count = 0;
        for (int j = 3; j <= i / 2; j += 2) {
            // Checking if i (starting from 5) is divisible from 3 
            if (i % j == 0) {   // to i/2 and only checking for odd values of j
                count = 1;
            }
        }
        if (count == 0) {
            sum += i;
        }
        i += 2;
    }
    printf("%ld ", sum);
}

It takes around 480 secs to run and I was wondering if there was a better solution or tips to improve my program.

________________________________________________________
Executed in  480.95 secs    fish           external
   usr time  478.54 secs    0.23 millis  478.54 secs
   sys time    1.28 secs    6.78 millis    1.28 secs
Sai Nishwanth
  • What are the constraints? Why not just run this program, get the answer, then hard-code that answer and have a program that gives that answer faster than you can blink? – Beta Mar 23 '22 at 07:02
  • The time the actual summation takes is negligible here; the question should be more like "How do I find prime numbers quickly?", which is a **huge** topic in itself with loads of different ways, many of which are documented online if you do a quick search. – Klas-Kenny Mar 23 '22 at 07:06
  • You should check out this question: https://stackoverflow.com/questions/453793/which-is-the-fastest-algorithm-to-find-prime-numbers – Klas-Kenny Mar 23 '22 at 07:11
  • With `j <= i/2` -> `j <= sqrt(i)` you would already gain a lot. – Jabberwocky Mar 23 '22 at 07:16
  • In addition to testing divisors up to and including the square root of a number only, you could also stop the inner loop early when you have determined that a number is composite. – M Oehm Mar 23 '22 at 07:29
  • @Jabberwocky: or `j * j <= i` – pmg Mar 23 '22 at 08:22

3 Answers


With two small modifications your code becomes orders of magnitude faster:

#include <stdio.h>
#include <math.h>

int main() {
  long long sum = 5;    // we need long long, long might not be enough
                        // depending on your platform
  int i = 5;
  int count = 0;
  while (i <= 2000000) {
    count = 0;

    int limit = sqrt(i);                  // determine upper limit once and for all
    for (int j = 3; j <= limit; j += 2) { // use upper limit sqrt(i) instead of i/2
        if (i % j == 0) {
          count = 1;
          break;                          // break out from loop as soon 
                                          // as number is not prime
        }
    }
    if (count == 0) {
      sum += i;
    }
    i += 2;
  }
  printf("%lld ", sum);                   // we need %lld for long long
}

All explanations are in the comments.

But there are certainly better and even faster ways to do this.

I ran this on my 10-year-old MacPro, and for primes up to 20 million it took around 30 seconds.
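
For reference, the sqrt() call can also be dropped entirely by bounding the trial division with an integer test, as suggested in the comments. This is only a sketch (the is_prime_odd helper is made up for the example): the condition i / j >= j is equivalent to j * j <= i, but it uses no floating point and cannot overflow, and compilers can typically reuse the same division for i / j and i % j.

#include <stdio.h>

// Sketch: trial division for an odd i >= 3, stopping once j exceeds sqrt(i).
// The bound test i / j >= j is equivalent to j * j <= i, but needs no
// floating-point sqrt() and cannot overflow.
static int is_prime_odd(long i) {
    for (long j = 3; i / j >= j; j += 2) {
        if (i % j == 0)
            return 0;
    }
    return 1;
}

int main() {
    long long sum = 5;                     // 2 and 3 counted up front, as in the question
    for (long i = 5; i <= 2000000; i += 2) {
        if (is_prime_odd(i))
            sum += i;
    }
    printf("%lld\n", sum);
}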

Jabberwocky
  • `int limit = sqrt(i);` works in practice, but I wouldn't recommend it: either use `int limit = round(sqrt(i));` or `j * j <= i` or possibly `i / j >= j`. – chqrlie Mar 23 '22 at 10:22
  • Thank you! I completely forgot about breaking the loop when I encounter a non-prime. And you're right sqrt is better than i/2. – Sai Nishwanth Mar 23 '22 at 11:24

This program computes the sum for 2 million near instantly (even in a Debug build), and needs just one second for 20 million (Windows 10, 10-year-old i7 @ 3.4 GHz, MSVC 2019).

Note: I didn't have time to set up my C compiler, which is why there is a cast on the malloc.

The "big" optimization is to store square values AND prime numbers, so absolutely no impossible divisor is tested. Since there is no more than 1/10th of primes within a given integer interval (heuristic, a robust code should test that and realloc the primes array when needed), the time is drastically cut.

#include <stdio.h>
#include <malloc.h>

#define LIMIT 2000000ul     // Computation limit.

typedef struct {
    unsigned long int   p ;     // Store a prime number.
    unsigned long int   sq ;    // and its square.
} prime ;

int main() {
    prime* primes = (prime*)malloc((LIMIT/10)*sizeof(*primes)) ;        // Store found primes. Can quite safely use 1/10th of the whole computation limit.
    unsigned long int primes_count=1 ;
    unsigned long int i = 3 ;
    unsigned long long int sum = 0 ;
    unsigned long int j = 0 ;
    int is_prime = 1 ;

    // Feed the first prime, 2.
    primes[0].p = 2 ;
    primes[0].sq = 4 ;
    sum = 2 ;
    // Parse all numbers up to LIMIT, ignoring even numbers.
    // Also reset the "is_prime" flag at each loop.
    for (i = 3 ; i <= LIMIT ; i+=2, is_prime = 1 ) {
        // Parse all previously found primes.
        for (j = 0; j < primes_count; j++) {
            // Above sqrt(i)? Break, i is a prime.
            if (i<primes[j].sq)
                break ;
            // Found a divisor? Not a prime (and break).
            if ((i % primes[j].p == 0)) {
                is_prime = 0 ;
                break ;
            }
        }
        // Add the prime and its square to the array "primes".
        if (is_prime) {
            primes[primes_count].p = i ;
            primes[primes_count++].sq = i*i ;
            // Compute the sum on-the-fly
            sum += i ;
        }
    }
    printf("Sum of all %lu primes: %llu\n", primes_count, sum);
    free(primes) ;
}
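
A minimal sketch of the on-demand reallocation mentioned above (the starting capacity of 16 and the widened unsigned long long sq field are assumptions, not taken from the code above; the wider sq also keeps squares of large primes from wrapping on platforms with a 32-bit long):

#include <stdio.h>
#include <stdlib.h>

#define LIMIT 2000000ul

typedef struct {
    unsigned long p;            // a prime number
    unsigned long long sq;      // and its square
} prime;

int main() {
    unsigned long capacity = 16;                            // deliberately small start
    prime *primes = malloc(capacity * sizeof(*primes));
    unsigned long primes_count = 1;
    unsigned long long sum = 2;

    if (primes == NULL)
        return 1;
    primes[0].p = 2;
    primes[0].sq = 4;

    for (unsigned long i = 3; i <= LIMIT; i += 2) {
        int is_prime = 1;
        for (unsigned long j = 0; j < primes_count; j++) {
            if (i < primes[j].sq)        // past sqrt(i): i is prime
                break;
            if (i % primes[j].p == 0) {  // divisor found: composite
                is_prime = 0;
                break;
            }
        }
        if (is_prime) {
            if (primes_count == capacity) {                 // grow the array on demand
                capacity *= 2;
                prime *tmp = realloc(primes, capacity * sizeof(*primes));
                if (tmp == NULL) {
                    free(primes);
                    return 1;
                }
                primes = tmp;
            }
            primes[primes_count].p = i;
            primes[primes_count++].sq = (unsigned long long)i * i;
            sum += i;
        }
    }
    printf("Sum of all %lu primes: %llu\n", primes_count, sum);
    free(primes);
}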
Wisblade
  • `sum` must be `long long int`, `long int` is not enough. – Jabberwocky Mar 23 '22 at 08:25
  • Right, but technically, I would have used a custom integer type with an unlimited number of digits, and preferably in C++, to benefit from the syntactic sugar of overloaded operators... The problem isn't really a challenge with such small numbers; it could be done with a simple precomputed LUT up to 2^32. Because if I set the base type as an `unsigned long long int` for limit and computation (no need for the signed type), then `sum` cannot be set as a "bigger type"... And the problem is still here. – Wisblade Mar 23 '22 at 09:04
  • Yes, for naive approaches like this one, `long long` (usually 64 bits) should keep you on the safe side for quite some big numbers. – Jabberwocky Mar 23 '22 at 10:05
  • Still tiny numbers, when speaking of primes... The largest known prime number is `2^82,589,933 − 1`, with "only" 24,862,048 digits... It's like bragging about a `long double` instead of `double` to store a "precise" value of Pi. We're outside usual computer ranges. – Wisblade Mar 23 '22 at 10:20
  • Thank you! I believe long is enough for <=20mil (since I got the right answer in my program with long), though I guess long long would've been a safer bet. – Sai Nishwanth Mar 23 '22 at 11:29

Your program can easily be improved by stopping the inner loop earlier:

  • when j exceeds sqrt(i).
  • when a divisor has been found.

Also note that type long might not be large enough for the sum on all architectures. long long is recommended.
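
As a quick sanity check (a sketch, not part of the fix itself): the correct sum, 142913828922, needs about 38 bits, so it cannot fit in a 32-bit long; printing the limits makes this visible:

#include <stdio.h>
#include <limits.h>

int main() {
    // 142913828922 needs about 38 bits; on platforms where long is
    // 32 bits (e.g. 64-bit Windows) LONG_MAX is only 2147483647.
    printf("long      : %2d bits, max %ld\n",
           (int)(sizeof(long) * CHAR_BIT), LONG_MAX);
    printf("long long : %2d bits, max %lld\n",
           (int)(sizeof(long long) * CHAR_BIT), LLONG_MAX);
    return 0;
}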

Here is a modified version:

#include <stdio.h>

int main() {
    long long sum = 5;  // Already counting 2 and 3 in my sum.
    long i = 5;     // Checking from 5 
    while (i <= 2000000) {
        int count = 0;
        for (int j = 3; j * j <= i; j += 2) {
            // Checking if i (starting from 5) is divisible by any
            if (i % j == 0) {   // odd j from 3 up to sqrt(i)
                count = 1;
                break;
            }
        }
        if (count == 0) {
            sum += i;
        }
        i += 2;
    }
    printf("%lld\n", sum);
}

This simple change drastically reduces the runtime! It is more than 1000 times faster for 2000000:

chqrlie> time ./primesum
142913828922

real    0m0.288s
user    0m0.264s
sys     0m0.004s

Note however that trial division is much less efficient than the classic sieve of Eratosthenes.

Here is a simplistic version:

#include <stdio.h>
#include <stdlib.h>

int main() {
    long max = 2000000;
    long long sum = 0;
    // Allocate an array of indicators initialized to 0
    unsigned char *composite = calloc(1, max + 1);
    // For all numbers up to sqrt(max)
    for (long i = 2; i * i <= max; i++) {
        // If the number is a prime
        if (composite[i] == 0) {
            // Set all multiples as composite. Multiples below the
            // square of i are skipped because they have already been
            // set as multiples of a smaller prime.
            for (long j = i * i; j <= max; j += i) {
                composite[j] = 1;
            }
        }
    }
    for (long i = 2; i <= max; i++) {
        if (composite[i] == 0)
            sum += i;
    }
    printf("%lld\n", sum);
    free(composite);
    return 0;
}

This code is another 20 times faster for 2000000:

chqrlie> time ./primesum-sieve
142913828922

real    0m0.014s
user    0m0.007s
sys     0m0.002s

The sieve approach can be further improved in many ways for larger boundaries.
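
For example, one common refinement is to sieve only the odd numbers, which halves both the memory and the marking work. Here is a minimal sketch of that variant (the index mapping composite[k] <-> 2*k + 3 is the only new ingredient; this is a sketch rather than a tuned implementation):

#include <stdio.h>
#include <stdlib.h>

int main() {
    long max = 2000000;
    long long sum = 2;                       // 2 is the only even prime
    // composite[k] describes the odd number 2*k + 3
    long n = (max - 1) / 2;                  // count of odd numbers in [3, max]
    unsigned char *composite = calloc(1, n);
    for (long i = 3; i * i <= max; i += 2) {
        if (composite[(i - 3) / 2] == 0) {
            // Start at i*i and step by 2*i to stay on odd multiples.
            for (long j = i * i; j <= max; j += 2 * i) {
                composite[(j - 3) / 2] = 1;
            }
        }
    }
    for (long k = 0; k < n; k++) {
        if (composite[k] == 0)
            sum += 2 * k + 3;
    }
    printf("%lld\n", sum);
    free(composite);
    return 0;
}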

chqrlie
  • Maybe faster than my own answer (maybe we need to test both on the same machine?), but memory usage won't be the same at all. Your last one is the classical "manual" implementation of the sieve, and its memory usage is what limits its relevance for big numbers (even when using only one bit per cell), despite its inherent speed. – Wisblade Mar 23 '22 at 10:27
  • @chqrlie no need to be sorry. Anyway, you've now updated your answer and there is actually added value now. – Jabberwocky Mar 23 '22 at 10:29
  • @Wisblade: As I wrote, numerous improvements are possible to further improve this simplistic sieve example, including using fewer bits per number, omitting multiples of small primes, using slices to reduce memory usage and improve cache efficiency... A vast subject indeed. – chqrlie Mar 23 '22 at 10:32
  • @Wisblade: your approach is fine for small slices, but for larger ones, a sieve is more efficient. Your code runs in 0.189s on my old Mac, 50% faster than the simple trial division loop, but still not as fast as the simplistic sieve. – chqrlie Mar 23 '22 at 10:39
  • As I said, it's more efficient until you run out of memory - and obviously, hitting swap would kill performance instantly. Indeed, optimizing prime number finding is a whole job in itself. Burning RAM to gain performance is a common optimization trick, but it needs to be said and understood by those who want to go this way. In particular, you need a `calloc`, so any memory overload would take ages for just the initialization. You need physical RAM, and even with 16 TERAbytes of RAM, going above 1 trillion will be quite difficult to achieve... – Wisblade Mar 23 '22 at 11:11
  • Appreciate the help! Can't believe I forgot to break when I encountered a non-prime. – Sai Nishwanth Mar 23 '22 at 11:31
  • @Wisblade: I agree with your remarks: finding a good threshold between memory consumption and extra computation is a necessary art. Yet I disagree with this: *you need a calloc, so any memory overload would take ages for just the initialization.* When one allocates a large block with `calloc()`, zero initialization comes for free because the OS maps cleared pages for security reasons, and good malloc implementations take advantage of this feature. – chqrlie Mar 23 '22 at 16:04
  • @chqrlie Not sure at all that all implementations of `calloc` do that, and I have big doubts that just resetting the page would fill the memory with zeros "for free". If I have a bit of time, I'll do a bench between `malloc` and `calloc` on medium-sized allocations (>100 MB). – Wisblade Mar 23 '22 at 20:42
  • @Wisblade: I cannot tell for the Microsoft library, but most open source `malloc` implementations do when they can assume the OS maps cleared pages, which is a requirement on POSIX systems. The clearing is not free: it is performed by the kernel prior to returning the mmapped pointer, so you get the penalty whether you use `malloc` or `calloc`. Advanced sieve implementations do not map large amounts of memory; they typically use small slices that fit in the L1 cache (32K or 64K) and keep a state for each prime up to the square root of the upper boundary to avoid all divisions. – chqrlie Mar 23 '22 at 21:28
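
To illustrate the sliced approach mentioned in the last comment, here is a minimal sketch of a segmented sieve (the 32 KB slice size is an arbitrary assumption, not a tuned value): it first sieves the primes up to sqrt(max) with a small ordinary sieve, then marks composites one cache-sized slice at a time and sums the survivors.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main() {
    const long max = 2000000;
    const long seg = 32768;                   // slice size, roughly L1-cache sized

    // 1. Ordinary sieve for the primes up to sqrt(max).
    long root = 0;
    while ((root + 1) * (root + 1) <= max)
        root++;
    unsigned char *small_composite = calloc(1, root + 1);
    long *small_primes = malloc((root + 1) * sizeof(*small_primes));
    long nsmall = 0;
    for (long i = 2; i <= root; i++) {
        if (!small_composite[i]) {
            small_primes[nsmall++] = i;
            for (long j = i * i; j <= root; j += i)
                small_composite[j] = 1;
        }
    }

    // 2. Sieve one slice [low, high] at a time, summing as we go.
    unsigned char *mark = malloc(seg);
    long long sum = 0;
    for (long low = 2; low <= max; low += seg) {
        long high = low + seg - 1;
        if (high > max)
            high = max;
        memset(mark, 0, seg);
        for (long k = 0; k < nsmall; k++) {
            long p = small_primes[k];
            if (p * p > high)
                break;
            long start = ((low + p - 1) / p) * p;   // first multiple of p in the slice
            if (start < p * p)
                start = p * p;                      // smaller multiples were marked by smaller primes
            for (long j = start; j <= high; j += p)
                mark[j - low] = 1;
        }
        for (long i = low; i <= high; i++) {
            if (!mark[i - low])
                sum += i;
        }
    }
    printf("%lld\n", sum);
    free(mark);
    free(small_primes);
    free(small_composite);
    return 0;
}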