Porting optimized Sieve of Eratosthenes from Python to C++

Question

Some time ago I used the (blazing fast) primesieve in python that I found here: Fastest way to list all primes below N

To be precise, this implementation:

def primes2(n):
    """ Input n>=6, Returns a list of primes, 2 <= p < n """
    n, correction = n-n%6+6, 2-(n%6>1)
    sieve = [True] * (n/3)
    for i in xrange(1,int(n**0.5)/3+1):
      if sieve[i]:
        k=3*i+1|1
        sieve[      k*k/3      ::2*k] = [False] * ((n/6-k*k/6-1)/k+1)
        sieve[k*(k-2*(i&1)+4)/3::2*k] = [False] * ((n/6-k*(k-2*(i&1)+4)/6-1)/k+1)
    return [2,3] + [3*i+1|1 for i in xrange(1,n/3-correction) if sieve[i]]

Now I can slightly grasp the idea of the optimizing by automaticly skipping multiples of 2, 3 and so on, but when it comes to porting this algorithm to C++ I get stuck (I have a good understanding of python and a reasonable/bad understanding of C++, but good enough for rock 'n roll).

What I currently have rolled myself is this (isqrt() is just a simple integer square root function):

template <class T>
void primesbelow(T N, std::vector<T> &primes) {
    T sievemax = (N-3 + (1-(N % 2))) / 2;
    T i;
    T sievemaxroot = isqrt(sievemax) + 1;

    boost::dynamic_bitset<> sieve(sievemax);
    sieve.set();

    primes.push_back(2);

    for (i = 0; i <= sievemaxroot; i++) {
        if (sieve[i]) {
            primes.push_back(2*i+3);
            for (T j = 3*i+3; j <= sievemax; j += 2*i+3) sieve[j] = 0; // filter multiples
        }
    }

    for (; i <= sievemax; i++) {
        if (sieve[i]) primes.push_back(2*i+3);
    }
}

This implementation is decent and automatically skips multiples of 2, but if I could port the Python implementation I think it could be much faster (50%-30% or so).

To compare the results (in the hope this question will be successfully answered), the current execution time with N=100000000, g++ -O3 on a Q6600 Ubuntu 10.10 is 1230ms.

Now I would love some help with either understanding what the above Python implementation does or that you would port it for me (not as helpful though).

EDIT

Some extra information about what I find difficult.

I have trouble with the techniques used like the correction variable and in general how it comes together. A link to a site explaining different Eratosthenes optimizations (apart from the simple sites that say "well you just skip multiples of 2, 3 and 5" and then get slam you with a 1000 line C file) would be awesome.

I don't think I would have issues with a 100% direct and literal port, but since after all this is for learning that would be utterly useless.

EDIT

After looking at the code in the original numpy version, it actually is pretty easy to implement and with some thinking not too hard to understand. This is the C++ version I came up with. I'm posting it here in full version to help further readers in case they need a pretty efficient primesieve that is not two million lines of code. This primesieve does all primes under 100000000 in about 415 ms on the same machine as above. That's a 3x speedup, better then I expected!

#include <vector>
#include <boost/dynamic_bitset.hpp>

// http://vault.embedded.com/98/9802fe2.htm - integer square root
unsigned short isqrt(unsigned long a) {
    unsigned long rem = 0;
    unsigned long root = 0;

    for (short i = 0; i < 16; i++) {
        root <<= 1;
        rem = ((rem << 2) + (a >> 30));
        a <<= 2;
        root++;

        if (root <= rem) {
            rem -= root;
            root++;
        } else root--;

    }

    return static_cast<unsigned short> (root >> 1);
}

// https://stackoverflow.com/questions/2068372/fastest-way-to-list-all-primes-below-n-in-python/3035188#3035188
// https://stackoverflow.com/questions/5293238/porting-optimized-sieve-of-eratosthenes-from-python-to-c/5293492
template <class T>
void primesbelow(T N, std::vector<T> &primes) {
    T i, j, k, l, sievemax, sievemaxroot;

    sievemax = N/3;
    if ((N % 6) == 2) sievemax++;

    sievemaxroot = isqrt(N)/3;

    boost::dynamic_bitset<> sieve(sievemax);
    sieve.set();

    primes.push_back(2);
    primes.push_back(3);

    for (i = 1; i <= sievemaxroot; i++) {
        if (sieve[i]) {
            k = (3*i + 1) | 1;
            l = (4*k-2*k*(i&1)) / 3;

            for (j = k*k/3; j < sievemax; j += 2*k) {
                sieve[j] = 0;
                sieve[j+l] = 0;
            }

            primes.push_back(k);
        }
    }

    for (i = sievemaxroot + 1; i < sievemax; i++) {
        if (sieve[i]) primes.push_back((3*i+1)|1);
    }
}

The Python code is not pushing single elements onto the array; it is initializing it with a block of elements and then modifying them in-place. That will be much faster in C++ as well. Start with a `dynamic_bitset` like you have, just fill that in, and then write out the sieve in a single loop. That will probably make your version run much faster, even without trying to copy the techniques from the Python version. — Jeremiah Willcock, Mar 13 '11 at 23:34
Which part of the Python implementation are you having problems understanding? Please add this info to the question. Also, people really shouldn't be answering with ports for you, since like you said, it won't help as much with your understanding. — Merlyn Morgan-Graham, Mar 13 '11 at 23:36
@Jeremiah Willcock: Not correct. During the sieving itself it is modifying in-place, but in the end it IS pushing single elements onto the array (list) `return [2,3] + [3*i+1|1 for i in xrange(1,n/3-correction) if sieve[i]]`. And that's pretty much what I'm doing too. The first loop is sieving in place and already pushing the primes that we know and the second loop is to push the primes that come after the root of the sievelimit. — orlp, Mar 13 '11 at 23:38
@Merlyn Morgan-Graham: I have trouble with the techniques used like the correction variable and in general how it comes together. A link to a site explaining different Eratosthenes optimizations (apart from the simple "well you just skip multiples of 2, 3 and 5" and then get slammed with a 1000 line C file) would be awesome. (EDIT: just read "please add this to the question", so doing it now.) — orlp, Mar 13 '11 at 23:40
@nightcracker: I do not think the updates to `sieve` itself change its length, although I can't tell for sure. They do add elements while returning, as you demonstrate, however. — Jeremiah Willcock, Mar 14 '11 at 00:04
@nightcracker: The NumPy version of the sieve code (that the one you posted is a variant of) does in-place updates of sieve elements without changing the size; it just overwrites parts of the sieve with `False`. — Jeremiah Willcock, Mar 14 '11 at 00:06
@Jeremiah Willcock: After looking at the NumPy code it does look friendlier, so I'll try if I can understand it. — orlp, Mar 14 '11 at 00:09
@nightcracker: feel free ta ask any questions you still have, i may remember how i did it and why. — Robert William Hanks, Mar 15 '11 at 22:29
@Robert William Hanks: this doesn't seem to be the right communication device, mind if we could email/MSN/IRC/whatever. Ofcourse I will add corrections to my question that arise from this conversation. — orlp, Mar 15 '11 at 22:37
@nightcracker: start the first for from 1, you dont need sieve[0] = 0; if in the last for you start from 1 too. but plz do a test to confirm. — Robert William Hanks, Mar 17 '11 at 13:36

Jeremiah Willcock · Accepted Answer · 2011-03-14T00:37:13.173

I'll try to explain as much as I can. The sieve array has an unusual indexing scheme; it stores a bit for each number that is congruent to 1 or 5 mod 6. Thus, a number 6*k + 1 will be stored in position 2*k and k*6 + 5 will be stored in position 2*k + 1. The 3*i+1|1 operation is the inverse of that: it takes numbers of the form 2*n and converts them into 6*n + 1, and takes 2*n + 1 and converts it into 6*n + 5 (the +1|1 thing converts 0 to 1 and 3 to 5). The main loop iterates k through all numbers with that property, starting with 5 (when i is 1); i is the corresponding index into sieve for the number k. The first slice update to sieve then clears all bits in the sieve with indexes of the form k*k/3 + 2*m*k (for m a natural number); the corresponding numbers for those indexes start at k^2 and increase by 6*k at each step. The second slice update starts at index k*(k-2*(i&1)+4)/3 (number k * (k+4) for k congruent to 1 mod 6 and k * (k+2) otherwise) and similarly increases the number by 6*k at each step.

Here's another attempt at an explanation: let candidates be the set of all numbers that are at least 5 and are congruent to either 1 or 5 mod 6. If you multiply two elements in that set, you get another element in the set. Let succ(k) for some k in candidates be the next element (in numerical order) in candidates that is larger than k. In that case, the inner loop of the sieve is basically (using normal indexing for sieve):

for k in candidates:
  for (l = k; ; l += 6) sieve[k * l] = False
  for (l = succ(k); ; l += 6) sieve[k * l] = False

Because of the limitations on which elements are stored in sieve, that is the same as:

for k in candidates:
  for l in candidates where l >= k:
    sieve[k * l] = False

which will remove all multiples of k in candidates (other than k itself) from the sieve at some point (either when the current k was used as l earlier or when it is used as k now).

Alright, that's basicly what I have done for my sieve to exclude 2, but then a bit more complicated. I'll read this through a few times and see if I can turn this into a C++ implementation. In every case +1 for the effort. — orlp, Mar 14 '11 at 00:23
@nightcracker: I got the index->number computations done now. The code is tricky, but that's mostly because of the indexing on `sieve`. — Jeremiah Willcock, Mar 14 '11 at 00:27
@nightcracker: See my second part and if that makes more sense. — Jeremiah Willcock, Mar 14 '11 at 00:38

score 1 · Answer 2 · answered Mar 22 '11 at 05:40

Piggy-Backing onto Howard Hinnant's response, Howard, you don't have to test numbers in the set of all natural numbers not divisible by 2, 3 or 5 for primality, per se. You need simply multiply each number in the array (except 1, which self-eliminates) times itself and every subsequent number in the array. These overlapping products will give you all the non-primes in the array up to whatever point you extend the deterministic-multiplicative process. Thus the first non-prime in the array will be 7 squared, or 49. The 2nd, 7 times 11, or 77, etc. A full explanation here: http://www.primesdemystified.com

score 0 · Answer 3 · answered Mar 14 '11 at 00:34

As an aside, you can "approximate" prime numbers. Call the approximate prime P. Here are a few formulas:

P = 2*k+1 // not divisible by 2

P = 6*k + {1, 5} // not divisible 2, 3

P = 30*k + {1, 7, 11, 13, 17, 19, 23, 29} // not divisble by 2, 3, 5

The properties of the set of numbers found by these formulas is that P may not be prime, however all primes are in the set P. I.e. if you only test numbers in the set P for prime, you won't miss any.

You can reformulate these formulas to:

P = X*k + {-i, -j, -k, k, j, i}

if that is more convenient for you.

Here is some code that uses this technique with a formula for P not divisible by 2, 3, 5, 7.

This link may represent the extent to which this technique can be practically leveraged.

Porting optimized Sieve of Eratosthenes from Python to C++

3 Answers3

Linked