Optimizing Sieve of Eratosthenes, Adding Wheel to Bool Array

Question

I know there's a few posts about how to implement wheels, but I'm really struggling to see how to implement one with my current approach to the sieve.

I created my own bit array in C, with the following implementation:

#define setBit1(Array, i) (Array[i/INT_BITS] |= (1 << i % INT_BITS))
#define getBit(Array, i) ((Array[i/INT_BITS] & (1 << i % INT_BITS)) ? 1 : 0)
#define setBit0(Array, i) (Array[i/INT_BITS] &= ~(1 << i % INT_BITS))

int * createBitArray(unsigned long long size) {

    // find bytes required, round down to nearest whole byte
    unsigned long long bytesRequired = size / BITS_PERBYTE;

    // round up to highest multiple of 4 or 8 bytes (one int) 
    bytesRequired = (sizeof(int) * (bytesRequired / sizeof(int) + 
        ((size % BITS_PERBYTE * sizeof(int) == 0) ? 0 : 1)));

    // allocate array of "bits", round number of ints required up 
    return (int *)malloc((bytesRequired)); 

}

I've done a few tests in C using the clock(), and I've found that for large array sizes of more than 1,000,000, the bit array, even with its bit manipulations, is at least 200% faster than an int array. It also uses 1/32 of the memory.

#define indexToNum(n) (2*n + 1) 
#define numToIndex(n) ((n - 1) / 2)
typedef unsigned long long LONG; 

// populates prime array through Sieve of Eratosthenes, taking custom 
// odd keyed bit array, and the raw array length, as arguments
void findPrimes(int * primes, LONG arrLength) {

    long sqrtArrLength = (long)((sqrt((2 * arrLength) + 1) - 1) / 2); 
    long maxMult = 0; 
    long integerFromIndex = 0; 

    for (int i = 1; i <= sqrtArrLength; i++) {

         if (!getBit(primes, i)) {
             integerFromIndex = indexToNum(i); 
             maxMult = (indexToNum(arrLength)) / integerFromIndex; 

             for (int j = integerFromIndex; j <= maxMult; j+= 2) {
                 setBit1(primes, numToIndex((integerFromIndex*j)));     
            }

        }   
    }
}

I've been populating the bit array with the index, i, representing a number obtained through (2i + 1). This has the benefit of reducing any time spent iterating over even numbers, and once again reducing the necessary memory of the array by half. 2 is manually added to the primes after. This comes with a consequence of the time spent translating between index and number, but with my tests, for more than 1,000 primes, this approach is faster.

I'm stumped on how I could further optimize; I've reduced array size, I'm only testing to sqrt(n), I'm starting the "sieving" of primes upwards from p * p, I've eliminated all evens, and it's still taking me about 60 seconds for the first 100,000,000 primes in C.

As far as I'm aware, the "wheel" method requires that the actual integers of numbers be stored in the index. I'm really stuck on implementing it with my current bit array.

Tomas Oliveira y Silva (I may have spelled that wrong) wrote the fastest sieve available, and described it at [http://sweet.ua.pt/tos/software/prime_sieve.html](http://sweet.ua.pt/tos/software/prime_sieve.html). — user448810, Feb 23 '18 at 15:51
The array is initialized to zero after the malloc implicitly. Printing all of the bits one by one confirms this. Is this unexpected...? I’m using GCC C99. — , Feb 23 '18 at 20:56
https://stackoverflow.com/questions/8029584/why-does-malloc-initialize-the-values-to-0-in-gcc — Matt Timmermans, Feb 24 '18 at 13:33
before I read all of your question, cf. [this SoE odds-based code on ideone](https://ideone.com/fapob) which gets 75M primes in 5.67s, and 100M primes in under 8 seconds. And it's not even segmented. It uses `vector`, in C++, which is automatically bit-packed. — Will Ness, Feb 25 '18 at 18:14
but more importantly, it does the indexing using only the additions, with no repeated multiplications. also, no `long`s. — Will Ness, Feb 25 '18 at 18:28
@PhamTrung see if [this attempt at explanation](https://stackoverflow.com/a/30563958/849891) helps. — Will Ness, Feb 25 '18 at 20:18

score 1 · Answer 1 · edited Feb 25 '18 at 20:43

When I fix your implementation and run it on my Macbook Pro, it takes 17 seconds to mark off all composites <= 2^31, which is pretty fast.

There are some other things you can try, though. Using a wheel may cut your time in half.

Euler's sieve is linear time if it's carefully implemented, but it requires an int array instead of a bit array.

The Atkin sieve takes linear time and is very practical: https://en.wikipedia.org/wiki/Sieve_of_Atkin

And finally my own (that just means I haven't seen it anywhere else, but I didn't really look either) super fun sieve also takes linear time and finds all the primes <= 2^31 in 6.5 seconds. Thanks for giving me an excuse to post it:

#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#include <memory.h>
#include <math.h>

#define SETBIT(mem,num) ( ((uint8_t *)mem)[(num)>>4] |= ((uint8_t)1)<<(((num)>>1)&7) )

int main(int argc, char *argv[])
{
    //we'll find all primes <= this
    uint32_t MAXTEST = (1L<<31)-1;
    //which will be less than this many
    size_t MAXPRIMES = 110000000;

    //We'll find this many primes with the sieve of Eratosthenes.
    //After that, we'll switch to a linear time algorithm
    size_t NUMPREMARK = 48;

    //Allocate a bit array for all odd numbers <= MAXTEST
    size_t NBYTES = (MAXTEST>>4)+1;
    uint8_t *mem = malloc(NBYTES);
    memset(mem, 0, NBYTES);

    //We'll store the primes in here
    unsigned *primes = (unsigned *)malloc(MAXPRIMES*sizeof(unsigned));
    size_t nprimes = 0;


    clock_t start_t = clock();

    //first premark with sieve or Eratosthenes
    primes[nprimes++]=2;
    for (uint32_t test=3; nprimes<NUMPREMARK; test+=2)
    {
        if ( mem[test>>4] & ((uint8_t)1<<((test>>1)&7)) )
        {
            continue;
        }
        primes[nprimes++]=test;
        uint32_t inc=test<<1;
        for(uint32_t prod=test*test; prod<=MAXTEST; prod+=inc)
        {
            SETBIT(mem,prod);
        }
    }


    //Iterate through all products of the remaining primes and mark them off, 
    //in linear time. Products are visited in lexical order of their  
    //prime factorizations, with factors in descending order.

    //stacks containing the current prime indexes and partial products for 
    //prefixes of the current product
    size_t stksize=0;
    size_t indexes[64];
    uint32_t products[64];

    for (uint32_t test=primes[NUMPREMARK-1]+2; test<=MAXTEST; test+=2)
    {
        if ( mem[test>>4] & ((uint8_t)1<<((test>>1)&7)) )
        {
            continue;
        }

        //found a prime!  iterate through all products that start with this one
        //They can only contain the same or smaller primes

        //current product
        uint32_t curprod = (uint32_t)test;
        indexes[0] = nprimes;
        products[0] = curprod;
        stksize = 1;

        //remember the found prime (first time through, nprimes == NUMPREMARK)
        primes[nprimes++] = curprod;

        //when we extend the product, we add the first non-premarked prime
        uint32_t extensionPrime = primes[NUMPREMARK];
        //which we can only do if the current product is at most this big
        uint32_t extensionMax = MAXTEST/primes[NUMPREMARK];

        while (curprod <= extensionMax)
        {
            //extend the product with the smallest non-premarked prime
            indexes[stksize]=NUMPREMARK;
            products[stksize++]=(curprod*=extensionPrime);
            SETBIT(mem, curprod);
        }

        for (;;)
        {
            //Can't extend current product.
            //Pop the stacks until we get to a factor we can increase while keeping
            //the factors in descending order and keeping the product small enough
            if (--stksize <= 0)
            {
                //didn't find one
                break;
            }
            if (indexes[stksize]>=indexes[stksize-1])
            {
                //can't increase this factor without breaking descending order
                continue;
            }

            uint64_t testprod=products[stksize-1];
            testprod*=primes[++(indexes[stksize])];
            if (testprod>MAXTEST)
            {
                //can't increase this factor without exceeding our array size
                continue;
            }
            //yay! - we can increment here to the next composite product
            curprod=(uint32_t)testprod;
            products[stksize++] = curprod;
            SETBIT(mem, curprod);

            while (curprod <= extensionMax)
            {
                //extend the product with the smallest non-premarked prime
                indexes[stksize]=NUMPREMARK;
                products[stksize++]=(curprod*=extensionPrime);
                SETBIT(mem, curprod);
            }
        }
    }
    clock_t end_t = clock();
    printf("Found %ld primes\n", nprimes);
    free(mem);
    free(primes);
    printf("Time: %f\n", (double)(end_t - start_t) / CLOCKS_PER_SEC);
}

Note that my sieve starts with a sieve or Eratosthenes that is a little bit better optimized than yours. The major difference is that we only allocate bits for odd numbers in the bit mask array. The speed difference in that part is not significant.

"Atkin ... is very practical" are you speaking from personal experience? I've searched the linked WP article for the word "practical" and it seems to be saying the exact opposite, with great conviction and much detailed argumentation. :) I also doubt that "Euler's sieve ... requires an int array instead of a bit array". — Will Ness, Feb 25 '18 at 18:36
@WillNess by 'practical', I mean that the implementation doesn't have to be terribly complicated (although there are lots of complicated optimizations you can do), you can use a bit mask for the sieve, it's linear time, and it supports segmentation. W.r.t. Euler's sieve, in order to achieve linear time, you have to be able to cross composites off your list in constant time, and iterate through the remaining ones in linear time. The requires a data structure that is more than a bit mask. An int array is the most compact "easy" basis I know of for that. — Matt Timmermans, Feb 25 '18 at 19:50
we only need the primes up to the sqrt of the top limit, for the Euler sieve. so the sieve itself can be bit array alright, and an additional sqrt-sized int array would have a minimal impact. As for supporting segmentation, I don't think so. we need the previous step's coprimes in full, for each next step. I'd love to be *shown* to be wrong here. As for Atkins, it indeed looks terribly complicated, and there are claims, seem to be well-founded, that even then it loses to the 2,3,5,...11 (or ...17) wheeled SoE. — Will Ness, Feb 25 '18 at 20:09
ah, I think indeed I was in error re bit array. all the coprimes are ints, and if kept as non-zeroed out entries in bit array, accessing them would introduce an additional complexity factor, yes. (additionally, we must use repeated multiplications, this also can't be good for speed). the segmentation question still stands. — Will Ness, Feb 25 '18 at 20:12
My comment about supporting segmentation was for Atkin's sieve. You are right that optimizations can make SoE faster. When you start trading optimizations you can go back and forth, and I think these days the fastest actual implementation is an Eratosthenes sieve: https://primesieve.org/ The difference between O(N) and O(N log log N) is not enough to dominate optimizations — Matt Timmermans, Feb 25 '18 at 20:30

MrSmith42 · Answer 2 · 2018-02-23T12:02:13.007

0

It will always be slower, because of the bit-operation overhead.

But you can try to optimize that.

setBit1(Array, i) can be improved by using a constant array of all preshiftet bits ( I call it ONE_BIT )

NEW:
#define setBit1(Array, i) (Array[i/INT_BITS] |= ONE_BIT[ i % INT_BITS])

same for setBit0(Array, i)
NEW:
#define setBit0(Array, i) (Array[i/INT_BITS] &= ALL_BUT_ONE_BIT[ i % INT_BITS])

Also INT_BITS is most likely a power of 2, so you can replace
i % INT_BITS
by
i & (INT_BITS-1) // of cause you should store INT_BITS-1 in a constant and use that

If and how mutch this can speed up your code, has to be checked by profiling each change.

edited Feb 23 '18 at 12:02

answered Feb 23 '18 at 11:42

MrSmith42

9,961
6
38
49

1

Wouldn't the compiler optimize the modulo operator to just that? https://stackoverflow.com/questions/20393373/performance-wise-how-fast-are-bitwise-operator-vs-normal-modulus I took your advice for the creation of constant arrays, and I'm happy to say that it has slightly increased the speed as measured by clock time by 9.7% for 10,00,000 million primes. For 1,000,000 primes, however, it's not faster. Did this by collecting 30 and 150 data points in each condition for 10,000,000 and 1,000,000 primes respectively, and averaging. I'm taking this a bit too seriously I think. – Feb 24 '18 at 00:38
@Good Rice: Depends a bit on your compiler. Also INT_BITS needs to be a constant otherwise the compiler cannot optimize it. Also this optimization of % is only valid for **unsigned** values. As I mentioned, try it and profile the result to check if there is any improvement. – MrSmith42 Feb 25 '18 at 17:19

Optimizing Sieve of Eratosthenes, Adding Wheel to Bool Array

2 Answers2