Parallelizing Sieve of Eratosthenes in Java

Question

I am trying to make a parallel implementation of the Sieve of Eratosthenes. I made a boolean list which gets filled up with true's for the given size. Whenever a prime is found, all multiples of that prime are marked false in the boolean list.

The way I am trying to make this algorithm parallel is by firing up a new thread while still filtering the initial prime number. For example, the algorithm starts with prime = 2. In the for loop for filter, when prime * prime, I make another for loop in which every number in between the prime (2) and the prime * prime (4) is checked. If that index in the boolean list is still true, I fire up another thread to filter that prime number.

The nested for loop creates more and more overhead as the prime numbers to filter are progressing, so I limited this to only do this nested for loop when the prime number < 100. I am assuming that by that time, the 100 million numbers will be somewhat filtered. The problem here is that this way, the primes to be filter stay just under 9500 primes, while the algorithm stops at 10000 primes (prime * prime < size(100m)). I also think this is not at all the correct way to go about it. I have searched a lot online, but didn't manage to find any examples of parallel Java implementations of the sieve.

My code looks like this:

Main class:

public class Main {
    private static ListenableQueue<Integer> queue = new ListenableQueue<>(new LinkedList<>());
    private static ArrayList<Integer> primes = new ArrayList<>();
    private static boolean serialList[];
    private static ArrayList<Integer> serialPrimes = new ArrayList<>();
    private static ExecutorService exec = Executors.newFixedThreadPool(10);
    private static int size = 100000000;
    private static boolean list[] = new boolean[size];
    private static int lastPrime = 2;

    public static void main(String[] args) {
        Arrays.fill(list, true);

        parallel();
    }

    public static void parallel() {
        Long startTime = System.nanoTime();
        int firstPrime = 2;

        exec.submit(new Runner(size, list, firstPrime));
    }

    public static void parallelSieve(int size, boolean[] list, int prime) {
        int queuePrimes = 0;
        for (int i = prime; i * prime <= size; i++) {
            try {
                list[i * prime] = false;
                if (prime < 100) {
                    if (i == prime * prime && queuePrimes <= 1) {
                        for (int j = prime + 1; j < i; j++) {
                            if (list[j] && j % prime != 0 && j > lastPrime) {
                                lastPrime = j;
                                startNewThread(j);
                                queuePrimes++;
                            }
                        }
                    }
                }
            } catch (ArrayIndexOutOfBoundsException ignored) { }
        }
    }

    private static void startNewThread(int newPrime) {
        if ((newPrime * newPrime) < size) {
            exec.submit(new Runner(size, list, newPrime));
        }
        else {
            exec.shutdown();
            for (int i = 2; i < list.length; i++) {
                if (list[i]) {
                    primes.add(i);
                }
            }
        }
    }
}

Runner class:

public class Runner implements Runnable {
    private int arraySize;
    private boolean[] list;
    private int k;

    public Runner(int arraySize, boolean[] list, int k) {
        this.arraySize = arraySize;
        this.list = list;
        this.k = k;
    }

    @Override
    public void run() {
        Main.parallelSieve(arraySize, list, k);
    }

}

I feel like there is a much simpler way to solve this... Do you guys have any suggestions as to how I can make this parallelization working and maybe a bit simpler?

For more efficient solution, see https://stackoverflow.com/a/58805513/5221149 — Andreas, Nov 27 '19 at 16:49
Can you estimate how many threads you are running concurrently? — David Soroko, Nov 27 '19 at 16:56
@DavidSoroko I want to run the parallel algorithm on 2 to 8 threads and benchmark those times to see which is the fastest — SD33N, Nov 29 '19 at 08:38

DarthGizka · Accepted Answer · 2019-12-21T12:05:16.020

Creating a performant concurrent implementation of an algorithm like the Sieve of Eratosthenes is somewhat more difficult than creating a performant single-threaded implementation. The reason is that you need to find a way to partition the work in a way that minimises communication and interference between the parallel worker threads.

If you achieve complete isolation then you can hope for a speed increase approaching the number of logical processors available, or about one order of magnitude on a typical modern PC. By contrast, using a decent single-threaded implementation of the sieve will give you a speedup of at least two to three orders of magnitude. One simple cop-out would be to simply load the data from a file when needed, or to shell out to a decent prime-sieving program like Kim Walisch's PrimeSieve.

Even if we only want to look at the parallelisation problem, it is still necessary to have some insight in the algorithm itself and into to machine it runs on.

The most important aspect is that modern computers have deep cache hierarchies where only the L1 cache - typically 32 KB - is accessible at full speed and all other memory accesses incur significant penalties. Translated to the Sieve of Eratosthenes this means that you need to sieve your target range one 32 KB window at a time, instead of striding each prime over many megabytes. The small primes up to the square root of the target range end must be sieved before the parallel dance begins, but then each segment or window can be sieved independently.

Sieving a given window or segment necessitates determining the start offsets for the small primes that you want to sieve by, which means at least one modulo divison per small prime per window and division is a an extremely slow operation. However, if you sieve consecutive segments instead of arbitrary windows placed anywhere in the range then you can keep the end offsets for each prime in a vector and use them as start offsets for the next segment, thus eliminating the expensive computation of the start offset.

Thus, one promising parallelisation strategy for the Sieve of Eratosthenes would be to give each worker thread a contiguous group of 32 KB blocks to sieve, so that the start offset calculation needs to happen only once per worker. This way there cannot be memory access contention between workers, since each has its own independent subrange of the target range.

However, before you begin to parallelise - i.e., make your code more complex - you should first slim it down and reduce the work to be done to the absolute essentials. For example, take a look at this fragment from your code:

for (int i = prime; i * prime <= size; i++)
   list[i * prime] = false;

Instead of recomputing loop bounds in every iteration and indexing with a multiplication, check the loop variable against a precomputed, loop-invariant value and reduce the multiplication to iterated addition:

for (int o = prime * prime; o <= size; o += prime)
   list[o] = false;

There are two simple sieve-specific optimisations that can give significant speed bosts.

1) Leave the even numbers out of your sieve and pull the prime 2 out of thin air when needed. Bingo, you just doubled your performance.

2) Instead of sieving each segment by the small odd primes 3, 5, 7 and so on, blast a precomputed pattern over the segment (or even the whole range). This saves time because these small primes make many, many steps in each segment and account for the lion's share of sieving time.

There are more possible optimisations including a couple more low-hanging fruit but either the returns are diminishing or the effort curve rises steeply. Try searching Code Review for 'sieve'. Also, don't forget that you're fighting a Java compiler in addition to the algorithmic problem and the machine architecture, i.e. things like array bounds checking which your compiler may or may not be able to hoist out of loops.

To give you a ballpark figure: a single-threaded segmented odds-only sieve with precomputed patterns can sieve the whole 32-bit range in 2 to 4 seconds in C#, depending on how much TLC you apply in addition to things mentioned above. Your much smaller problem of primes up to 100000000 (1e8) is solved in less than 100 ms on my aging notebook.

Here's some code that shows how windowed sieving works. For clarity I left off all optimisations like odds-only representation or wheel-3 stepping when reading out the primes and so on. It's C# but that should be similar enough to Java to be readable.

Note: I called the sieve array eliminated because a true value indicates a crossed-off number (saves filling the array with all true at the beginning and it is more logical anyway).

static List<uint> small_primes_between (uint m, uint n)
{
    m = Math.Max(m, 2);

    if (m > n)
        return new List<uint>();

    Trace.Assert(n - m < int.MaxValue);

    uint sieve_bits = n - m + 1;
    var eliminated = new bool[sieve_bits];

    foreach (uint prime in small_primes_up_to((uint)Math.Sqrt(n)))
    {
        uint start = prime * prime, stride = prime;

        if (start >= m)
            start -= m;
        else
            start = (stride - 1) - (m - start - 1) % stride;

        for (uint j = start; j < sieve_bits; j += stride)
            eliminated[j] = true;
    }

    return remaining_numbers(eliminated, m);
}

//---------------------------------------------------------------------------------------------

static List<uint> remaining_numbers (bool[] eliminated, uint sieve_base)
{
    var result = new List<uint>();

    for (uint i = 0, e = (uint)eliminated.Length; i < e; ++i)
        if (!eliminated[i])
            result.Add(sieve_base + i);

    return result;
}

//---------------------------------------------------------------------------------------------

static List<uint> small_primes_up_to (uint n)
{
    Trace.Assert(n < int.MaxValue);    // size_t is int32_t in .Net (!)

    var eliminated = new bool[n + 1];  // +1 because indexed by numbers

    eliminated[0] = true;
    eliminated[1] = true;

    for (uint i = 2, sqrt_n = (uint)Math.Sqrt(n); i <= sqrt_n; ++i)
        if (!eliminated[i])
            for (uint j = i * i; j <= n; j += i)
                eliminated[j] = true;

    return remaining_numbers(eliminated, 0);
}

(Given the question is tagged [tag:java]: what is `var eliminated = new bool[]`? (and how are `bool[]`s stored/indexed for, supposedly, C#?)) — greybeard, Dec 21 '19 at 13:14
@ Greybead: C# `bool[]` is the same as Java `boolean[]`, C# `List<>` ist like Java's `ArrayList<>`, and for `uint` you can use plain `int`. `var` in C# is the same as `auto` in C++ but last time I looked Java didn't have such a feature. It declares a variable with the static type of the initializing expression. Anyway, I wanted to show by example how windowed sieving works, and I had the choice between posting a trusted + verified piece of C# code (which should look very familiar to Java programmers) or cobbling something together in Java without the ability to verify it before posting — DarthGizka, Dec 21 '19 at 15:17

Parallelizing Sieve of Eratosthenes in Java

1 Answers1

Linked