
So I created the method below to find all the primes up to a certain number. Any suggestions on how to speed it up?

I call it like so:

interval = (value + NOOFTHREADS - 1) / NOOFTHREADS;
int max = interval * NOOFTHREADS;

tickets = new List<int>(NOOFTHREADS);
for (int i = 1; i <= NOOFTHREADS; i++)
{
    tickets.Add(i * (max / NOOFTHREADS));
}

Enumerable.Range(1, NOOFTHREADS)
.AsParallel()
.ForAll(_ => findPrimes());

With some global variables:

static List<int> vals = new List<int>();
static List<int> tickets;
static int interval;

And the method:

public static void findPrimes()
{
    int myTicket;
    lock (tickets)
    {
        myTicket = tickets.Last();
        tickets.RemoveAt(tickets.Count - 1);
    }
    var max = myTicket;
    int min = max - interval + 1;
    int num;

    var maxSquareRoot = Math.Sqrt(max);
    var eliminated = new System.Collections.BitArray(max + 1);
    eliminated[0] = true;
    eliminated[1] = true;
    for (int i = 2; i < max / 2; i++)
    {
        if (!eliminated[i])
        {
            if (i < maxSquareRoot)
            {
                num = ((min + i - 1) / i) * i;

                if (num == i)
                    num = num + i;

                for (int j = num; j <= max; j += i)
                    eliminated[j] = true;
            }
        }
    }
    for (int b = min; b < max; b++)
    {
        if (!eliminated[b])
            lock (vals)
                vals.Add(b);
    }
}
windowsgm
  • If your edit is meant to indicate that you want further speed-up, that should be a new question. However, the normal steps are 1. eliminate even numbers from the sieve, that's a factor of about 2 and simple; 2. eliminate also multiples of 3, that's a further factor of about 1.5 and still relatively easy; 3. eliminate multiples of further small primes, increasingly difficult with decreasing return - or use a library Seriously Tuned By Someone Else, e.g. Kim Walisch's primeSieve (C++) or Daniel Bernstein's primegen (C). – Daniel Fischer Oct 16 '12 at 12:02

4 Answers


The Sieve of Eratosthenes can be parallelised rather easily, you just have to split it into separate chunks and sieve each chunk on its own. You have started with that splitting, but haven't gone far enough to obtain good results. Let's see what problems there are in findPrimes()

var max = myTicket;
int min = max - interval + 1;
int num;

var maxSquareRoot = Math.Sqrt(max);
var eliminated = new System.Collections.BitArray(max + 1);

You create a new BitArray for each thread covering all numbers from 0 to max. For the thread sieving the first chunk, that's fine, but for later threads you're allocating much more memory than needed. With a high upper bound and many threads, that's a problem in itself: you're allocating roughly (NOOFTHREADS + 1) * limit / 2 bits where only about limit bits are needed. With fewer threads and/or lower limits, you're still hurting locality and will take more cache misses.

eliminated[0] = true;
eliminated[1] = true;
for (int i = 2; i < max / 2; i++)

You should stop the outer loop once i > maxSquareRoot. Beyond that point the loop body does nothing productive; it just performs one read and one or two checks. Each such iteration is quick, but doing it for every i from √max to max/2 adds up when max is, say, 10^11. Doing that for the last chunk alone can take longer than a single-threaded single-chunk sieve.

{
    if (!eliminated[i])

eliminated[i] can only ever be true for i >= min (or i < 2), a situation you will only encounter in the first chunk for i <= maxSquareRoot (unless the limit is ridiculously low). So in the other chunks you're also eliminating the multiples of 4, 6, 8, 9, 10, 12, 14, ...: a lot of wasted work.

    {
        if (i < maxSquareRoot)

In the case that maxSquareRoot happens to be a prime, you are not eliminating its square, the comparison ought to be <=.

        {
            num = ((min + i - 1) / i) * i;

            if (num == i)
                num = num + i;

            for (int j = num; j <= max; j += i)
                eliminated[j] = true;
        }
    }
}
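
Putting the two loop-bound fixes together, the outer loop of a single-chunk sieve could look like this (an illustrative sketch, not the threaded method itself):

```csharp
using System;

// Single-chunk sketch with both fixes applied: the outer loop stops at
// sqrt(max) ("<=" so that a prime square root still has its square
// eliminated), and crossing off starts at i * i.
int max = 100;
int maxSquareRoot = (int)Math.Sqrt(max);
var eliminated = new System.Collections.BitArray(max + 1);
eliminated[0] = true;
eliminated[1] = true;
for (int i = 2; i <= maxSquareRoot; i++)   // was: i < max / 2
{
    if (!eliminated[i])
    {
        for (int j = i * i; j <= max; j += i)
            eliminated[j] = true;
    }
}
```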

Now, when the sieving is done, you step through the chunk of the BitArray

for (int b = min; b < max; b++)
{
    if (!eliminated[b])
        lock(vals)
            vals.Add(b);
}

and whenever you find a prime, you lock the list vals and add the prime to it. If two or more threads ever finish sieving at about the same time, they will step on each other's toes there, and the locking and waiting will slow the process down further.

To reduce space usage, each thread should create a list of the primes up to maxSquareRoot and use that to eliminate the composites in its chunk, so that the BitArray needs only max - min + 1 bits. Each thread creating its own list duplicates some work, but since the upper limit here is small, it's not much extra work. I don't know how concurrent read access is handled; if it doesn't add synchronisation overhead, you could also share a single list across all threads, but I doubt that would gain anything.

The code would look roughly thus:

List<int> sievePrimes = simpleSieve(maxSquareRoot);
// simpleSieve is a standard SoE returning a list of primes not exceeding its argument
var sieve = new System.Collections.BitArray(max - min + 1);
int minSquareRoot = (int)Math.Sqrt(min);
foreach(int p in sievePrimes)
{
    int num = p > minSquareRoot ? p*p : ((min + p - 1)/p)*p;
    num -= min;
    for(; num <= max-min; num += p)
    {
        sieve[num] = true;
    }
}
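
simpleSieve isn't shown in the snippet; its name and contract come from the comment above, and one straightforward (assumed) implementation would be:

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// One possible simpleSieve: a plain, single-threaded Sieve of
// Eratosthenes returning every prime p with 2 <= p <= limit.
static List<int> simpleSieve(int limit)
{
    var primes = new List<int>();
    if (limit < 2)
        return primes;
    var composite = new BitArray(limit + 1);
    for (int i = 2; i * i <= limit; i++)
        if (!composite[i])
            for (int j = i * i; j <= limit; j += i)
                composite[j] = true;
    for (int n = 2; n <= limit; n++)
        if (!composite[n])
            primes.Add(n);
    return primes;
}

Console.WriteLine(string.Join(", ", simpleSieve(30)));
// prints: 2, 3, 5, 7, 11, 13, 17, 19, 23, 29
```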

Now, to avoid the threads stepping on each other's toes when adding the primes to the list, each thread should create its own list of primes and append it in one step (I'm not 100% sure that's faster than adding each prime under its own lock, but I'd be surprised if it weren't):

List<int> primes = new List<int>();
for(int offset = 0; offset <= max-min; ++offset)
{
    if (!sieve[offset])
    {
        primes.Add(min + offset);
    }
}
lock(vals) vals.AddRange(primes);

(and vals should be created with an initial capacity of about the expected number of primes to avoid reallocation for each chunk)
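
As a rough sketch of that capacity hint: by the prime number theorem there are about n / ln n primes below n, so one heuristic starting capacity (the 1.15 padding factor is my assumption, not from the answer, since the approximation undercounts at realistic sizes) is:

```csharp
using System;
using System.Collections.Generic;

// Estimate the number of primes below limit via limit / ln(limit),
// padded a little, and preallocate the shared list to that size.
int limit = 10_000_000;
int estimatedPrimes = (int)(1.15 * limit / Math.Log(limit));
var vals = new List<int>(estimatedPrimes);
Console.WriteLine(estimatedPrimes);
```

The true count π(10,000,000) is 664,579, so the padded estimate leaves headroom without reallocating for every chunk.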

Daniel Fischer
  • The best answer I've gotten on SO. Took me a while to understand everything, but your explanations were great. Thanks. If I had one query, it'd be that your method seems to return 1 as a prime number. Was this on purpose? In most cases I thought 1 was disregarded. – windowsgm Oct 10 '12 at 01:11
  • No, that was an oversight. I'm used to starting my sieves at 5 or 7, so I don't need to take special precautions for 1, sorry. – Daniel Fischer Oct 10 '12 at 12:34

Is there a reason for implementing that textbook math-readable-square-root algorithm? Have a look at Prime Number Algorithm

    this should be a comment, not really answering your question. – Ярослав Рахматуллин Oct 09 '12 at 12:56
  • Sieve of Eratosthenes does not lend itself to easy parallelization. – spender Oct 09 '12 at 13:00
  • Your link doesn't appear to be a million miles from what I'm doing. Although I don't know c. How could I apply that c example to my code? – windowsgm Oct 09 '12 at 13:23
  • Ignore my answer. spender has a very good point. That algorithm (suggested by peoro) goes sequentially from 2 to a set limit. There is no simple way to start several threads at different positions. The algorithm has (as far as I can see) O(n) complexity, so it seemed like a good thing to throw at you. You don't need C for it. My answer is silly, I can't really help you with your problem, sorry. – Ярослав Рахматуллин Oct 09 '12 at 13:29

There is already a lot of documentation regarding performance of parallel vs serial code. Microsoft recommends that you read their documentation on using Parallel functionality before converting code to use parallel architecture.

Consider reading the following Q/A I have posted on here:

Nested Parallel.ForEach loops

Performance profiling in .NET

Matthew Layton
  • Are you saying (according to your second link) that my results might not be accurate enough? – windowsgm Oct 09 '12 at 13:41
  • It depends on how you are testing the performance of a serial vs. parallel loop. If you are testing using .NET's Stopwatch class, your results will not be entirely accurate. For best accuracy, you need to run your code through a profiler. Parallel loops are not always suited across all scenarios, sometimes it will be the case that a serial loop is faster! – Matthew Layton Oct 09 '12 at 13:44
  • As an idea, you could use Stopwatch and iterate over your code several times, adding the results together and dividing by the number of iterations, which gives you a rough average of the execution time. The reason it is not ideal to base your analysis on the first run is that on the first run your code is JIT'ed, so you need to compensate for JIT overhead. – Matthew Layton Oct 09 '12 at 13:46
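
A minimal sketch of that warm-up-then-average scheme (workUnderTest is a placeholder for whatever is being measured, e.g. one sieve run; it's not code from the question):

```csharp
using System;
using System.Diagnostics;

// Run once untimed so JIT compilation isn't counted, then average
// several timed runs.
Action workUnderTest = () =>
{
    long sum = 0;
    for (int i = 0; i < 1_000_000; i++)
        sum += i;
};

workUnderTest();               // warm-up run pays the JIT cost; don't time it

const int runs = 5;
var sw = Stopwatch.StartNew();
for (int r = 0; r < runs; r++)
    workUnderTest();
sw.Stop();

double averageMs = sw.Elapsed.TotalMilliseconds / runs;
Console.WriteLine($"average over {runs} runs: {averageMs:F3} ms");
```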

I think all your locks might be a problem: locks are quite expensive, so you should try to avoid them. I don't know the details of your algorithm, but I think you should try to remove the locks somehow. Could the tickets be input? Could each thread have its own output queue that you merge when everything has completed?

Onkelborg
  • I was wondering if the locks could be the problem. Can't think of a way around them though. How would I go about inputting different data into each thread's method when they all begin the same method? – windowsgm Oct 09 '12 at 13:45
  • As far as I can tell, you have the number of threads as input, and then you throw the input away. You should call .AsParallel() on your tickets list instead. Having a fixed number of threads isn't a job for PLINQ, I think. – Onkelborg Oct 09 '12 at 14:10
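
A sketch of that suggestion, assuming the chunk bound is passed into the worker as a parameter rather than pulled from a shared list (the sieve body itself is elided; names mirror the question's):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Parallelise over the tickets themselves: each ticket becomes the
// input of one parallel work item, so no shared ticket list (and no
// lock around it) is needed.
int NOOFTHREADS = 4, interval = 250;
List<int> tickets = Enumerable.Range(1, NOOFTHREADS)
                              .Select(i => i * interval)
                              .ToList();

tickets.AsParallel().ForAll(ticket =>
{
    int max = ticket;
    int min = max - interval + 1;
    // ... sieve the chunk [min, max] here ...
    Console.WriteLine($"chunk {min}..{max}");
});
```

The ForAll lambda receives its chunk's upper bound directly, which removes the lock(tickets) block from the original findPrimes entirely.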