How do I efficiently sieve through a selected range for prime numbers?

Question

I've been working through Project Euler and Sphere Online Judge problems. In this particular problem, I have to find all the prime numbers between two given numbers. I have a function that looks promising (based on the Sieve of Eratosthenes), except it's too slow. Can someone spot what is slowing my function down so much, and hint at how I can fix it? Also, some comments about how to approach optimization in general (or links to such comments, books, articles, etc.) would be greatly appreciated.

Code:

def ranged_sieve(l, b)
  primes = (l..b).to_a
  primes[0]=nil if primes[0] < 2
  (2..Math.sqrt(b).to_i).each do |counter|
    # first multiple of counter at or above l, but never counter itself
    step_from = ((l + counter - 1) / counter) * counter
    j = [step_from, counter + counter].max
    (j..b).step(counter) do |stepped|
      index = primes.index(stepped)
      primes[index] = nil if index
    end
  end
  primes.compact
end

@FrederickCheung Thanks. I've never used a profiler before, but it looks helpful. — Philosobot, Dec 22 '12 at 15:16
If you use better variable names I'll look at it more closely, but make sure you're only checking odd numbers for primes right off the bat, and skipping even numbers when 'crossing values out' as well. That should halve the number of iterations. — AJcodez, Dec 22 '12 at 15:24

DarthGizka · Answer 1 · 2014-11-11T13:22:42.030

The PRIME1 problem at SPOJ (Sphere Online Judge) is designed so that you cannot simply sieve up to the upper limit; if you do, you will hit the timeout.

One possible approach is superior speed; by adding a few bells and whistles to the standard sieve it can be made to run fast enough to stay well below the timeout limit. Speed optimisations include:

representing only the odd integers in the sieve (50% space savings)
sieving in small, cache-friendly segments that fit into the L1 cache (32 KByte)
presieving by small primes (i.e. blasting a precomputed pattern over the sieve segment)
remembering last (or next) working offset for each prime across segments, instead of recomputing them using slow divisions
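
The first item in the list, an odds-only sieve, might look roughly like this in Ruby (the asker's language). It is only a minimal sketch: the helper name odd_primes_up_to is made up for illustration, and it uses a plain Array of booleans rather than a packed bitmap. Slot i stands for the odd number 2*i + 1, so 2 is never represented and has to be handled separately by the caller.

```ruby
# Odds-only sieve sketch: slot i stands for the odd number 2*i + 1.
# `odd_primes_up_to` is a hypothetical helper name, not from the demo programs.
def odd_primes_up_to(limit)
  return [] if limit < 3
  slots = (limit - 1) / 2 + 1        # slots for 1, 3, 5, ... up to limit
  composite = Array.new(slots, false)
  composite[0] = true                # slot 0 is the number 1, which is not prime
  i = 1
  while (2 * i + 1) * (2 * i + 1) <= limit
    unless composite[i]
      p = 2 * i + 1
      # p*p is odd and lives in slot (p*p - 1) / 2; stepping by p slots
      # moves 2*p in number space, so even multiples are never touched.
      j = (p * p - 1) / 2
      while j < slots
        composite[j] = true
        j += p
      end
    end
    i += 1
  end
  (0...slots).reject { |k| composite[k] }.map { |k| 2 * k + 1 }
end
```

Calling odd_primes_up_to(30) yields the odd primes 3 through 29 while allocating only 15 slots instead of 31.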

Putting all this together cuts the time for sieving the full 2^32 range from something like 20 seconds down to 2 seconds, well below the SPOJ timeout. My pastebin has three simple C++ demo programs where you can inspect each of these optimisations in action and see their effect.

A much simpler approach is to do only the work that is necessary: sieve up to the square root of the last number of the target range to get all potential prime factors, and then sieve only the target range itself. That way you can solve the SPOJ problem in less than two dozen lines of code and a few milliseconds of runtime. I just finished a demo .cpp for this type of segmented sieving (the difficult part was not the sieve but the test frame for comfortable testing, and the verification of proper operation up to 2^64-1, since there is hardly any reference data).

The sieve itself looks like this; the sieve is an odds-only packed bitmap, and the sieve range is specified in bits for robustness (it's all explained in the .cpp), so you would pass (range_start / 2) for offset:

unsigned char odd_composites32[UINT32_MAX / (2 * CHAR_BIT) + 1];   // the small factor sieve
uintxx_t sieved_bits = 0;                                          // how far it's been initialised

void extend_factor_sieve_to_cover (uintxx_t max_factor_bit);       // bit, not number!

void sieve32 (unsigned char *target_segment, uint64_t offset, uintxx_t bit_count)
{
   assert( bit_count > 0 && bit_count <= UINT32_MAX / 2 + 1 );

   uintxx_t max_bit = bit_count - 1;
   uint64_t max_num = 2 * (offset + max_bit) + 1;
   uintxx_t max_factor_bit = (max_factor32(max_num) - 1) / 2;

   if (target_segment != odd_composites32)
   {
      extend_factor_sieve_to_cover(max_factor_bit);
   }

   std::memset(target_segment, 0, std::size_t((max_bit + CHAR_BIT) / CHAR_BIT));

   for (uintxx_t i = 3u >> 1; i <= max_factor_bit; ++i)
   {
      if (bit(odd_composites32, i))  continue;

      uintxx_t n = (i << 1) + 1;   // the actual prime represented by bit i (< 2^32)

      uintxx_t stride = n;         // == (n * 2) / 2
      uint64_t start = (uint64_t(n) * n) >> 1;
      uintxx_t k;

      if (start >= offset)
      {
         k = uintxx_t(start - offset);
      }
      else // start < offset
      {
         uintxx_t before_the_segment = (offset - start) % stride;

         k = before_the_segment == 0 ? 0 : stride - before_the_segment;
      }

      while (k <= max_bit)
      {
         set_bit(target_segment, k);

         // k can wrap since strides go up to almost 2^32
         if ((k += stride) < stride)
         {
            break;
         }
      }
   }
}

For the SPOJ problem - numbers less than 2^32 - unsigned integers are sufficient for all variables (i.e. uint32_t instead of uintxx_t and uint64_t) and some things could be simplified further. Also, you can use sqrt() instead of max_factor32() for these small ranges.

In the demo code, extend_factor_sieve_to_cover() does the moral equivalent of sieve32(odd_composites32, 0, max_factor_bit + 1) in small, cache-friendly steps. For the SPOJ problem you can simply use the single sieve32() call since there are only 6541 small odd prime factors in numbers less than 2^32, which you can sieve in no time flat.

Hence the trick to solving this SPOJ problem is using segmented sieving, which cuts total runtime to a few milliseconds.
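
For Ruby readers, the same idea can be sketched roughly as follows. This is not a port of sieve32: it uses plain boolean arrays instead of an odds-only bitmap, and the names base_primes and primes_in_range are assumptions made for this illustration. Integer.sqrt needs Ruby 2.5 or later.

```ruby
# Plain sieve for the potential prime factors up to `limit` (inclusive).
def base_primes(limit)
  return [] if limit < 2
  is_prime = Array.new(limit + 1, true)
  is_prime[0] = is_prime[1] = false
  (2..Integer.sqrt(limit)).each do |p|
    next unless is_prime[p]
    (p * p).step(limit, p) { |m| is_prime[m] = false }
  end
  (2..limit).select { |n| is_prime[n] }
end

# Sieve only the window [low, high]; number n lives in slot n - low.
def primes_in_range(low, high)
  low = 2 if low < 2
  return [] if high < low
  is_prime = Array.new(high - low + 1, true)
  base_primes(Integer.sqrt(high)).each do |p|
    # First multiple of p inside the window, but never below p*p,
    # so that p itself survives when it lies inside the window.
    first = [p * p, ((low + p - 1) / p) * p].max
    first.step(high, p) { |m| is_prime[m - low] = false }
  end
  (low..high).select { |n| is_prime[n - low] }
end
```

The memory used depends only on the width of the window and on the square root of its upper end, not on the upper end itself, which is what keeps the work small for ranges that sit far from zero.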

Sorry for the indecent resurrection - my FireFox RSS served this up as 'new' and I didn't look at the date... Mea culpa. — DarthGizka, Nov 10 '14 at 09:18
Can you please explain what this means: "remembering last (or next) working offset for each prime across segments, instead of recomputing them using slow divisions"? Also, the size of the L1 cache is system dependent, so we would need to know the cache size of the cluster at SPOJ, am I correct? Won't it be a bad solution in that case? — Prakhar Agrawal, Feb 27 '17 at 04:41
@Prakhar: sieving the target range in smaller (L1-sized) segments also means iterating over all sieve primes for each and every segment, and hence repeating the expensive modulo divisions for calculating the starting offset for a given prime relative to the current segment. These expensive recalculations can be avoided by remembering the current working offset for each prime, so that computing the starting offset for a given prime in the next segment reduces to subtracting the segment size (if you reuse the same buffer for all segments, making each segment have a physical start offset of 0). — DarthGizka, Feb 27 '17 at 15:16
A data L1 size of 32K is surprisingly common nowadays and a good value to use if nothing else is known. The performance loss that occurs if you guess wrong is surprisingly soft/gradual - execution times won't multiply if you guess a bit too small or a bit too big. — DarthGizka, Feb 27 '17 at 15:21
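
The offset-carrying described in the comments above can be sketched in Ruby along these lines. It is an illustration under assumptions rather than the code from the demo programs: the names small_primes and segmented_primes are invented here, the buffer is a plain Array instead of a bitmap, and the default segment_size of 32 * 1024 slots merely echoes the 32K figure without being tuned to any real cache.

```ruby
# Plain sieve for the sieving primes up to `limit`.
def small_primes(limit)
  sieve = Array.new(limit + 1, true)
  sieve[0] = false
  sieve[1] = false if limit >= 1
  (2..Integer.sqrt(limit)).each do |p|
    next unless sieve[p]
    (p * p).step(limit, p) { |m| sieve[m] = false }
  end
  (0..limit).select { |n| sieve[n] }
end

def segmented_primes(low, high, segment_size = 32 * 1024)
  low = 2 if low < 2
  return [] if high < low
  primes = small_primes(Integer.sqrt(high))
  # The only division per prime: offset of its first multiple to cross out,
  # measured from the start of the first segment.
  offsets = primes.map { |p| [p * p, ((low + p - 1) / p) * p].max - low }
  buffer = Array.new(segment_size)   # reused for every segment
  found = []
  seg_start = low
  while seg_start <= high
    len = [segment_size, high - seg_start + 1].min
    buffer.fill(true)
    primes.each_with_index do |p, i|
      k = offsets[i]
      while k < len
        buffer[k] = false
        k += p
      end
      # The next segment starts segment_size slots later, so the carried
      # offset is a subtraction instead of a fresh modulo division.
      offsets[i] = k - segment_size
    end
    len.times { |j| found << seg_start + j if buffer[j] }
    seg_start += segment_size
  end
  found
end
```

Because every segment reuses the same buffer with a physical start of 0, changing segment_size only changes how the work is chunked; the list of primes returned stays the same.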

sawa · Accepted Answer · 2012-12-22T15:56:52.993

I haven't looked at it fully, but one factor is that you are replacing certain values in primes with nil and later compact-ing the array to remove them. This is a waste. Doing the removal directly with delete_at makes it more than twice as fast:

def ranged_sieve2(l, b)
  primes = (l..b).to_a
  primes.delete_at(0) if primes[0] < 2
  (2..Math.sqrt(b).to_i).each do |counter|
    # first multiple of counter at or above l, but never counter itself
    step_from = ((l + counter - 1) / counter) * counter
    j = [step_from, counter + counter].max
    (j..b).step(counter) do |stepped|
      index = primes.index(stepped)
      primes.delete_at(index) if index
    end
  end
  primes
end

ranged_sieve(1, 100) # => Took approx 8e-4 seconds on my computer
ranged_sieve2(1, 100) # => Took approx 3e-4 seconds on my computer

Another point to improve is that using a hash is much faster than an array as the relevant size gets larger. Replacing your array implementation with a hash, you can get this:

def ranged_sieve3(l, b)
  primes = (l..b).inject({}){|h, i| h[i] = true; h}
  primes.delete(0)
  primes.delete(1)
  (2..Math.sqrt(b).to_i).each do |counter|
    # first multiple of counter at or above l, but never counter itself
    step_from = ((l + counter - 1) / counter) * counter
    j = [step_from, counter + counter].max
    (j..b).step(counter) do |stepped|
      primes.delete(stepped)
    end
  end
  primes.keys
end

When you run ranged_sieve3(1, 100), it is slower than ranged_sieve2(1, 100) because of the overhead. But as you make the range larger, the advantage becomes clear. For example, I got this result on my computer:

ranged_sieve(1, 1000) # => Took 1e-01 secs
ranged_sieve2(1, 1000) # => Took 3e-02 secs
ranged_sieve3(1, 1000) # => Took 8e-04 secs
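
One likely contributor to the gap is that primes.index(stepped) searches the array from the front on every call, whereas a hash lookup does not. If you would rather stay with an array, the slot of a number can be computed directly as stepped - l. A rough sketch of that variant follows; the name ranged_sieve4 is made up here, I have not timed it, and Integer.sqrt assumes Ruby 2.5 or later.

```ruby
# Array-based variant that computes each slot directly instead of searching.
# `ranged_sieve4` is a hypothetical name for this illustration.
def ranged_sieve4(l, b)
  l = 2 if l < 2
  return [] if b < l
  primes = (l..b).to_a
  (2..Integer.sqrt(b)).each do |counter|
    # first multiple of counter at or above l, but never counter itself
    j = [((l + counter - 1) / counter) * counter, 2 * counter].max
    # the number `stepped` sits in slot stepped - l, so no lookup is needed
    j.step(b, counter) { |stepped| primes[stepped - l] = nil }
  end
  primes.compact
end
```

Here ranged_sieve4(5, 30) returns [5, 7, 11, 13, 17, 19, 23, 29].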

Nice job. ranged_sieve3 has the kind of speed I'm looking for. I just started ruby two weeks ago, and was putting off learning hashes because it seemed too unfamiliar at the moment. But I'm definitely looking into it now. Thanks so much for your help :) — Philosobot, Dec 22 '12 at 15:43
