Here is a more concise formulation of the same algorithm, which should make the principle much clearer (part of a full, runnable .cpp for segment size timings @ pastebin). This initialises a packed (odds-only) sieve instead of counting primes, but the principles involved are the same. Download and run the .cpp to see the influence of segment sizes. Basically, the optimum should be around the L1 cache size of your CPU: too small, and the overhead of the increased number of rounds starts to dominate; too big, and you get penalised by the slower timings of the L2 and L3 caches. See also How does segmentation improve the running time of Sieve of Eratosthenes?.
void initialise_packed_sieve_4G (void *data, unsigned segment_bytes = 1 << 15, unsigned end_bit = 1u << 31)
{
    typedef std::vector<prime_and_offset_t>::iterator prime_iter_t;

    std::vector<prime_and_offset_t> small_factors;

    initialise_odd_primes_and_offsets_64K(small_factors);

    unsigned segment_bits = segment_bytes * CHAR_BIT;
    unsigned partial_bits = end_bit % segment_bits;
    unsigned segments     = end_bit / segment_bits + (partial_bits != 0);

    unsigned char *segment = static_cast<unsigned char *>(data);
    unsigned bytes = segment_bytes;

    for ( ; segments--; segment += segment_bytes)
    {
        if (segments == 0 && partial_bits)   // the last segment may be partial
        {
            segment_bits = partial_bits;
            bytes = (partial_bits + CHAR_BIT - 1) / CHAR_BIT;
        }

        std::memset(segment, 0, bytes);

        for (prime_iter_t p = small_factors.begin(); p != small_factors.end(); ++p)
        {
            unsigned n = p->prime;
            unsigned i = p->next_offset;

            for ( ; i < segment_bits; i += n)
            {
                set_bit(segment, i);
            }

            p->next_offset = i - segment_bits;   // carry the offset into the next segment
        }
    }
}
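The helpers referenced above are not shown; here is a minimal sketch of what they might look like (the names come from the calls in the code, but the layout and implementation details are my own assumptions). In the odds-only bitmap, bit i stands for the number 2*i + 1, and a prime n = 2*k + 1 first needs to cross off n*n, which lives at bit index (n*n - 1) / 2 == n*n >> 1:

```cpp
#include <climits>
#include <cstring>
#include <vector>

struct prime_and_offset_t
{
    unsigned prime;        // the sieving prime (odd)
    unsigned next_offset;  // bit index of its next multiple, relative to the upcoming segment
};

inline void set_bit (unsigned char *p, unsigned i)
{
    p[i / CHAR_BIT] |= (unsigned char)(1u << (i % CHAR_BIT));
}

inline bool test_bit (const unsigned char *p, unsigned i)
{
    return (p[i / CHAR_BIT] >> (i % CHAR_BIT)) & 1;
}

// Collect the odd primes below 64K together with the bit index of the first
// composite each of them needs to cross off in the big sieve (n * n >> 1).
void initialise_odd_primes_and_offsets_64K (std::vector<prime_and_offset_t> &v)
{
    enum { SIEVE_BITS = 1 << 15 };                  // odds below 2^16
    std::vector<unsigned char> bits(SIEVE_BITS / CHAR_BIT, 0);

    for (unsigned k = 1; k < SIEVE_BITS; ++k)       // k == 0 stands for the number 1
    {
        if (test_bit(&bits[0], k)) continue;

        unsigned n = 2 * k + 1;                     // the prime itself

        for (unsigned i = (n * n) >> 1; i < SIEVE_BITS; i += n)
            set_bit(&bits[0], i);

        prime_and_offset_t entry = { n, (n * n) >> 1 };
        v.push_back(entry);
    }
}
```

Note that next_offset is initialised as an absolute bit index; for primes whose square lies beyond the first segment, the main loop's `p->next_offset = i - segment_bits` keeps decrementing it by one segment's worth of bits until the prime comes into play.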
If the offsets were not remembered from segment to segment, they would have to be recomputed every time, using at least one division and one multiplication per recomputed index, plus conditionals or serious bit trickery. When sieving the full 2^32 number range (8192 segments of 32 KBytes each) that's at least 53,583,872 slow divisions and the same number of somewhat faster multiplications; this adds roughly one second to the time needed for initialising a full 2^32 sieve (2^31 bits for the odds-only Eratosthenes).
Here's some actual code from one of my older sieves that uses this 'reconstitutive' math:
for (index_t k = 1; k <= max_factor_bit; ++k)
{
    if (bitmap_t::traits::bt(bm.bm, k)) continue;

    index_t n = (k << 1) + 1;     // == index_for_value(value_for_index(k) * 2) == n
    index_t i = square(n) >> 1;   // == index_for_value(square(n))

    if (i < offset)
    {
        i += ((offset - i) / n) * n;
    }

    for ( ; i <= new_max_bit; i += n)
    {
        bitmap_t::traits::bts(bm.bm, i);
    }
}
That takes about 5.5 seconds for the full sieve (VC++); the code shown first takes only 4.5 seconds with the same compiler, or 3.5 seconds using gcc 4.8.1 (MinGW64).
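To make the cost difference concrete, here is a toy sketch (names and signatures are my own, not taken from either sieve above) of the two ways of obtaining a prime's starting bit for a segment:

```cpp
// Recompute: find the first multiple-of-n bit at or after the segment start.
// One integer division and one multiplication per prime, every segment,
// plus a conditional to round up when the division truncates.
unsigned recompute_offset (unsigned first_multiple_bit, unsigned segment_start, unsigned n)
{
    if (first_multiple_bit >= segment_start)
        return first_multiple_bit;

    unsigned i = first_multiple_bit + ((segment_start - first_multiple_bit) / n) * n;

    return i < segment_start ? i + n : i;
}

// Carry: after sieving a segment of segment_bits bits, the crossing-off loop
// leaves i just past the segment's end; a single subtraction turns that into
// the offset relative to the next segment. No division, no multiplication.
unsigned carry_offset (unsigned i_after_loop, unsigned segment_bits)
{
    return i_after_loop - segment_bits;
}
```

The carried offset is exactly what `p->next_offset = i - segment_bits;` stores in the first listing; the division-based recomputation is what the `if (i < offset)` branch of the older sieve has to do instead.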
Here are the gcc timings:
sieve bits = 2147483648 (equiv. number = 4294967295)
segment size 4096 (2^12) bytes ... 4.091 s 1001.2 M/s
segment size 8192 (2^13) bytes ... 3.723 s 1100.2 M/s
segment size 16384 (2^14) bytes ... 3.534 s 1159.0 M/s
segment size 32768 (2^15) bytes ... 3.418 s 1198.4 M/s
segment size 65536 (2^16) bytes ... 3.894 s 1051.9 M/s
segment size 131072 (2^17) bytes ... 4.265 s 960.4 M/s
segment size 262144 (2^18) bytes ... 4.453 s 919.8 M/s
segment size 524288 (2^19) bytes ... 5.002 s 818.9 M/s
segment size 1048576 (2^20) bytes ... 5.176 s 791.3 M/s
segment size 2097152 (2^21) bytes ... 5.135 s 797.7 M/s
segment size 4194304 (2^22) bytes ... 5.251 s 780.0 M/s
segment size 8388608 (2^23) bytes ... 7.412 s 552.6 M/s
digest { 203280221, 0C903F86, 5B253F12, 774A3204 }
Note: an additional second can be shaved from that time by a trick called 'presieving', i.e. blasting a pre-computed pattern into the bitmap instead of zeroing it at the beginning. That brings the gcc timing down to 2.1 s for the full sieve, with this hacked copy of the earlier .cpp. This trick works extremely well together with segmented sieving in cache-sized chunks.
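As a rough illustration of the presieving idea (sizes and names here are my own assumptions, not the code behind the timing above): in an odds-only bitmap the crossing-off pattern of the primes 3, 5 and 7 repeats every 3 * 5 * 7 = 105 bits, so byte-aligned repetition needs lcm(105, 840) = 840 bits = 105 bytes. Precompute one such period once and stamp it into each segment instead of calling memset(); the primes 3, 5 and 7 can then be dropped from the sieving loop entirely:

```cpp
#include <algorithm>
#include <climits>
#include <cstring>

enum { PATTERN_BYTES = 3 * 5 * 7 };   // 105 bytes == 840 bits, one byte-aligned period

// Bit i stands for the odd number 2*i + 1; among the odds, every n-th one is
// a multiple of n, the first being n itself at bit index (n - 1) / 2.
void build_presieve_pattern (unsigned char *pattern)
{
    std::memset(pattern, 0, PATTERN_BYTES);

    static const unsigned primes[] = { 3, 5, 7 };

    for (unsigned j = 0; j < 3; ++j)
    {
        unsigned n = primes[j];

        for (unsigned i = (n - 1) / 2; i < PATTERN_BYTES * CHAR_BIT; i += n)
            pattern[i / CHAR_BIT] |= (unsigned char)(1u << (i % CHAR_BIT));
    }
}

// Stamp the pattern into a segment instead of zeroing it. A 32 KByte segment
// is not a multiple of 105 bytes, so the copy has to start at the pattern
// phase that matches the segment's byte position in the overall bitmap.
// Caveat: in the very first segment the bits for 3, 5 and 7 themselves end
// up set and must be cleared again (and bit 0, which stands for 1, set).
void presieve_segment (unsigned char *segment, unsigned segment_bytes,
                       unsigned segment_start_byte, const unsigned char *pattern)
{
    unsigned phase = segment_start_byte % PATTERN_BYTES;

    for (unsigned done = 0; done < segment_bytes; )
    {
        unsigned chunk = std::min(PATTERN_BYTES - phase, segment_bytes - done);

        std::memcpy(segment + done, pattern + phase, chunk);
        done += chunk;
        phase = 0;
    }
}
```

Extending the pattern to include 11 and 13 as well (period 15015 bytes) trades a bigger pattern buffer for five fewer primes in the inner loop; at some point the pattern stops fitting in L1 and the gains evaporate, which is why presieving and cache-sized segments go hand in hand.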