Here is a more concise formulation of the same algorithm, which should make the principle much clearer (part of a full, runnable .cpp for segment size timings @ pastebin). This initialises a packed (odds-only) sieve instead of counting primes, but the principles involved are the same. Download and run the .cpp to see the influence of segment sizes. Basically, the optimum should be around the L1 cache size of your CPU: too small, and the overhead of the increased number of rounds starts to dominate; too big, and you get penalised by the slower timings of the L2 and L3 caches. See also How does segmentation improve the running time of Sieve of Eratosthenes?.
void initialise_packed_sieve_4G (void *data, unsigned segment_bytes = 1 << 15, unsigned end_bit = 1u << 31)
{
    typedef std::vector<prime_and_offset_t>::iterator prime_iter_t;

    std::vector<prime_and_offset_t> small_factors;

    initialise_odd_primes_and_offsets_64K(small_factors);

    unsigned segment_bits = segment_bytes * CHAR_BIT;
    unsigned partial_bits = end_bit % segment_bits;
    unsigned segments     = end_bit / segment_bits + (partial_bits != 0);

    unsigned char *segment = static_cast<unsigned char *>(data);
    unsigned bytes = segment_bytes;

    for ( ; segments--; segment += segment_bytes)
    {
        if (segments == 0 && partial_bits)   // the last segment may be partial
        {
            segment_bits = partial_bits;
            bytes = (partial_bits + CHAR_BIT - 1) / CHAR_BIT;
        }

        std::memset(segment, 0, bytes);

        for (prime_iter_t p = small_factors.begin(); p != small_factors.end(); ++p)
        {
            unsigned n = p->prime;
            unsigned i = p->next_offset;

            for ( ; i < segment_bits; i += n)
            {
                set_bit(segment, i);
            }

            p->next_offset = i - segment_bits;   // carry the offset into the next segment
        }
    }
}
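The helpers referenced above are not shown; here is a minimal sketch of what they might look like (the names come from the calls in the code, but the layout and implementation details are my own assumptions). In the odds-only bitmap, bit i stands for the number 2*i + 1, and a prime n = 2*k + 1 first needs to cross off n*n, which lives at bit index (n*n - 1) / 2 == n*n >> 1:

```cpp
#include <climits>
#include <cstring>
#include <vector>

struct prime_and_offset_t
{
    unsigned prime;        // the sieving prime (odd)
    unsigned next_offset;  // bit index of its next multiple, relative to the upcoming segment
};

inline void set_bit (unsigned char *p, unsigned i)
{
    p[i / CHAR_BIT] |= (unsigned char)(1u << (i % CHAR_BIT));
}

inline bool test_bit (const unsigned char *p, unsigned i)
{
    return (p[i / CHAR_BIT] >> (i % CHAR_BIT)) & 1;
}

// Collect the odd primes below 64K together with the bit index of the first
// composite each of them needs to cross off in the big sieve (n * n >> 1).
void initialise_odd_primes_and_offsets_64K (std::vector<prime_and_offset_t> &v)
{
    enum { SIEVE_BITS = 1 << 15 };                  // odds below 2^16
    std::vector<unsigned char> bits(SIEVE_BITS / CHAR_BIT, 0);

    for (unsigned k = 1; k < SIEVE_BITS; ++k)       // k == 0 stands for the number 1
    {
        if (test_bit(&bits[0], k)) continue;

        unsigned n = 2 * k + 1;                     // the prime itself

        for (unsigned i = (n * n) >> 1; i < SIEVE_BITS; i += n)
            set_bit(&bits[0], i);

        prime_and_offset_t entry = { n, (n * n) >> 1 };
        v.push_back(entry);
    }
}
```

Note that next_offset is initialised as an absolute bit index; for primes whose square lies beyond the first segment, the main loop's `p->next_offset = i - segment_bits` keeps decrementing it by one segment's worth of bits until the prime comes into play.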
If the offsets were not remembered from segment to segment, they would have to be recomputed every time, using at least one division and one multiplication per recomputed index, plus conditionals or serious bit trickery. When sieving the full 2^32 number range (8192 segments of 32 KBytes each) that's at least 53,583,872 slow divisions and the same number of somewhat faster multiplications; this adds roughly one second to the time needed for initialising a full 2^32 sieve (2^31 bits for the odds-only Eratosthenes).
Here's some actual code from one of my older sieves that uses this 'reconstitutive' math:
for (index_t k = 1; k <= max_factor_bit; ++k)
{
    if (bitmap_t::traits::bt(bm.bm, k)) continue;

    index_t n = (k << 1) + 1;     // == index_for_value(value_for_index(k) * 2) == n
    index_t i = square(n) >> 1;   // == index_for_value(square(n))

    if (i < offset)
    {
        i += ((offset - i) / n) * n;
    }

    for ( ; i <= new_max_bit; i += n)
    {
        bitmap_t::traits::bts(bm.bm, i);
    }
}
That takes about 5.5 seconds for the full sieve (VC++); the code shown first takes only 4.5 seconds with the same compiler, or 3.5 seconds using gcc 4.8.1 (MinGW64).
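To make the cost difference concrete, here is a toy sketch (names and signatures are my own, not taken from either sieve above) of the two ways of obtaining a prime's starting bit for a segment:

```cpp
// Recompute: find the first multiple-of-n bit at or after the segment start.
// One integer division and one multiplication per prime, every segment,
// plus a conditional to round up when the division truncates.
unsigned recompute_offset (unsigned first_multiple_bit, unsigned segment_start, unsigned n)
{
    if (first_multiple_bit >= segment_start)
        return first_multiple_bit;

    unsigned i = first_multiple_bit + ((segment_start - first_multiple_bit) / n) * n;

    return i < segment_start ? i + n : i;
}

// Carry: after sieving a segment of segment_bits bits, the crossing-off loop
// leaves i just past the segment's end; a single subtraction turns that into
// the offset relative to the next segment. No division, no multiplication.
unsigned carry_offset (unsigned i_after_loop, unsigned segment_bits)
{
    return i_after_loop - segment_bits;
}
```

The carried offset is exactly what `p->next_offset = i - segment_bits;` stores in the first listing; the division-based recomputation is what the `if (i < offset)` branch of the older sieve has to do instead.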
Here are the gcc timings:
sieve bits = 2147483648 (equiv. number = 4294967295)
segment size 4096 (2^12) bytes ... 4.091 s 1001.2 M/s
segment size 8192 (2^13) bytes ... 3.723 s 1100.2 M/s
segment size 16384 (2^14) bytes ... 3.534 s 1159.0 M/s
segment size 32768 (2^15) bytes ... 3.418 s 1198.4 M/s
segment size 65536 (2^16) bytes ... 3.894 s 1051.9 M/s
segment size 131072 (2^17) bytes ... 4.265 s 960.4 M/s
segment size 262144 (2^18) bytes ... 4.453 s 919.8 M/s
segment size 524288 (2^19) bytes ... 5.002 s 818.9 M/s
segment size 1048576 (2^20) bytes ... 5.176 s 791.3 M/s
segment size 2097152 (2^21) bytes ... 5.135 s 797.7 M/s
segment size 4194304 (2^22) bytes ... 5.251 s 780.0 M/s
segment size 8388608 (2^23) bytes ... 7.412 s 552.6 M/s
digest { 203280221, 0C903F86, 5B253F12, 774A3204 }
Note: an additional second can be shaved from that time by a trick called 'presieving', i.e. blasting a pre-computed pattern into the bitmap instead of zeroing it at the beginning. That brings the gcc timing down to 2.1 s for the full sieve, with this hacked copy of the earlier .cpp. This trick works extremely well together with segmented sieving in cache-sized chunks.
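As a rough illustration of the presieving idea (sizes and names here are my own assumptions, not the code behind the timing above): in an odds-only bitmap the crossing-off pattern of the primes 3, 5 and 7 repeats every 3 * 5 * 7 = 105 bits, so byte-aligned repetition needs lcm(105, 840) = 840 bits = 105 bytes. Precompute one such period once and stamp it into each segment instead of calling memset(); the primes 3, 5 and 7 can then be dropped from the sieving loop entirely:

```cpp
#include <algorithm>
#include <climits>
#include <cstring>

enum { PATTERN_BYTES = 3 * 5 * 7 };   // 105 bytes == 840 bits, one byte-aligned period

// Bit i stands for the odd number 2*i + 1; among the odds, every n-th one is
// a multiple of n, the first being n itself at bit index (n - 1) / 2.
void build_presieve_pattern (unsigned char *pattern)
{
    std::memset(pattern, 0, PATTERN_BYTES);

    static const unsigned primes[] = { 3, 5, 7 };

    for (unsigned j = 0; j < 3; ++j)
    {
        unsigned n = primes[j];

        for (unsigned i = (n - 1) / 2; i < PATTERN_BYTES * CHAR_BIT; i += n)
            pattern[i / CHAR_BIT] |= (unsigned char)(1u << (i % CHAR_BIT));
    }
}

// Stamp the pattern into a segment instead of zeroing it. A 32 KByte segment
// is not a multiple of 105 bytes, so the copy has to start at the pattern
// phase that matches the segment's byte position in the overall bitmap.
// Caveat: in the very first segment the bits for 3, 5 and 7 themselves end
// up set and must be cleared again (and bit 0, which stands for 1, set).
void presieve_segment (unsigned char *segment, unsigned segment_bytes,
                       unsigned segment_start_byte, const unsigned char *pattern)
{
    unsigned phase = segment_start_byte % PATTERN_BYTES;

    for (unsigned done = 0; done < segment_bytes; )
    {
        unsigned chunk = std::min(PATTERN_BYTES - phase, segment_bytes - done);

        std::memcpy(segment + done, pattern + phase, chunk);
        done += chunk;
        phase = 0;
    }
}
```

Extending the pattern to include 11 and 13 as well (period 15015 bytes) trades a bigger pattern buffer for five fewer primes in the inner loop; at some point the pattern stops fitting in L1 and the gains evaporate, which is why presieving and cache-sized segments go hand in hand.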