
The standard Sieve of Eratosthenes crosses out most composites multiple times; in fact the only ones that get marked just once are those with a single prime factor not exceeding their square root, such as the products of exactly two primes and the prime powers. Naturally, the overdraw increases as the sieve gets bigger.

For an odd sieve (i.e. without the evens) the overdraw hits 100% for n = 3,509,227, with 1,503,868 composites and 1,503,868 crossings-out of already crossed-out numbers. For n = 2^32 the overdraw rises to 134.25% (overdraw 2,610,022,328 vs. pop count 1,944,203,427 = (2^32 / 2) - 203,280,221).
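
For anyone who wants to reproduce these numbers, here is a minimal sketch of an odds-only Eratosthenes that counts the overdraw, using std::vector<bool> instead of a tuned bitmap (the function name is invented for this post):

#include <cstdint>
#include <vector>

// Odds-only Sieve of Eratosthenes up to n; bit k stands for the odd number 2*k+1.
// Returns the overdraw, i.e. the number of crossings-out of already crossed-out bits.
uint64_t overdraw_odds_only_SoE (uint64_t n)
{
   uint64_t max_bit = (n - 1) / 2;   // index of the highest odd number <= n
   std::vector<bool> composite(max_bit + 1, false);
   uint64_t overdraw = 0;

   for (uint64_t i = 1; (2 * i + 1) * (2 * i + 1) <= n; ++i)
   {
      if (composite[i])  continue;   // 2*i+1 is composite, skip it

      uint64_t p = 2 * i + 1;

      for (uint64_t k = (p * p) / 2; k <= max_bit; k += p)
      {
         overdraw += composite[k];   // true -> this is a redundant crossing-out
         composite[k] = true;
      }
   }

   return overdraw;
}

Called with n = 3,509,227 this should report the 1,503,868 redundant crossings-out quoted above.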

The Sieve of Sundaram - with another explanation at maths.org - may be a bit smarter about that, if - and only if - the loop limits are computed intelligently. However, that's something which the sources I've seen seem to gloss over as 'optimisation', and it also seems that an unoptimised Sundaram gets beaten by an odds-only Eratosthenes every time.

The interesting thing is that both create exactly the same final bitmap, i.e. one where bit k corresponds to the number (2 * k + 1). So both algorithms must end up setting exactly the same bits, they just have different ways of going about it.
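
Spelled out as code (helper names invented for illustration):

#include <cstdint>

// Both sieves use the same mapping between bit indices and odd numbers:
// bit 0 = 1, bit 1 = 3, bit 2 = 5, ..., bit k = 2*k+1
inline uint32_t number_for_bit (uint32_t k) { return 2 * k + 1; }
inline uint32_t bit_for_number (uint32_t n) { return (n - 1) / 2; }   // n must be odd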

Does someone have hands-on experience with a competitive, tuned Sundaram? Can it beat the old Greek?

I have slimmed the code for my small factor sieve (2^32, an odds-only Greek) and tuned the segment size to 256 KB, which is as optimal on the old Nehalem with its 256 KB L2 as on the newer CPUs (even though the latter are much more forgiving about bigger segments). But now I have hit a brick wall and the bloody sieve still takes 8.5 s to initialise. Loading the sieve from hard disk is not a very attractive option, and multi-threading is difficult to do in a portable manner (since libs like boost tend to put a monkey wrench into portability)...
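
For context, the segmented part of such a sieve has roughly this shape; this is only a sketch with invented names, not my actual code, and it assumes the small factor primes have already been collected:

#include <cstdint>
#include <cstring>
#include <vector>

const uint32_t SEGMENT_BYTES = 256 * 1024;   // tuned to the 256 KB L2 cache

// Cross out the odd composites whose bit indices lie in [low, high),
// one segment of the full bitmap at a time.
void sieve_segment (const std::vector<uint32_t> &primes,   // odd primes covering the range
                    uint64_t low, uint64_t high,
                    uint8_t *segment)                      // SEGMENT_BYTES * 8 bits
{
   std::memset(segment, 0, SEGMENT_BYTES);

   for (uint32_t p : primes)
   {
      uint64_t k = uint64_t(p) * p / 2;      // index of p * p, the first index to cross out

      if (k < low)                           // advance to the first multiple in this segment
         k = low + (p - (low - k) % p) % p;

      for ( ; k < high; k += p)
         segment[(k - low) / 8] |= uint8_t(1u << ((k - low) & 7));
   }
}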

Can Sundaram shave a few seconds from the startup time?

P.S.: the overdraw as such is not a problem and will be absorbed by the L2 cache. The point is that the standard Eratosthenes seems to do more than twice the necessary work, which indicates that there may be potential for doing less work, faster.

DarthGizka
  • A number is crossed off once for each prime factor, so a product of two primes is crossed out twice. The only numbers crossed out once are the powers of primes. But it seems like Sundaram's will cross out for every distinct factorization of an odd number. That may grow even faster than Eratosthenes in the long run. – Teepeemm Oct 30 '14 at 15:42
  • [Sieve of Atkin](http://en.wikipedia.org/wiki/Sieve_of_Atkin) is what you really want. – Sam Harwell Oct 30 '14 at 15:42
  • @Teepeemm: the products of any two primes are crossed out exactly once with Eratosthenes, for the lower of the two factors. The loop for the higher factor starts at its own square, thus avoiding lower multiples of itself (which must have been crossed out already). – DarthGizka Oct 30 '14 at 16:33
  • @Sam Harwell: how fast is Atkin for small primes (up to 2^32, for initialising the small factor sieve)? I have heard that Atkin's isn't very competitive for small primes. – DarthGizka Oct 30 '14 at 16:35
  • Check out the answers by GordonBGood on the Sieve of Atkin. His conclusion is that it will always be slower than an SoE on a 2-3-5-7(-11) wheel, because a 2-3-5 wheel is "baked" into Atkin's. About the segment size, I'd use one that is *smaller* than my cache, at least by 20%, just to be safe. – Will Ness Oct 31 '14 at 20:16
  • IIRC Sundaram is supposed to be slower than SoE. The multiple markings off of the composites is what gives it the *log log n* factor. If you want to spare it and cross off each exactly once, you're left with the "Euler's sieve" but it is *essentially* non-local (meaning, it should be okay only for contiguous sieves). -- what is this "initialization" you mention? what is there to initialize at all, for the segmented sieve? – Will Ness Oct 31 '14 at 20:23
  • @Will Ness: thanks, that jibes with results reported elsewhere; I found a paper - [An Analysis of Two Prime Number Sieves](http://research.cs.wisc.edu/techreports/1991/TR1028.pdf) by Jonathan Sorenson - which pits the SoE against Pritchard's sieve and analyses them in great detail. Also, I found by accident that zeroing a segment before sieving it improves the time for initialising the full 2^32 sieve (a 2^31 bit odds-only SoE) from 8.5 s to 5.2 s, with 256 KByte segment size still optimal. This means that the memory cache issue is crucial to overall performance, and it suggests new avenues for exploration. – DarthGizka Oct 31 '14 at 22:42
  • 'Initialisation' refers to sieving the small factors, i.e. prepping the small factors sieve up to the square root of the upper end of the target range. The small factors sieve is global and gets extended on demand. With target ranges for the segmented sieve up to 2^64-1, most of the time the first request causes the small factors sieve to extend to somewhere between 2^31 and 2^32-1. Which used to mean a 10-second wait before first results started rolling in; now it's 5..6 seconds. Meanwhile I found [the fastest C++ SoE in the world](http://primesieve.org/); they do the full 2^32 in < 1 second – DarthGizka Oct 31 '14 at 22:52
  • I've yet to familiarise myself with Code Review, and the code is still too much in flux for it anyway. So I uploaded [a stand-alone .cpp for testing the speed at different segment sizes](http://pastebin.com/WY5twnbq) to pastebin; with pre-sieving and segment size 32 KBytes the gcc time is 2.1 seconds for the full sieve. – DarthGizka Nov 03 '14 at 23:43
  • Have a look at [this](http://stackoverflow.com/a/41434702/4543207). I have recently modified the Sieve of Sundaram to cut the unnecessary loops and it seems about twice as performant as the Sieve of Eratosthenes. I guess it's the fastest algorithm in the topic that it's listed under. – Redu Jan 03 '17 at 18:01
  • @Redu: the Sieve of Sundaram is *strictly worse* than an odds-only Sieve of Eratosthenes. SoS in its standard form uses multiplication for computing the indices of cells to be crossed off instead of additive striding as in the SoE. The second flaw is that SoS fails to skip non-primes in the outer loop (resulting in additional redundant crossings-off of multiples of non-primes). If you fix both flaws (see the step-by-step account in my answer) then what's left is an odds-only SoE, confirming once more the old quip that the SoE was quite an improvement on most of its successors. ;-) – DarthGizka Jan 04 '17 at 06:27
  • Thanks for your reply. I believe I have eliminated the redundant (multiple) crossing out of the non-primes by setting a dynamic limit on the second loop, and the first loop also runs only up to where it's necessary. I will explain the limits of the loops later today in my answer. SoS automatically handles the odds-only issue by mapping the remaining figures through 2n+1 to obtain the final list of primes. So I would expect that's as fast as it gets out of that algorithm. – Redu Jan 04 '17 at 07:00
  • I have modified my algorithm further. If you are interested I would like you to check [this](http://stackoverflow.com/a/41434702/4543207) and kindly tell me what you think. I have managed to get the 50,847,534 primes up to 10^9 in just 17,000 ms with a single JS thread, without workers. FYI, under the same conditions an empty loop counting up to 10^9 takes 4,000 ms. It turns out to be faster than any implementation of the Eratosthenes or Atkin sieves that I could find on the net. – Redu Jan 17 '17 at 19:49
  • @Redu: Please post the code on Code Review and then link it here in a new comment - that way you can profit from the experience of JavaScript gurus (which I'm definitely not). JavaScript is totally different from the usual imperative languages that (eventually) compile to machine code, and it often requires radically different approaches to really make it sing... It stands to reason, though, that an SoS cannot be faster than an SoE if implemented with the same care, since it's basically the same except that it does more crossing-off iterations (and hence more crossings-off in total). – DarthGizka Jan 18 '17 at 17:21

2 Answers


Since there weren't any takers for the 'Sundaram vs. Eratosthenes' problem, I sat down and analysed it. Result: classic Sundaram's has strictly higher overdraw than an odds-only Eratosthenes; if you apply an obvious, small optimisation then the overdraw is exactly the same - for reasons that shall become obvious. If you fix Sundaram's to avoid overdraw entirely then you get something like Pritchard's Sieve, which is massively more complicated.

The exposition of Sundaram's Sieve in Lucky's Notes is probably the best by far; slightly rewritten to use a hypothetical (i.e. non-standard and not supplied here) type bitmap_t, it looks somewhat like this. In order to measure overdraw the bitmap type needs an operation corresponding to the BTS (bit test and set) CPU instruction, which is available via the _bittestandset() intrinsic with Wintel compilers and with MinGW editions of gcc. The intrinsics are very bad for performance but very convenient for counting overdraw.
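
For reference, here is a minimal stand-in for bitmap_t that is just enough to run the code below; it is a sketch, not the tuned implementation used for the timings. bts() returns the previous state of the bit, mirroring the CPU instruction, and bt() is a plain bit test (used further down):

#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

struct bitmap_t
{
   std::vector<uint32_t> words;

   explicit bitmap_t (uint32_t bits) : words((bits + 31) / 32) { }

   void set_all (uint32_t value)
   {
      std::memset(words.data(), value ? 0xFF : 0x00, words.size() * sizeof(uint32_t));
   }

   bool bt (uint32_t index) const          // bit test
   {
      return ((words[index / 32] >> (index % 32)) & 1u) != 0;
   }

   bool bts (uint32_t index)               // bit test and set; returns the old bit
   {
      bool old_bit = bt(index);
      words[index / 32] |= uint32_t(1) << (index % 32);
      return old_bit;
   }
};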

Note: for sieving all primes up to N one would call the sieve with max_bit = N/2; if bit i of the resulting bitmap is set then the number (2 * i + 1) is composite. The function has '31' in its name because the index math breaks for bitmaps greater than 2^31; hence this code can only sieve numbers up to 2^32-1 (corresponding to max_bit <= 2^31-1).

uint64_t Sundaram31_a (bitmap_t &bm, uint32_t max_bit)
{
   assert( max_bit <= UINT32_MAX / 2 );

   uint32_t m = max_bit;
   uint64_t overdraw = 0;

   bm.set_all(0);

   for (uint32_t i = 1; i < m / 2; ++i)
   {
      for (uint32_t j = i; j <= (m - i) / (2 * i + 1); ++j)
      {
         uint32_t k = i + j + 2 * i * j;   // 2*k+1 == (2*i+1) * (2*j+1)

         overdraw += bm.bts(k);
      }
   }

   return overdraw;
}
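
To make the calling convention from the note above concrete, here is a hypothetical driver (it relies on the bitmap_t sketch given earlier):

#include <cstdio>

int main ()
{
   uint32_t N = 100;                  // sieve the primes up to N (illustrative)
   uint32_t max_bit = N / 2;
   bitmap_t bm(max_bit + 1);

   uint64_t overdraw = Sundaram31_a(bm, max_bit);

   printf("2");                       // 2 is prime but the bitmap holds odd numbers only

   for (uint32_t k = 1; 2 * k + 1 <= N; ++k)
      if (!bm.bt(k))                  // bit clear -> 2*k+1 is prime
         printf(" %u", 2 * k + 1);

   printf("\noverdraw: %llu\n", (unsigned long long)overdraw);
   return 0;
}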

Lucky's bound on j is exact but the one on i is very loose. Tightening it and losing the m alias that I had added to make the code look more like common expositions on the net, we get:

uint64_t Sundaram31_b (bitmap_t &bm, uint32_t max_bit)
{
   uint32_t i_max = uint32_t(std::sqrt(double(2 * max_bit + 1)) - 1) / 2;
   uint64_t overdraw = 0;

   bm.set_all(0);

   for (uint32_t i = 1; i <= i_max; ++i)
   {
      for (uint32_t j = i; j <= (max_bit - i) / (2 * i + 1); ++j)
      {
         uint32_t k = i + j + 2 * i * j;

         overdraw += bm.bts(k);
      }
   }

   return overdraw;
}

The assert was dumped in order to reduce noise but it is actually still valid and necessary. Now it's time for a bit of strength reduction, turning multiplication into iterated addition:

uint64_t Sundaram31_c (bitmap_t &bm, uint32_t max_bit)
{
   uint32_t i_max = uint32_t(std::sqrt(double(2 * max_bit + 1)) - 1) / 2;
   uint64_t overdraw = 0;

   bm.set_all(0);

   for (uint32_t i = 1; i <= i_max; ++i)
   {
      uint32_t n = 2 * i + 1;
      uint32_t k = n * i + i;   // <= max_bit because that's how we computed i_max
      uint32_t j_max = (max_bit - i) / n;

      for (uint32_t j = i; j <= j_max; ++j, k += n)
      {
         overdraw += bm.bts(k);
      }
   }

   return overdraw;
}

Transforming the loop condition to use k allows us to lose j; things should be looking exceedingly familiar by now...

uint64_t Sundaram31_d (bitmap_t &bm, uint32_t max_bit)
{
   uint32_t i_max = uint32_t(std::sqrt(double(2 * max_bit + 1)) - 1) / 2;
   uint64_t overdraw = 0;

   bm.set_all(0);

   for (uint32_t i = 1; i <= i_max; ++i)
   {
      uint32_t n = 2 * i + 1;     
      uint32_t k = n * i + i;   // <= max_bit because that's how we computed i_max

      for ( ; k <= max_bit; k += n)
      {        
         overdraw += bm.bts(k);
      }
   }

   return overdraw;
}

With things looking the way they do, it's time to analyse whether a certain obvious small change is warranted by the math. The proof is left as an exercise for the reader...

uint64_t Sundaram31_e (bitmap_t &bm, uint32_t max_bit)
{
   uint32_t i_max = uint32_t(std::sqrt(double(2 * max_bit + 1)) - 1) / 2;
   uint64_t overdraw = 0;

   bm.set_all(0);

   for (uint32_t i = 1; i <= i_max; ++i)
   {
      if (bm.bt(i))  continue;   // 2*i+1 is already crossed out, i.e. composite: skip it

      uint32_t n = 2 * i + 1;     
      uint32_t k = n * i + i;   // <= max_bit because that's how we computed i_max

      for ( ; k <= max_bit; k += n)
      {        
         overdraw += bm.bts(k);
      }
   }

   return overdraw;
}

The only thing that still differs from classic odds-only Eratosthenes (apart from the name) is the initial value for k, which is normally (n * n) / 2 for the old Greek. However, substituting 2 * i + 1 for n the difference turns out to be 1/2, which rounds to 0. Hence, Sundaram's is odds-only Eratosthenes without the 'optimisation' of skipping composites to avoid at least some crossings-out of already crossed-out numbers. The value for i_max is the same as the Greek's max_factor_bit, just arrived at using completely different logical steps and computed using a marginally different formula.
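
Spelled out, with n = 2 * i + 1 and integer division:

Eratosthenes:  k_start = (n * n) / 2 = (4*i*i + 4*i + 1) / 2 = 2*i*i + 2*i   (the trailing 1/2 is dropped)
Sundaram:      k_start =  n * i + i  = (2*i + 1) * i + i     = 2*i*i + 2*i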

P.S.: after seeing overdraw in the code so many times, people will probably want to know what it actually is... Sieving the numbers up to 2^32-1 (i.e. a full 2^31 bit bitmap) Sundaram's has an overdraw of 8,643,678,027 (roughly 2 * 2^32) or 444.6%; with the small fix that turns it into odds-only Eratosthenes the overdraw becomes 2,610,022,328 or 134.2%.

DarthGizka

The sieve that avoids calculating redundant composites is the sieve of Marouane; check it out, it may be faster than the Sundaram sieve.