18

If you need to generate primes from 1 to N, the "dumb" way to do it would be to iterate through all the numbers from 2 to N and check whether each number is divisible by any prime number found so far which does not exceed the square root of the number in question.

As I see it, the Sieve of Eratosthenes does the same, just the other way round - when it finds a prime P, it marks off all the numbers that are multiples of P.

But whether you mark off X when you find P, or you check whether X is divisible by P, the fundamental complexity, the big-O, stays the same. You still do one constant-time operation per number-prime pair. In fact, the dumb algorithm breaks off as soon as it finds a prime divisor, but the Sieve of Eratosthenes marks each number several times - once for every prime it is divisible by. That's a minimum of twice as many operations for every number except primes.
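For concreteness, here are rough Python sketches (just for illustration) of the two approaches I mean:

def primes_by_trial_division(n):
    # "dumb" way: test each candidate against the primes found so far,
    # stopping at the candidate's square root or at the first divisor
    primes = []
    for candidate in range(2, n + 1):
        is_prime = True
        for p in primes:
            if p * p > candidate:
                break
            if candidate % p == 0:
                is_prime = False
                break
        if is_prime:
            primes.append(candidate)
    return primes

def primes_by_sieve(n):
    # Sieve of Eratosthenes: whenever a prime is found, mark off its multiples
    is_composite = [False] * (n + 1)
    primes = []
    for p in range(2, n + 1):
        if not is_composite[p]:
            primes.append(p)
            for multiple in range(2 * p, n + 1, p):
                is_composite[multiple] = True
    return primes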

Am I misunderstanding something here?

Vilx-
  • 104,512
  • 87
  • 279
  • 422

8 Answers

19

In the trial division algorithm, the most work that may be needed to determine whether a number n is prime is testing divisibility by the primes up to about sqrt(n).

That worst case is met when n is a prime or the product of two primes of nearly the same size (including squares of primes). If n has more than two prime factors, or two prime factors of very different size, at least one of them is much smaller than sqrt(n), so even the accumulated work needed for all these numbers (which form the vast majority of all numbers up to N, for sufficiently large N) is relatively insignificant. I shall ignore that and work with the fiction that composite numbers are determined without doing any work (the products of two approximately equal primes are few in number, so although individually they cost as much as a prime of similar size, altogether that's a negligible amount of work).

So, how much work does the testing of the primes up to N take?

By the prime number theorem, the number of primes <= n is (for n sufficiently large) about n/log n (it's n/log n + lower order terms). Conversely, that means the k-th prime is (for k not too small) about k*log k (+ lower order terms).

Hence, testing the k-th prime requires trial division by pi(sqrt(p_k)), approximately 2*sqrt(k/log k), primes. Summing that for k <= pi(N) ~ N/log N yields roughly 4/3*N^(3/2)/(log N)^2 divisions in total. So by ignoring the composites, we found that finding the primes up to N by trial division (using only primes), is Omega(N^1.5 / (log N)^2). Closer analysis of the composites reveals that it's Theta(N^1.5 / (log N)^2). Using a wheel reduces the constant factors, but doesn't change the complexity.
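Spelled out, that summation is, as a rough back-of-the-envelope estimate,

$$\sum_{k \le K} 2\sqrt{\frac{k}{\log k}} \;\approx\; \frac{4}{3}\,\frac{K^{3/2}}{\sqrt{\log K}}, \qquad K = \pi(N) \approx \frac{N}{\log N},$$

and substituting K gives about 4/3*N^(3/2)/(log N)^2, the figure above.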

In the sieve, on the other hand, each composite is crossed off as a multiple of at least one prime. Depending on whether you start crossing off at 2*p or at p*p, a composite is crossed off as many times as it has distinct prime factors, or distinct prime factors <= sqrt(n). Since any number has at most one prime factor exceeding sqrt(n), the difference isn't so large. It has no influence on the complexity, but there are a lot of numbers with only two prime factors (or three, with one larger than sqrt(n)), so it makes a noticeable difference in running time. Anyhow, a number n > 0 has only a few distinct prime factors; a trivial estimate shows that the number of distinct prime factors is bounded by lg n (base-2 logarithm), so an upper bound for the number of crossings-off the sieve does is N*lg N.

By counting not how often each composite gets crossed off, but how many multiples of each prime are crossed off, as IVlad already did, one easily finds that the number of crossings-off is in fact Theta(N*log log N). Again, using a wheel doesn't change the complexity but reduces the constant factors. However, here it has a larger influence than for the trial division, so at least skipping the evens should be done (apart from reducing the work, it also reduces storage size, so improves cache locality).

So, even disregarding that division is more expensive than addition and multiplication, we see that the number of operations the sieve requires is much smaller than the number of operations required by trial division (if the limit is not too small).
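A rough Python sketch for checking those counts empirically (counting one unit of work per trial division and per crossing-off, with crossing-off starting at p*p):

def count_trial_divisions(limit):
    # trial division by the primes found so far, stopping at sqrt(n)
    # or at the first prime divisor
    primes, divisions = [], 0
    for n in range(2, limit + 1):
        is_prime = True
        for p in primes:
            if p * p > n:
                break
            divisions += 1
            if n % p == 0:
                is_prime = False
                break
        if is_prime:
            primes.append(n)
    return divisions

def count_sieve_crossings(limit):
    # crossing off starts at p*p, as discussed above
    crossings = 0
    flags = [True] * (limit + 1)
    for p in range(2, limit + 1):
        if flags[p]:
            for multiple in range(p * p, limit + 1, p):
                flags[multiple] = False
                crossings += 1
    return crossings

for limit in (10**4, 10**5, 10**6):
    print(limit, count_trial_divisions(limit), count_sieve_crossings(limit))

The ratio between the two counts keeps growing with the limit, in line with Theta(N^1.5 / (log N)^2) versus Theta(N*log log N).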

Summarising:
Trial division does futile work by dividing primes; the sieve does futile work by repeatedly crossing off composites. There are relatively few primes, but many composites, so one might be tempted to think trial division wastes less work.
But: composites have only a few distinct prime factors, while there are many primes below sqrt(p).

Daniel Fischer
  • 181,706
  • 17
  • 308
  • 431
11

In the naive method, you do O(sqrt(num)) operations for each number num you check for primality. This is O(n*sqrt(n)) total.

In the sieve method, for each prime you find you do n / 2 operations when marking the multiples of 2, n / 3 when marking those of 3, n / 5 when marking those of 5 etc. This is n*(1/2 + 1/3 + 1/5 + 1/7 + ...), which is O(n log log n). See here for that result.
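A rough Python sketch of that count (comparing the actual number of markings with n*log log n):

import math

def sieve_operations(n):
    # total markings: roughly n/2 + n/3 + n/5 + ... over the primes <= n,
    # i.e. n times the sum of the reciprocals of those primes
    flags = [True] * (n + 1)
    total = 0
    for p in range(2, n + 1):
        if flags[p]:
            total += n // p
            for m in range(2 * p, n + 1, p):
                flags[m] = False
    return total

for n in (10**4, 10**5, 10**6):
    # by Mertens' second theorem the sum of 1/p is about log log n + 0.26,
    # so the two columns track each other as n grows
    print(n, sieve_operations(n), int(n * math.log(math.log(n))))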

So the asymptotic complexity is not the same, like you said. Even a naive sieve will beat the naive prime-generation method pretty fast. Optimized versions of the sieve can get much faster, but the big-oh remains unchanged.

The two are not equivalent the way you say they are. In the naive prime-generation algorithm, you check every number for divisibility by the same primes 2, 3, 5, 7, ..., and as you progress you check each number against more and more of them as you approach your n. For the sieve, each prime does less and less marking as you approach n: first you cross off in increments of 2, then of 3, then 5 and so on. The marking hits n and stops much faster.

IVlad
  • 43,099
  • 13
  • 111
  • 179
  • No, it's not `O(sqrt(num))`, because I don't check every number - just the primes that have been found so far. Read carefully. – Vilx- Apr 13 '11 at 12:43
  • @Vilx - for each number `num`, you must check if it's divisible by any of your current primes that are `<= sqrt(num)`. There are roughly `x / log(x) = O(x)` primes `<= x`. That means you do `sqrt(num) / log(sqrt(num)) = O(sqrt(num))` operations for each number. Think about it. You might have a favorable constant (`< 1`) behind that big-oh, but it's still bad as `n` grows. – IVlad Apr 13 '11 at 12:45
  • If, as you say, you only check the primes found so far (ALL of them), then your algorithm is actually `O(n^2)` :). You need to check all the primes found so far that are `<= sqrt(num)`. Then it's `O(n sqrt(n))` like I said. – IVlad Apr 13 '11 at 12:49
  • 1
    I'm pretty sure that `O(x/log x) < O(x)`. – Gabe Apr 13 '11 at 13:05
  • @Gabe - `O(x)` is an upper bound for `x / log x`. Do you know a tighter bound? It's not `O(log x)`, because the limit of `log x / (x / log x)` as `x` approaches infinity is 0. Even if it were `O(log x)`, that's still worse than the sieve. If you have a better bound, I'll edit it in, but my point still stands either way. You could write it as `O(x / log x)` if you wanted, but it doesn't really tell you more. – IVlad Apr 13 '11 at 13:20
  • 2
    @IVlad when comparing algorithms, it's not fair to round one of their time complexities up to a convenient upper-bound, while keeping the other one's bound tight. The naive method takes O(n*sqrt(n)/log(n)). Your overall point is correct, since O(n log log n) is better than O(n*sqrt(n)/log n), which is easy to see by plotting (sqrt(n)/log n) / log log n. – Daniel Stutzbach May 24 '11 at 13:27
6

Because with the sieve method, you stop marking multiples of the running primes when the running prime reaches the square root of N.

Say, you want to find all primes less than a million.

First you set an array

for i = 2 to 1000000
  primetest[i] = true

Then you iterate

for j=2 to 1000         <--- 1000 is the square root of 1000000
  if primetest[j]                                    <--- if j is prime
    ---mark all multiples of j (except j itself) as "not a prime"
    for k = j^2 to 1000000 step j
      primetest[k] = false

You don't have to check j after 1000, because j*j will be more than a million. And you start from j*j (you don't have to mark multiples of j less than j^2 because they are already marked as multiples of previously found, smaller primes)

So, in the end you have done the loop 1000 times and the if part only for those j's that are primes.

The second reason is that with the sieve, you only do multiplication, not division. If you do it cleverly, you only do addition, not even multiplication.

And division has larger complexity than addition. The usual (schoolbook) way to do division takes O(d^2) time in the number of digits d, while addition takes O(d).
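A runnable Python version of the pseudocode above, with the inner loop doing nothing but addition:

LIMIT = 1000000

primetest = [True] * (LIMIT + 1)
primetest[0] = primetest[1] = False

j = 2
while j * j <= LIMIT:              # no need to check j past sqrt(LIMIT) = 1000
    if primetest[j]:               # j is prime
        k = j * j                  # start at j^2; smaller multiples are already marked
        while k <= LIMIT:
            primetest[k] = False
            k += j                 # stepping by j: addition only, no division
    j += 1

primes = [i for i in range(2, LIMIT + 1) if primetest[i]]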

ypercubeᵀᴹ
  • 113,259
  • 19
  • 174
  • 235
  • Eh? That's usable when testing N for primality, not when generating all primes smaller than N. The Sieve of Eratosthenes is primarily a prime generator, not a prime tester. – Vatine Apr 13 '11 at 12:26
  • @Vatine - no, it works for the sieve too. You only need to mark multiples of primes `<= sqrt (n)` – IVlad Apr 13 '11 at 12:30
5

Explained in this paper by Melissa O'Neill, "The Genuine Sieve of Eratosthenes": http://www.cs.hmc.edu/~oneill/papers/Sieve-JFP.pdf

I think it's quite readable even without Haskell knowledge.

Landei
  • 54,104
  • 13
  • 100
  • 195
4

The first difference is that division is much more expensive than addition. Even if each number is 'marked' several times, the total cost is trivial compared with the huge number of divisions needed for the 'dumb' algorithm.

Javier
  • 60,510
  • 8
  • 78
  • 126
  • Still, as we know from the teachings of the Big O - with large enough N this difference will become unimportant. – Vilx- Apr 13 '11 at 12:46
  • @Vilx: No, it won't because division has larger complexity than addition. – ypercubeᵀᴹ Apr 13 '11 at 13:27
  • While I agree that division does take longer than addition (though I'm not sure by how much on today's CPUs), don't you think that both operations are constant-time relative to the algorithms in discussion? – Vilx- Apr 13 '11 at 13:43
  • 3
    This has nothing to do with it, both division and addition are done in constant time (assuming fixed-width ints). – IVlad Apr 13 '11 at 13:53
  • @IVlad: @Vilx: constant time for small numbers that fit into 32 or 64bit integers. When you want to find bigger primes (or **when you want to do big O analysis**), no it's not constant time... – ypercubeᵀᴹ Apr 13 '11 at 14:40
  • 2
    @ypercube - I said assuming fixed-width ints. There's nothing stopping us from putting constant upper bounds on things while doing big O analysis. So yes, it is constant under these circumstances. – IVlad Apr 13 '11 at 14:58
  • 1
    @IVlad: I stand by my point: **You can't have constant upper bound on the size of numbers and do big O analysis at the same time**. If you do insist on that, why don't you change your answer to: "Both ways have O(1) complexity" ? – ypercubeᵀᴹ Apr 13 '11 at 15:17
  • @ypercube - I am only putting a constant upper bound on the number of bits needed to represent the numbers used in the algorithms. I am putting no bound on `n`, which is the actual input size that I care about in the big O. So you can't say they're `O(1)`, because `n` isn't a constant. Haven't you ever solved limits such as `lim of c1*f(x) / c2*g(x) when x -> infinity and c1, c2 real constants`? Just because some things that appear in the limit are constants, doesn't mean everything is. It's the same for big O. – IVlad Apr 13 '11 at 15:26
  • @IVlad: Yes, I understand wht you are trying to say. But in this special problem we are dealing with, consider this: You say you have a bound on the number of bits needed to represent the numbers used in the algorithms. And no bound on `n` itself. But one of the numbers used in the algorithm is `sqrt(n)` which has half the bits of `n`. How do you bypass that? – ypercubeᵀᴹ Apr 13 '11 at 15:35
  • 3
@ypercube - ah, I see what you mean now. The problem is that `n` is also part of the numbers I was limiting, so by limiting the numbers I'm implicitly limiting `n` as well. – IVlad Apr 13 '11 at 17:02
  • 2
    If you're doing a Sieve on numbers larger than 2^64, well, good luck. – Nick Johnson Apr 14 '11 at 00:24
  • 1
    Integer division and addition are same speed on modern CPUs. Regarding larger than 64 bits: Sure, division will be slower than addition THEN, but you would never have enough RAM or even disk space to house a sieve anyway with numbers that large, for comparison. It turns out that in practice, brute-force division in a tight loop is faster than sieve creation (as long as you only test-divide by primes up to the square root), even for an extremely optimized sieve using packed bitmasks. What kills you on speed with the sieve is memory latency. – Todd Lehman Dec 20 '12 at 03:24
  • 1
    @Todd: Integer division is about 25 times slower than addition on current Intel/AMD CPUs. See http://agner.org/optimize/ for instruction latency and throughput. Worst case for Intel Haswell, `div` (unsigned) 128b/64b->64b can have a throughput as slow as one per 74 cycles (with up to 96cycle latency). Scalar int `add` (64b+64b->64b) has a throughput of 4 per clock cycle, with 1 cycle latency. `div r32` (64b/32b->32b) is a lot less bad: one per 9 to 11 cycle throughput, 22 to 29c latency. **You might still be right about sieve larger than cache losing, though. A cache miss costs > 100c**. – Peter Cordes Aug 29 '15 at 20:48
  • 1
    @PeterCordes — Thanks for the correction. Wow, I was so naive when I made that statement (Dec 2012). I know now that integer division is insanely slow, having carefully benchmarked it on a bunch of processors (ARM A5 through A8 and also Intel i5 & i7). A couple of interesting tricks: (1) 2-adic inverse multiplication is ~10x faster than division for doing divisor tests. (2) Divisor testing using floating-point division/floor can be up to 10x faster as well on 32-bit ARM devices, when dividing an n-bit integer by an m-bit integer, where 0 < m,n <= 53. – Todd Lehman Aug 29 '15 at 23:31
  • 1
    @ToddLehman: Intel has faster FP div than integer div, too, esp. on Skylake. You could maybe speed up trial-division with vector FP. Broadcast your number to test into all elements, then you can divide/floor/mul/sub/test 2, 4, or 8 values in parallel. (128b vectors of double up to 256b vectors float). The mul/sub step can be done with an FMA instruction. http://users.atw.hu/instlatx64/GenuineIntel00506E3_Skylake_InstLatX64.txt says skylake can vdivps (packed single) one per 5c, or vdivpd (packed double) one per 8c. This is significantly more heavily pipelined than on Haswell, for 256b. – Peter Cordes Aug 29 '15 at 23:41
  • 1
    @PeterCordes — Interesting about the vector idea. I find that a cache-friendly memory-mapped table of 2-adic multiplicative inverses of primes (64-bit multiplier + 64-bit comparator) is wicked fast compared to integer or floating-point division by 32-bit primes (when benchmarked as a black box test of trial division to factor 64-bit unsigned integers), and it only needs to run up to the cube root of the number before breaking out SQUFOF or Hart's OLF or Brent-Pollard Rho. I wonder how much faster inverse multiplication would be as a vector operation, since the bottleneck is RAM access. – Todd Lehman Aug 30 '15 at 00:52
  • 1
    @ToddLehman: Yeah, vector multiplication should be excellent. If you have an array of multiplicative-inverses and a separate array of comparators, you can load a vector of them with no shuffling. You can probably get something like 1 packed multiply, compare, and branch-if-any-are-equal per clock, for 128b or 256b vectors, assuming the data is coming from L1 cache... If not, generating the multiplicative inverses on the fly might win, if there's some compressed way to store needed info. The biggest hitch is that SSE/AVX2 only has 32x32->64, or 32x32->32 multiplies, not packed 64bit int. – Peter Cordes Aug 30 '15 at 01:52
  • 1
    @ToddLehman: oops, misread the docs. `PMULUDQ` is a regular 64bit unsigned multiply of packed vector elements. The 64bit results only depends on the low32 of the source elements. It's in SSE2, which is baseline for 64bit CPUs. – Peter Cordes Aug 30 '15 at 02:00
  • 1
    If cache / mem BW is the limit, maybe a method where you don't also have to store a comparator will be best. e.g. FP division, where you check if you found an exact divisor by doing the multiplication. (maybe in the integer domain). – Peter Cordes Aug 30 '15 at 02:06
0

A "naive" Sieve of Eratosthenes will mark non-prime numbers multiple times. But, if you have your numbers on a linked list and remove numbers taht are multiples (you will still need to walk the remainder of the list), the work left to do after finding a prime is always smaller than it was before finding the prime.

Vatine
  • 20,782
  • 4
  • 54
  • 70
  • 5
    Having the numbers in a linked list is a horrible idea. Linked lists don't allow random access, which is crucial for the sieve's performance. – IVlad Apr 13 '11 at 12:27
  • You take the head of the linked list, this is your next prime, you then traverse the whole remaining linked list, deleting all values that are a multiple of your prime (you can simply add the prime to an accumulator, if you want to skip trial division and you probably want to do that). Then, as long as you have a linked list, you take the head of the list as your prime, ... Quick trial implementation and some benchmarking seems to indicate it is no worse than using an array being marked off. – Vatine Apr 13 '11 at 12:58
  • 4
    You're comparing it with the naive sieve implementation. Why use an int (you have to store the actual number) and a pointer when all you need is a single bit to tell you if the number corresponding to that bit is prime or not? If you use arrays, you only need a single bit for each number. How does your solution scale if I want to add a wheel (described in @Landei's link. Basically, if I want to start from 7 and ignore multiples of 2, 3 and 5). Not to mention the overhead of using lists. Besides, your answer implies that using linked lists is better than the "naive" sieve, which isn't true. – IVlad Apr 13 '11 at 13:05
  • 1
    @IVlad another way to say this, is that the proper S. of E. is like no-duplicates pigeonhole sort: we are able to directly place a number into an array right into its proper place, using the number as an address. If numbers are actually removed on each step, direct addressing is no longer possible. – Will Ness Aug 07 '12 at 19:43
  • 1
    @IVlad: Actually, random access is not crucial for a sieve's performance. In fact, random access KILLS a sieve's performance. Random access to RAM kills performance in general, as it kills cache coherency. Sequential access to an array is orders of magnitude faster than random access. – Todd Lehman Dec 20 '12 at 03:28
  • 2
    @ToddLehman - in data structures, random access means the ability to access an entry in constant time, which is needed for an efficient sieve. The access itself is sequential in that it is predictable at each step. – IVlad Dec 20 '12 at 08:36
  • @IVlad — Sure, that's what it means in theory. But in practice, you cannot access an array in memory in constant time unless it is extremely small. When filling a sieve, what you want to do is break it into segments that each fit into the processor's L1 cache (for example, 16KB segments). Within the L1 cache area, you can access any array element in constant time, but when you leave that area, you pay a penalty for swapping in new area. What you absolutely do not want to do is sweep through the entire array for every prime being marked off. That kills performance. – Todd Lehman Dec 20 '12 at 17:10
  • 1
    @ToddLehman: How about, rather than marking off every multiple of each found prime P, one reads each element of the array I in reverse order, starting at floor(N/P), and only marks element I*P when I is potentially prime? As P gets larger, the number of elements that would need to be read would shrink until it could fit in the L1 cache, and the density of elements that would need to be marked would diminish. – supercat Feb 19 '15 at 21:12
0

http://en.wikipedia.org/wiki/Prime_number#Number_of_prime_numbers_below_a_given_number

  • the "dumb" algorithm does i/log(i) ~= N/log(N) work for each prime number
  • the real algorithm does N/i ~= 1 work for each prime number

Multiply by roughly N/log(N) prime numbers.

ninjagecko
  • 88,546
  • 24
  • 137
  • 145
  • 1
    I doubt it. If, say, 640000000 is divisible by 2, there's no need to check the rest, so i/log(i) must be a loose bound here. – zw324 Jul 14 '11 at 04:26
  • @Ziyao: hmm, very insightful! I now have a gut feeling that the brunt of the work in the "dumb" algorithm must come from confirming that a prime is a prime. I have rewritten my answer based on this insight. Yes I am using quite terrible bounds, if you have better ones please suggest. – ninjagecko Jul 14 '11 at 04:56
0

There was a time when I was trying to find an efficient way of computing the sum of the primes less than x.

I decided to use an N by N square table and to check only the numbers whose unit digit is in [1, 3, 7, 9]. The method of Eratosthenes made that a little easier. Here is how:

  1. Say you want to know whether N is prime or not.

    You start looking for a factorization. As the divisor you try grows, the quotient shrinks, so the two meet around K = int(sqrt(N)): if you divide N by a number close to K, the quotient is also close to K.

  2. Now suppose some u < K divides N. If u is not prime, one of its prime factors, say V, is obviously smaller than u (V < u), and V also divides N.

    So why not check whether N is prime by DIVIDING N ONLY BY THE PRIMES LESS THAN K = int(sqrt(N))? The loop then executes π(√N) times. This is where the brilliant idea of Eratosthenes starts to take shape and gives you the intuition behind it all.

  3. By the way, using the Sieve of Eratosthenes one can find the sum of the primes less than a multiple of 10,

because for a given column you only need to check the unit digits [1, 3, 7, 9] and count how many times each unit digit repeats.

Being new to the Stack Overflow community, I would welcome suggestions if anything here is wrong.