Some rabid C++ programmer claims his program only takes 8 seconds.
Is that wall-clock time or CPU time?
If wall-clock, and the task is split across 100 CPUs, say, it's not very impressive (it's decent), if split across 1000, it's pitiful.
If it's CPU time:
I'm pretty sure that time is not reached by actually sieving up to 1011.
With a few more than 4×109 primes until then, assuming a somewhat normal 2-3GHz CPU, you'd have 4-6 cycles per prime.
You cannot achieve that with a sieve of Eratosthenes, nor with a sieve of Atkin. Each prime has to be inspected and counted, each composite marked as such and inspected. That gives a theoretical lower bound of two cycles per number in the sieve, not counting e.g. array initialisation, loop bound checking, loop variable updates, redundant markings. You're not going to come near that theoretical bound.
A few data points:
Daniel Bernstein's primegen (Sieve of Atkin), with the sieving blocks adjusted to take full advantage of my 32KB L1-cache, takes 90 seconds to sieve the primes to 1011 and count them (234 seconds with the default sieve-block size of 8K words) on my Core i5 2410M (2.3GHz). (It's much optimised for the range up to 232, but above that, it becomes noticeably slower, for the limit 109 the times are 0.49 resp 0.64 seconds.)
My segmented Sieve of Eratosthenes, using some not exposed internals to avoid list creation, sieves and counts to 1011 in 340 seconds (sniff :-/, but hey, for 109 it took 2.92 seconds - it's getting closer, and somewhere between 1012 and 1013 it overtakes primegen :) Using the exposed interface creating a list of primes roughly doubles the time taken, as does compiling it with a 32-bit GHC.
So I'd wager that the reported time of 8 seconds - if CPU time - is, if correct, for an algorithm counting the number of primes without actually sieving the whole way. As indicated by applicative's answer, that can be done much faster.
dafis@schwartz:~/Haskell/Repos/arithmoi> time tests/primeCount 100000000000
4118054813
real 0m0.145s
user 0m0.139s
sys 0m0.006s
Note that 10^11 is too big for 32-bit arithmetic. Also, 10^11 bits = 12.5 GB, far too much data to fit into Haskell's 32-bit address space. So you can't have the entire bitmap in memory at once.
To sieve that range, you have to use a segmented sieve. Even if you're not restricted by a 32-bit address space, using such a large array will yield abysmal performance due to frequent cache misses. Your programme will spend most of its time to wait for data being transferred from main memory. Sieve in chunks that fit in your L2-cache (I haven't succeeded in trying to make it faster by making the sieve fit in L1, I guess the overhead of the GHC runtime is too large to make it work).
Also, eliminate the multiples of some small primes from the sieve, that reduces the needed work, and additionally improves performance by making the sieve smaller. Eliminating even numbers is trivial, multiples of 3 easy, multiples of 5 not very difficult.
Finally, note that the number of primes less than 10^11 is just a shade less than 2^32, so there's no way you can store the actual integers all at once either.
If you store the sieve as a list of bit-arrays, withe multiples of 2, 3 and 5 removed, you need about 3.3GB to store the chunks, so if you really can have up to 4GB, it would fit. But you should rather let the chunks you don't need anymore be garbage-collected immediately.
(In case it matters, my current implementation uses IOUArray Integer Bool and multiple threads to process independent subranges of the search space. Currently it takes several seconds to remove all the multiples of 2 from a 10MB array chunk...)
It does matter.
- Use
Int
for the indices and unsafeRead
/unsafeWrite
to read and modify the array. Integer
computations are much slower than Int
computations, and the bounds-checking you get with readArray
/writeArray
really hurts.
- 10MB chunks are far too large, you lose cache-locality. Use chunks of a few hundred KB at most (L2 cache minus some space for other things needed).
- Still, it shouldn't take several seconds to remove multiples of 2 even with
Integer
indices, bounds-checking and 10MB chunks. Can we see your code?
Post-vacation update:
Eight minutes to sieve the primes up to 1011 is possible without deep wizardry. I don't see how going from one to four cores could yield an eightfold speedup, since there should be no cache-effects here, but whatever, it may be, without seeing the code, I can't investigate.
So let's take a look at your code.
First, an incorrectness:
vs <-
mapM
(\ start -> do
let block = (start, start + block_size)
v <- newEmptyMVar
writeChan queue $ do
say $ "New chunk " ++ show block
chunk <- chunk_new block
sieve_top base chunk
c <- chunk_count chunk
putMVar v c
return v
)
(takeWhile (< target) $ iterate (+block_size) base_max)
The numbers base_max + k*block_size
appear in two chunks each, if any of them is prime, that prime is counted twice, also you should cap the upper bound at target
.
Now to the performance aspect:
One thing that jumps out is that it's real chatty, so chatty that it's measurable once you have adjusted the block_size
to the cache (I took 256KB blocks for a 512KB L2 cache) - then the threads are slowed down by fighting for stdout
for the if prime < 100 then say $ "Sieve " ++ show prime else return ()
message.
Let's look at your (silenced) sieving loop:
chunk_sieve :: Chunk -> Integer -> IO ()
chunk_sieve array prime = do
(min, max) <- getBounds array
let n0 = min `mod` prime
let n1 = if n0 == 0 then min else min - n0 + prime
mapM_
(\ n -> writeArray array n (n == prime))
(takeWhile (<= max) $ iterate (+prime) n1)
One thing that costs time is that each index is compared to the prime whose multiples are marked off. Each single comparison is cheap (though considerably more expensive than an Int
comparison), but the huge number of comparisons, only one of which may yield True
, adds up. Unconditionally writing False
and if necessary writing True
at the prime's index after the loop yields a considerable speedup.
For timing purposes I've reduced the target to 109 and ran it on two cores. The original code took 155s (elapsed, 292s user), with the reduced block_size
148s, silenced 143s. Omitting the comparison,
mapM_
(\ n -> writeArray array n False)
(takeWhile (<= max) $ iterate (+prime) n1)
when (min <= prime && prime <= max) $ writeArray array prime True
it runs in 131s.
Now it's time for some bigger speedups. Did I already mention that bounds-checking costs a lot of time? Since the loop condition guarantees that no out-of-bounds access is attempted (and the primes are small enough that no Int
-overflow can happen), we should really use the unchecked access:
chunk_sieve :: Chunk -> Integer -> IO ()
chunk_sieve array prime = do
(min, max) <- getBounds array
let n0 = min `mod` prime
n1 = if n0 == 0 then min else min - n0 + prime
n2 = fromInteger (n1 - min)
mx = fromInteger (max - min)
pr = fromInteger prime
mapM_
(\ n -> unsafeWrite array n False)
(takeWhile (<= mx) $ iterate (+pr) n2)
when (min <= prime && prime <= max) $ writeArray array prime True
which reduces the running time to 96s. Much better, but still abysmal. The culprit is
takeWhile (<= mx) $ iterate (+pr) n2
GHC can't fuse that composition well, and you get a list of boxed Int
s that is traversed. Replace that with an arithmetic sequence, [n2, n2+pr .. mx]
and GHC happily creates a loop using unboxed Int#
s, 37 seconds.
Much much better, but still bad. The biggest time-consumer now is
chunk_count :: Chunk -> IO Integer
chunk_count array = do
(min, max) <- getBounds array
work min max 0
where
work i max count = do
b <- readArray array i
let count' = count + if b then 1 else 0
evaluate count'
let i' = i+1
if i' > max
then return count'
else work i' max count'
Again, the bounds-checking costs a lot of time. With
chunk_count :: Chunk -> IO Integer
chunk_count array = do
(min, max) <- getBounds array
work 0 (fromInteger (max-min)) 0
where
work i max count = do
b <- unsafeRead array i
let count' = count + if b then 1 else 0
evaluate count'
let i' = i+1
if i' > max
then return count'
else work i' max count'
we're down to 15 seconds. Now, evaluate count'
is a somewhat expensive way to make work
strict in count
. Using else work i' max $! count'
in the last line instead of evaluate
reduces the running time to 13 seconds. Defining work
in a more suitable (for GHC, at least) way,
chunk_count :: Chunk -> IO Integer
chunk_count array = do
(min, max) <- getBounds array
let mx = fromInteger (max-min)
work i !ct
| mx < i = return ct
| otherwise = do
b <- unsafeRead array i
work (i+1) (if b then ct+1 else ct)
work 0 0
brings the time down to 6.55 seconds. Now we're in a situation where say $ "New chunk " ++ show block
makes a measurable difference, disabling that gets us down to 6.18 seconds.
However, counting set bits by reading a byte from the array, masking off the undesired bits and comparing to 0 for each individual bit is not the most efficient way. It's faster to read entire Word
s from the array (via castIOUArray) and use popCount, if "you know what you're doing...", that gets us down to 4.25 seconds; stopping the marking when the square of the prime becomes larger than the upper bound of the chunk
sieve_top :: Chunk -> Chunk -> IO ()
sieve_top base chunk = work 2
where
work prime = do
chunk_sieve chunk prime
mp <- chunk_next_prime base prime
case mp of
Nothing -> return ()
Just p' -> do
(_,mx) <- getBounds chunk
when (p'*p' <= mx) $ work p'
to 3.9 seconds. Still not spectacular, but considering where we started, not bad. Just to illustrate the importance of cache locality once other bad behaviour has been reduced: the same code with the original 10MB block size takes 8.5 seconds.
Another small problem in your code is that all threads use the same mutable array of small primes for sieving. Since it is mutable, access to that must be synchronised, which adds a bit of overhead. With only two threads, the overhead isn't too big, using an immutable copy to do the sieving only reduces the time to 3.75 seconds here, but I expect that the effect would be larger for more threads. (I have only two physical cores - with hyperthreading - so using more than two threads doing the same kind of work introduces a slowdown that may invalidate conclusions drawn from that, but using four threads, I get 4.55 seconds with the immutable array versus 5.3 seconds with the mutable array. That seems to corroborate the growing synchronisation overhead.)
There's still a bit to be gained by eliminating more Integer
calculations and writing code for GHC's optimiser (more worker/wrapper transformations, some static argument transformations), but not very much, maybe 10-15%.
The next big improvement is to be obtained by eliminating even numbers from the sieve. That reduces the work, allocation and running time by more than half. No prime sieve should ever consider even numbers, really, that's just a pointless waste of work.