How to optimize this Haskell code summing up the primes in sublinear time?

Question

Problem 10 from Project Euler is to find the sum of all the primes below given n.

I solved it simply by summing up the primes generated by the sieve of Eratosthenes. Then I came across much more efficient solution by Lucy_Hedgehog (sub-linear!).

For n = 2⋅10^9:

Python code (from the quote above) runs in 1.2 seconds in Python 2.7.3.
C++ code (mine) runs in about 0.3 seconds (compiled with g++ 4.8.4).

I re-implemented the same algorithm in Haskell, since I'm learning it:

import Data.List

import Data.Map (Map, (!))
import qualified Data.Map as Map

problem10 :: Integer -> Integer
problem10 n = (sieve (Map.fromList [(i, i * (i + 1) `div` 2 - 1) | i <- vs]) 2 r vs) ! n
              where vs = [n `div` i | i <- [1..r]] ++ reverse [1..n `div` r - 1]
                    r  = floor (sqrt (fromIntegral n))

sieve :: Map Integer Integer -> Integer -> Integer -> [Integer] -> Map Integer Integer
sieve m p r vs | p > r     = m
               | otherwise = sieve (if m ! p > m ! (p - 1) then update m vs p else m) (p + 1) r vs

update :: Map Integer Integer -> [Integer] -> Integer -> Map Integer Integer
update m vs p = foldl' decrease m (map (\v -> (v, sumOfSieved m v p)) (takeWhile (>= p*p) vs))

decrease :: Map Integer Integer -> (Integer, Integer) -> Map Integer Integer
decrease m (k, v) = Map.insertWith (flip (-)) k v m

sumOfSieved :: Map Integer Integer -> Integer -> Integer -> Integer
sumOfSieved m v p = p * (m ! (v `div` p) - m ! (p - 1))

main = print $ problem10 $ 2*10^9

I compiled it with ghc -O2 10.hs and run with time ./10.

It gives the correct answer, but takes about 7 seconds.

I compiled it with ghc -prof -fprof-auto -rtsopts 10 and run with ./10 +RTS -p -h.

10.prof shows that decrease takes 52.2% time and 67.5% allocations.

After running hp2ps 10.hp I got such heap profile:

Again looks like decrease takes most of the heap. GHC version 7.6.3.

How would you optimize run time of this Haskell code?

Update 13.06.17:

I tried replacing immutable Data.Map with mutable Data.HashTable.IO.BasicHashTable from the hashtables package, but I'm probably doing something bad, since for tiny n = 30 it already takes too long, about 10 seconds. What's wrong?

Update 18.06.17:

Curious about the HashTable performance issues is a good read. I took Sherh's code using mutable Data.HashTable.ST.Linear, but dropped Data.Judy in instead. It runs in 1.1 seconds, still relatively slow.

Could you please split the algorithm up to reasonably sized top level functions which can be reasoned about (seing types, etc. helps...) ? Thanks. — Centril, Jun 08 '17 at 17:06
@Centril I factored out `sieve`, `update`, `decrease` and `sumOfsieved`, hope it helps. — Adam Stelmaszczyk, Jun 08 '17 at 17:27
How much of your algorithm is dominated by modifying Maps, and how much by reading? If you read a lot, then using http://hackage.haskell.org/package/vector-0.12.0.1/docs/Data-Vector-Unboxed.html is great for numeric code. There's also http://hackage.haskell.org/package/array — Centril, Jun 08 '17 at 17:32
@ThomasM.DuBuisson Thank you, I also changed `Integer` to `Int`, code [here](https://gist.github.com/AdamStelmaszczyk/52e3cf83322ac5f88aeb9e18dd068348). It brought run-time down to 5 seconds. — Adam Stelmaszczyk, Jun 08 '17 at 17:51
I wonder if using `Data.IntMap.Strict` would decrease the number of allocations; I suspect that subtraction in `insertWith` mostly generates thunks. — 9000, Jun 08 '17 at 18:04
@Centril There are 2 loops there, outer one iterating through possible primes p and inner one iterating through sufficient "check-point" integers for DP. It may be easier to look into Python [code](https://math.stackexchange.com/questions/1378286/find-the-sum-of-all-primes-smaller-than-a-big-number/2283829#2283829) to see that. In the outer loop there are 5 reads (from keys `p`, `p - 1`, `p - 1` again, `v`, `v // p`) and 1 write, in the inner loop there 3 reads and 1 write. Thanks for the links, I'm looking into them. — Adam Stelmaszczyk, Jun 08 '17 at 18:05

Shersh · Accepted Answer · 2017-06-19T13:36:19.657

7

I've done some small improvements so it runs in 3.4-3.5 seconds on my machine. Using IntMap.Strict helped a lot. Other than that I just manually performed some ghc optimizations just to be sure. And make Haskell code more close to Python code from your link. As a next step you could try to use some mutable HashMap. But I'm not sure... IntMap can't be much faster than some mutable container because it's an immutable one. Though I'm still surprised about it's efficiency. I hope this can be implemented faster.

Here is the code:

import Data.List (foldl')
import Data.IntMap.Strict (IntMap, (!))
import qualified Data.IntMap.Strict as IntMap

p :: Int -> Int
p n = (sieve (IntMap.fromList [(i, i * (i + 1) `div` 2 - 1) | i <- vs]) 2 r vs) ! n
               where vs = [n `div` i | i <- [1..r]] ++ [n', n' - 1 .. 1]
                     r  = floor (sqrt (fromIntegral n) :: Double)
                     n' = n `div` r - 1

sieve :: IntMap Int -> Int -> Int -> [Int] -> IntMap Int
sieve m' p' r vs = go m' p'
  where
    go m p | p > r               = m
           | m ! p > m ! (p - 1) = go (update m vs p) (p + 1)
           | otherwise           = go m (p + 1)

update :: IntMap Int -> [Int] -> Int -> IntMap Int
update s vs p = foldl' decrease s (takeWhile (>= p2) vs)
  where
    sp = s ! (p - 1)
    p2 = p * p
    sumOfSieved v = p * (s ! (v `div` p) - sp)
    decrease m  v = IntMap.adjust (subtract $ sumOfSieved v) v m

main :: IO ()
main = print $ p $ 2*10^(9 :: Int)

UPDATE:

Using mutable hashtables I've managed to make performance up to ~5.5sec on Haskell with this implementation.

Also, I used unboxed vectors instead of lists in several places. Linear hashing seems to be the fastest. I think this can be done even faster. I noticed sse42 option in hasthables package. Not sure I've managed to set it correctly but even without it runs that fast.

UPDATE 2 (19.06.2017)

I've managed to make it 3x faster then best solution from @Krom (using my code + his map) by dropping judy hashmap at all. Instead just plain arrays are used. You can come up with the same idea if you notice that keys for S hashmap are either sequence from 1 to n' or n div i for i from 1 to r. So we can represent such HashMap as two arrays making lookups in array depending on searching key.

My code + Judy HashMap

$ time ./judy
95673602693282040

real    0m0.590s
user    0m0.588s
sys     0m0.000s

My code + my sparse map

$ time ./sparse
95673602693282040

real    0m0.203s
user    0m0.196s
sys     0m0.004s

This can be done even faster if instead of IOUArray already generated vectors and Vector library is used and readArray is replaced by unsafeRead. But I don't think this should be done if only you're not really interested in optimizing this as much as possible.

Comparison with this solution is cheating and is not fair. I expect same ideas implemented in Python and C++ will be even faster. But @Krom solution with closed hashmap is already cheating because it uses custom data structure instead of standard one. At least you can see that standard and most popular hash maps in Haskell are not that fast. Using better algorithms and better ad-hoc data structures can be better for such problems.

Here's resulting code.

edited Jun 19 '17 at 13:36

answered Jun 09 '17 at 10:37

Shersh

9,019
3
33
61

Thanks, I checked timings on my machine. Using `Map.Strict` gave 4s. Using `IntMap.Strict` 3.2s. Didn't change the time: helper function `go` with less arguments, `sp`, no `if`, `subtract`, no `map`, no `reverse`. But the last 4 make the code more readable IMO, so it's great to know, thanks. [Code](https://gist.github.com/AdamStelmaszczyk/e0e794e6392dd7ff9758fc64d5557622). My feeling at the moment is that to get much better performance one needs to use mutable data structure. I'm reading about mutable vectors, however it looks like understanding monads/arrows will take me some time. – Adam Stelmaszczyk Jun 09 '17 at 12:40
1

@AdamStelmaszczyk You could try to read this blog post. It might help you to come with imperative/mutable solution faster: https://www.reddit.com/r/haskell/comments/6e4wq8/imperative_haskell/ – Shersh Jun 09 '17 at 12:58
Thanks for the link, I updated my question. I [tried](https://gist.github.com/AdamStelmaszczyk/f9a5d156a752fbe6c71652866fed9770) using mutable `BasicHashTable`, but I'm probably doing something wrong, since it's very slow... – Adam Stelmaszczyk Jun 13 '17 at 20:33
1

@AdamStelmaszczyk I've updated answer with mutable version with runs rather fast. Just directly translated Python code from your link. – Shersh Jun 18 '17 at 14:31
1

If I try your updated program it gives me a wrong result of `318504960` – typetetris Jun 18 '17 at 15:12
@Krom Yes, you're right :( I was in hurry and forgot to subtract result. Now implementation works correctly. But runs `5.5sec` instead. This is because extra lookup in hashtable in most inner loop. If instead `update` function was used (which is, unfortunately, not implemented for hashtable) then I think this can be done faster. I'll try to implement `update` function manually later. – Shersh Jun 18 '17 at 15:35
I plugged my hashmap into your code, see [here](https://gist.github.com/erwo42/a455a80bfe1dac6203204665b71d48b6) with that it runs in 0.6 seconds on my machine. – typetetris Jun 19 '17 at 05:03
@Krom Nice! That's already faster than python implementation. – Shersh Jun 19 '17 at 09:03
I didn't test the python solution on my machine, as I am not proficient with python. Maybe you can check the python solution vs. our combined solution on your machine? (Otherwise, the numbers aren't comparable.) – typetetris Jun 19 '17 at 09:05
@Krom Yes, I will check both implementation on my machine. And, probably, I can optimize your solution as well. There's a room for performance boost as I can see. – Shersh Jun 19 '17 at 09:38
@Krom I timed combined solution, got 0.7s, well done, the fastest we got here so far. Running Python (assuming `python` is already installed, on a lot of distros it is by default, if not `sudo apt-get install python` should do it) is a quick thing, just `time python problem10.py`, where problem10.py is [this](https://gist.github.com/AdamStelmaszczyk/d46faa15ba0c6d1432fd1fcbf7cc8a85). With Python 2.7.3 I got 1.2s, with Python 3.4.3 I got 1.5s. – Adam Stelmaszczyk Jun 19 '17 at 10:32
@AdamStelmaszczyk I've make it even faster with another approach. See updated answer. – Shersh Jun 19 '17 at 13:30
@Shersh I don't consider my map to be cheating, as it can be expanded into a general purpose Int to Int map. Just add a safe lookup and some rehashing logic. – typetetris Jun 19 '17 at 14:48
1

@Shersh I applaud your clever solution and upvoted your answer! – typetetris Jun 19 '17 at 14:48
Could you please post the code to your Judy Hashmap solution? I tried to plug Data.Judy into your code and it didn't give me a sub 0.6 second timings. – typetetris Jun 19 '17 at 14:51
@Krom Regarding my code with Judy HashMap — I've just used your solution from gist above. Probably my machine is a little bit faster. – Shersh Jun 19 '17 at 15:05
@Krom Regarding cheating... Well, I thing your algo works better when key set is fixed or when we don't perform inserts and deletes often... And with rehashing it would be slower. But I agree it's fair. – Shersh Jun 19 '17 at 15:06
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/147098/discussion-between-krom-and-shersh). – typetetris Jun 19 '17 at 16:23
1

I timed your sparse map - 0.2s, it's the fastest Haskell code given, well done. Others were also very helpful, thank you, but I can't split the bounty, so I think you should receive it. Out of curiosity, I [implemented](https://gist.github.com/AdamStelmaszczyk/2e8ab8c1fc57c166c1ba99b6fb9615c4) your idea in C++, I'm just storing everything in one array `S`, it runs in 0.15s. – Adam Stelmaszczyk Jun 22 '17 at 22:04
@AdamStelmaszczyk `hashtables` might perform quite badly. But arrays should be close in performance to C++. Though, I'm still surprised they are so close! – Shersh Jun 23 '17 at 07:57

typetetris · Answer 2 · 2017-06-19T11:41:29.333

First as a baseline, the timings of the existing approaches on my machine:

Original program posted in the question:

time stack exec primorig
95673602693282040

real    0m4.601s
user    0m4.387s
sys     0m0.251s

Second the version using Data.IntMap.Strict from here

time stack exec primIntMapStrict
95673602693282040

real    0m2.775s
user    0m2.753s
sys     0m0.052s

Shershs code with Data.Judy dropped in here

time stack exec prim-hash2
95673602693282040

real    0m0.945s
user    0m0.955s
sys     0m0.028s

Your python solution.

I compiled it with

python -O -m py_compile problem10.py

and the timing:

time python __pycache__/problem10.cpython-36.opt-1.pyc
95673602693282040

real    0m1.163s
user    0m1.160s
sys     0m0.003s

Your C++ version:

$ g++ -O2 --std=c++11 p10.cpp -o p10
$ time ./p10
sum(2000000000) = 95673602693282040

real    0m0.314s
user    0m0.310s
sys     0m0.003s

I didn't bother to provide a baseline for slow.hs, as I didn't want to wait for it to complete when run with an argument of 2*10^9.

Subsecond performance

The following program runs in under a second on my machine.

It uses a hand rolled hashmap, which uses closed hashing with linear probing and uses some variant of knuths hashfunction, see here.

Certainly it is somewhat tailored to the case, as the lookup function for example expects the searched keys to be present.

Timings:

time stack exec prim
95673602693282040

real    0m0.725s
user    0m0.714s
sys     0m0.047s

First I implemented my hand rolled hashmap simply to hash the keys with

key `mod` size

and selected a size multiple times higher than the expected input, but the program took 22s or more to complete.

Finally it was a matter of choosing a hash function which was good for the workload.

Here is the program:

import Data.Maybe
import Control.Monad
import Data.Array.IO
import Data.Array.Base (unsafeRead)

type Number = Int

data Map = Map { keys :: IOUArray Int Number
               , values :: IOUArray Int Number
               , size :: !Int 
               , factor :: !Int
               }

newMap :: Int -> Int -> IO Map
newMap s f = do
  k <- newArray (0, s-1) 0
  v <- newArray (0, s-1) 0
  return $ Map k v s f 

storeKey :: IOUArray Int Number -> Int -> Int -> Number -> IO Int
storeKey arr s f key = go ((key * f) `mod` s)
  where
    go :: Int -> IO Int
    go ind = do
      v <- readArray arr ind
      go2 v ind
    go2 v ind
      | v == 0    = do { writeArray arr ind key; return ind; }
      | v == key  = return ind
      | otherwise = go ((ind + 1) `mod` s)

loadKey :: IOUArray Int Number -> Int -> Int -> Number -> IO Int
loadKey arr s f key = s `seq` key `seq` go ((key *f) `mod` s)
  where
    go :: Int -> IO Int
    go ix = do
      v <- unsafeRead arr ix
      if v == key then return ix else go ((ix + 1) `mod` s)

insertIntoMap :: Map -> (Number, Number) -> IO Map
insertIntoMap m@(Map ks vs s f) (k, v) = do
  ix <- storeKey ks s f k
  writeArray vs ix v
  return m

fromList :: Int -> Int -> [(Number, Number)] -> IO Map
fromList s f xs = do
  m <- newMap s f
  foldM insertIntoMap m xs

(!) :: Map -> Number -> IO Number
(!) (Map ks vs s f) k = do
  ix <- loadKey ks s f k
  readArray vs ix

mupdate :: Map -> Number -> (Number -> Number) -> IO ()
mupdate (Map ks vs s fac) i f = do
  ix <- loadKey ks s fac i
  old <- readArray vs ix
  let x' = f old
  x' `seq` writeArray vs ix x'

r' :: Number -> Number
r'  = floor . sqrt . fromIntegral

vs' :: Integral a => a -> a -> [a]
vs' n r = [n `div` i | i <- [1..r]] ++ reverse [1..n `div` r - 1]  

vss' n r = r + n `div` r -1

list' :: Int -> Int -> [Number] -> IO Map
list' s f vs = fromList s f [(i, i * (i + 1) `div` 2 - 1) | i <- vs]

problem10 :: Number -> IO Number
problem10 n = do
      m <- list' (19*vss) (19*vss+7) vs
      nm <- sieve m 2 r vs
      nm ! n
    where vs = vs' n r
          vss = vss' n r
          r  = r' n

sieve :: Map -> Number -> Number -> [Number] -> IO Map
sieve m p r vs | p > r     = return m
               | otherwise = do
                   v1 <- m ! p
                   v2 <- m ! (p - 1)
                   nm <- if v1 > v2 then update m vs p else return m
                   sieve nm (p + 1) r vs

update :: Map -> [Number] -> Number -> IO Map
update m vs p = foldM (decrease p) m $ takeWhile (>= p*p) vs

decrease :: Number -> Map -> Number -> IO Map
decrease p m k = do
  v <- sumOfSieved m k p
  mupdate m k (subtract v)
  return m

sumOfSieved :: Map -> Number -> Number -> IO Number
sumOfSieved m v p = do
  v1 <- m ! (v `div` p)
  v2 <- m ! (p - 1)
  return $ p * (v1 - v2)

main = do { n <- problem10 (2*10^9) ; print n; } -- 2*10^9

I am not a professional with hashing and that sort of stuff, so this can certainly be improved a lot. Maybe we Haskellers should improve the of the shelf hash maps or provide some simpler ones.

My hashmap, Shershs code

If I plug my hashmap in Shershs (see answer below) code, see here we are even down to

time stack exec prim-hash2
95673602693282040

real    0m0.601s
user    0m0.604s
sys     0m0.034s

Why is slow.hs slow?

If you read through the source for the function insert in Data.HashTable.ST.Basic, you will see that it deletes the old key value pair and inserts a new one. It doesn't look up the "place" for the value and mutate it, as one might imagine, if one reads that it is a "mutable" hashtable. Here the hashtable itself is mutable, so you don't need to copy the whole hashtable for insertion of a new key value pair, but the value places for the pairs are not. I don't know if that is the whole story of slow.hs being slow, but my guess is, it is a pretty big part of it.

A few minor improvements

So that's the idea I followed while trying to improve your program the first time.

See, you don't need a mutable mapping from keys to values. Your key set is fixed. You want a mapping from keys to mutable places. (Which is, by the way, what you get from C++ by default.)

And so I tried to come up with that. I used IntMap IORef from Data.IntMap.Strict and Data.IORef first and got a timing of

tack exec prim
95673602693282040

real    0m2.134s
user    0m2.141s
sys     0m0.028s

I thought maybe it would help to work with unboxed values and to get that, I used IOUArray Int Int with 1 element each instead of IORef and got those timings:

time stack exec prim
95673602693282040

real    0m2.015s
user    0m2.018s
sys     0m0.038s

Not much of a difference and so I tried to get rid of bounds checking in the 1 element arrays by using unsafeRead and unsafeWrite and got a timing of

time stack exec prim
95673602693282040

real    0m1.845s
user    0m1.850s
sys     0m0.030s

which was the best I got using Data.IntMap.Strict.

Of course I ran each program multiple times to see if the times are stable and the differences in run time aren't just noise.

It looks like these are all just micro-optimizations.

And here is the program that ran fastest for me without using a hand rolled data structure:

import qualified Data.IntMap.Strict as M
import Control.Monad
import Data.Array.IO
import Data.Array.Base (unsafeRead, unsafeWrite)

type Number = Int
type Place = IOUArray Number Number
type Map = M.IntMap Place

tupleToRef :: (Number, Number) -> IO (Number, Place)
tupleToRef = traverse (newArray (0,0))

insertRefs :: [(Number, Number)] -> IO [(Number, Place)]
insertRefs = traverse tupleToRef

fromList :: [(Number, Number)] -> IO Map 
fromList xs = M.fromList <$> insertRefs xs

(!) :: Map -> Number -> IO Number
(!) m i = unsafeRead (m M.! i) 0

mupdate :: Map -> Number -> (Number -> Number) -> IO ()
mupdate m i f = do
  let place = m M.! i
  old <- unsafeRead place 0
  let x' = f old
  -- make the application of f strict
  x' `seq` unsafeWrite place 0 x'

r' :: Number -> Number
r'  = floor . sqrt . fromIntegral

vs' :: Integral a => a -> a -> [a]
vs' n r = [n `div` i | i <- [1..r]] ++ reverse [1..n `div` r - 1]  

list' :: [Number] -> IO Map
list' vs = fromList [(i, i * (i + 1) `div` 2 - 1) | i <- vs]

problem10 :: Number -> IO Number
problem10 n = do
      m <- list' vs
      nm <- sieve m 2 r vs
      nm ! n
    where vs = vs' n r
          r  = r' n

sieve :: Map -> Number -> Number -> [Number] -> IO Map
sieve m p r vs | p > r     = return m
               | otherwise = do
                   v1 <- m ! p
                   v2 <- m ! (p - 1)
                   nm <- if v1 > v2 then update m vs p else return m
                   sieve nm (p + 1) r vs

update :: Map -> [Number] -> Number -> IO Map
update m vs p = foldM (decrease p) m $ takeWhile (>= p*p) vs

decrease :: Number -> Map -> Number -> IO Map
decrease p m k = do
  v <- sumOfSieved m k p
  mupdate m k (subtract v)
  return m

sumOfSieved :: Map -> Number -> Number -> IO Number
sumOfSieved m v p = do
  v1 <- m ! (v `div` p)
  v2 <- m ! (p - 1)
  return $ p * (v1 - v2)

main = do { n <- problem10 (2*10^9) ; print n; } -- 2*10^9

If you profile that, you see that it spends most of the time in the custom lookup function (!), don't know how to improve that further. Trying to inline (!) with {-# INLINE (!) #-} didn't yield better results; maybe ghc already did this.

When I am off work, I will post on reddit asking for help. It would be a bit whacky if there wasn't an off-the-shelf library that beats my naive approaches here. — typetetris, Jun 19 '17 at 06:01

score 4 · Answer 3 · edited Jun 19 '17 at 21:29

This code of mine evaluates the sum to 2⋅10^9 in 0.3 seconds and the sum to 10^12 (18435588552550705911377) in 19.6 seconds (if given sufficient RAM).

import Control.DeepSeq 
import qualified Control.Monad as ControlMonad
import qualified Data.Array as Array
import qualified Data.Array.ST as ArrayST
import qualified Data.Array.Base as ArrayBase

primeLucy :: (Integer -> Integer) -> (Integer -> Integer) -> Integer -> (Integer->Integer)
primeLucy f sf n = g
  where
    r = fromIntegral $ integerSquareRoot n
    ni = fromIntegral n
    loop from to c = let go i = ControlMonad.when (to<=i) (c i >> go (i-1)) in go from

    k = ArrayST.runSTArray $ do
      k <- ArrayST.newListArray (-r,r) $ force $
        [sf (div n (toInteger i)) - sf 1|i<-[r,r-1..1]] ++
        [0] ++
        [sf (toInteger i) - sf 1|i<-[1..r]]
      ControlMonad.forM_ (takeWhile (<=r) primes) $ \p -> do
        l <- ArrayST.readArray k (p-1)
        let q = force $ f (toInteger p)

        let adjust = \i j -> do { v <- ArrayBase.unsafeRead k (i+r); w <- ArrayBase.unsafeRead k (j+r); ArrayBase.unsafeWrite k (i+r) $!! v+q*(l-w) }

        loop (-1)         (-div r p)              $ \i -> adjust i (i*p)
        loop (-div r p-1) (-min r (div ni (p*p))) $ \i -> adjust i (div (-ni) (i*p))
        loop r            (p*p)                   $ \i -> adjust i (div i p)

      return k

    g :: Integer -> Integer
    g m
      | m >= 1 && m <= integerSquareRoot n                       = k Array.! (fromIntegral m)
      | m >= integerSquareRoot n && m <= n && div n (div n m)==m = k Array.! (fromIntegral (negate (div n m)))
      | otherwise = error $ "Function not precalculated for value " ++ show m

primeSum :: Integer -> Integer
primeSum n = (primeLucy id (\m -> div (m*m+m) 2) n) n

If your integerSquareRoot function is buggy (as reportedly some are), you can replace it here with floor . sqrt . fromIntegral.

Explanation:

As the name suggests it is based upon a generalization of the famous method by "Lucy Hedgehog" eventually discovered by the original poster.

It allows you to calculate many sums of the form (with p prime) without enumerating all the primes up to N and in time O(N^0.75).

Its inputs are the function f (i.e., id if you want the prime sum), its summatory function over all the integers (i.e., in that case the sum of the first m integers or div (m*m+m) 2), and N.

PrimeLucy returns a lookup function (with p prime) restricted to certain values of n: .

doesn't compile for me, I tried that [here](https://gist.github.com/erwo42/67b15b3643685d9f5e1c63a4748964bf) and tried to compile it with `stack ghc -- -O 2 primediv.hs`. Got Errors like `force`, `primes` not in scope and `($!!)` not in scope. — typetetris, Jun 19 '17 at 16:33
Sorry for just copying from my standard library and not including all the imports. `force` and `($!!)` are from Control.DeepSeq which is included with most distributions. `primes` is just the list of primes, from whatever source. I recommend the arithmoi package from Hackage. — CarlEdman, Jun 19 '17 at 17:15
@CarlEdman I [filled the blanks](https://gist.github.com/AdamStelmaszczyk/c433b09c79a54d8b6a3464a758aa15f8) in your code and tried it with n = 2⋅10^9, it ran in 0.3s, impressive. Could you please describe your approach? — Adam Stelmaszczyk, Jun 19 '17 at 20:03
@AdamStelmaszczyk Happy to. Will do so as edit of appendix of my post. — CarlEdman, Jun 19 '17 at 20:46
I think I wasn't precise enough with the previous question, sorry. I understand how Lucy's algorithm work. What I would like to understand is why your Haskell code is so fast and is it the same method as Shersh's "my code + my map" that achieved 0.2s? E.g. "I used 2 mutable arrays, in the first one I store x... that is the same/not the same as Shersh, the difference is y". — Adam Stelmaszczyk, Jun 19 '17 at 21:43
@AdamStelmaszczyk Beyond the algorithm, there is no magic to the performance beyond the usual trial-by-error hand-tuning. It probably started with something like Shersh's and then was transformed bit-by-bit. Things that helped: Using a single mutable array; just modifying the array on each iteration, rather than copying to a new one, by performing the updates in careful order; unsafeRead/Write; custom inner loop rather than forM_; forcing results early via '$!!' and 'force'. — CarlEdman, Jun 19 '17 at 22:21
@AdamStelmaszczyk Also thanks for fixing up my TeX! I keep forgetting which forums allow which syntax. — CarlEdman, Jun 19 '17 at 22:23

score 2 · Answer 4 · edited Jun 19 '17 at 11:06

Try this and let me know how fast it is:

-- sum of primes

import Control.Monad (forM_, when)
import Control.Monad.ST
import Data.Array.ST
import Data.Array.Unboxed

sieve :: Int -> UArray Int Bool
sieve n = runSTUArray $ do
    let m = (n-1) `div` 2
        r = floor . sqrt $ fromIntegral n
    bits <- newArray (0, m-1) True
    forM_ [0 .. r `div` 2 - 1] $ \i -> do
        isPrime <- readArray bits i
        when isPrime $ do
            let a = 2*i*i + 6*i + 3
                b = 2*i*i + 8*i + 6
            forM_ [a, b .. (m-1)] $ \j -> do
                writeArray bits j False
    return bits

primes :: Int -> [Int]
primes n = 2 : [2*i+3 | (i, True) <- assocs $ sieve n]

main = do
    print $ sum $ primes 1000000

You can run it on ideone. My algorithm is the Sieve of Eratosthenes, and it should be quite fast for small n. For n = 2,000,000,000, the array size may be a problem, in which case you will need to use a segmented sieve. See my blog for more information about the Sieve of Eratosthenes. See this answer for information about a segmented sieve (but not in Haskell, unfortunately).

I tried it with n = 2⋅10^9 to compare to other implementations, 71 seconds. From what I understand, the time complexity of segmented sieve is the same as non-segmented one. I appreciate that in reality, the locality of reference can improve the speed in segmented version, however my feeling is that it will be slower than _sub-linear_ algorithm presented by Lucy_Hedgehog. My question is about optimizing Haskell code implementing Lucy's algorithm, I clarified the question now a bit, I can see how it could be a bit confusing, sorry. Nice blog, I will read it, I especially like your essays. — Adam Stelmaszczyk, Jun 19 '17 at 12:35
The segmented sieve has the same big-oh time complexity, but will be very much faster due to locality of reference. Thank you for your nice comments about my blog. — user448810, Jun 19 '17 at 12:45

How to optimize this Haskell code summing up the primes in sublinear time?

4 Answers4

Subsecond performance

My hashmap, Shershs code

Why is slow.hs slow?

A few minor improvements

Linked