
I have a simple toy example that seems to disagree with the garbage collector about which data structures can be reclaimed (i.e. it leaks memory). I am not trying to come up with more memory-efficient versions of this algorithm (a good collection of better algorithms is here: Haskell Wiki - Prime numbers); rather, I am looking for an explanation of why the garbage collector is not identifying the old, out-of-scope and unused portions of the list and reclaiming that memory.

The code is here:

import Data.List (foldl')

erat' :: (Integer, Bool) -> [(Integer,Integer)] -> [(Integer,Integer)]
erat' (c,b) ((x,y):xs)
    | c < x = (x,y) : erat' (c,b) xs        -- candidate is below this multiple: keep the entry unchanged
    | c == x = (x+y,y) : erat' (c,True) xs  -- candidate hits a multiple: mark it composite and bump the multiple
    | c > x = (x+y,y) : erat' (c,b) xs      -- multiple is lagging behind the candidate: bump it
erat' (c,b) []
    | b = []                                -- candidate was marked composite: add nothing
    | otherwise = [(c,c)]                   -- candidate is a new prime: add it, with c itself as the initial high-water mark

erat :: [Integer] -> [(Integer,Integer)]
erat = foldl' (\a c -> erat' (c,False) a) []

primes :: Integer -> [Integer]
primes n = map snd $ erat [2..n]

In essence, calling primes with a positive integer will return a list of all prime numbers up to and including that number. erat' is passed a list of pairs, each pairing a prime with its current high-water-mark multiple, together with a pair of the candidate and a Boolean (False while the candidate is presumed prime, True once it has been found to be composite). Every non-recursive call to erat' passes in a new list, and I would expect the output to share, at most, the cells from the beginning of the input list up to the point of the first change.
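For concreteness, this is the behaviour I expect from the definitions above; the GHCi session and the intermediate accumulator states below are my own hand-worked illustration:

ghci> erat [2..6]
[(8,2),(9,3),(10,5)]
ghci> primes 6
[2,3,5]
ghci> primes 10
[2,3,5,7]

-- accumulator ("crib") states while folding over [2..6]:
--   after 2: [(2,2)]
--   after 3: [(4,2),(3,3)]
--   after 4: [(6,2),(6,3)]          -- 4 matches (4,2), so it is flagged composite and not added
--   after 5: [(6,2),(6,3),(5,5)]
--   after 6: [(8,2),(9,3),(10,5)]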

As soon as the modified cells of the list passed to erat' go out of scope, that memory should become eligible for reclamation; yet, as you can see when calling primes with a large enough number (1,000,000, for example), memory utilization quickly spikes to tens of gigabytes.
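(For reference, one way to see these numbers directly, rather than through top, is GHC's runtime statistics. The small driver below is just for illustration, assuming it lives in Main.hs together with the code above:)

-- minimal driver, used only for measuring; summing the result forces the
-- whole list without the cost of printing every prime
main :: IO ()
main = print . sum $ primes 1000000

Compiled and run with, for example, ghc -O2 -rtsopts Main.hs and then ./Main +RTS -s, which prints total allocation and maximum residency on exit.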

Now, the question is: why is this happening? Shouldn't the generational garbage collector detect that those list cells are no longer referenced and reclaim them? And shouldn't it be fairly easy for it to see that they have no references, given that:

a) nothing can hold a reference to a cell from a data structure older than that cell;

b) there cannot be any newer references either, because those cells/fragments are no longer part of any reachable data structure once the list they belonged to has gone out of scope?

Of course, a mutable data structure would take care of this, but resorting to mutability in a case like this feels like dropping some of Haskell's theoretical principles on the floor.

Thanks to the people who commented (particularly Carl), I modified the algorithm slightly to add strictness (and the optimization of starting to cross off multiples at the square of each new prime, since lower multiples will already have been crossed off by multiples of lower primes).

This is the new version:

import Data.List (foldl')

erat' :: (Integer, Bool) -> [(Integer,Integer)] -> [(Integer,Integer)]
erat' (c,b) ((x,y):xs)
    | c < x = x `seq` (x,y) : erat' (c,b) xs
    | c == x = x `seq` (x+y,y) : erat' (c,True) xs
    | c > x = x `seq` (x+y,y) : erat' (c,b) xs
erat' (c,b) []
    | b = []
    | otherwise = [(c*c,c)] -- lower multiples would be covered by multiples of lower primes

erat :: [Integer] -> [(Integer,Integer)]
erat = foldl' (\a c -> erat' (c,False) a) []

primes :: Integer -> [Integer]
primes n = map snd $ erat [2..n]

The memory consumption still seems quite significant. Are there any other changes to this algorithm that could help reduce the total memory utilization?

Since Will pointed out that I didn't provide full statistics, these are the numbers for a run of the updated version of primes listed just above, with 100000 as the parameter:

[screenshot: runtime statistics for primes 100000]

And after applying the changes that Will proposed, the memory usage is down considerably. See, for example, a run of primes for 100000 again:

[screenshot: runtime statistics after the changes proposed by Will]

And last, this is the final code after the proposed changes were incorporated:

import Data.List (foldl')

erat'' :: (Integer, Bool) -> [(Integer,Integer)] -> [(Integer,Integer)]
erat'' (c,b) ((x,y):xs)
    | c < x  = (x,  y) : if x==y*y then (if b then xs 
                                              else xs++[(c*c,c)])
                                   else erat'' (c,b)    xs 
    | c == x = (x+y,y) : if x==y*y then            xs 
                                   else erat'' (c,True) xs 
    | c > x  = (x+y,y) : erat'' (c,b)    xs 
erat'' (c,True)  [] = []
erat'' (c,False) [] = [(c*c,c)]


primes'' :: Integer -> [Integer]
primes'' n = map snd $ foldl' (\a c -> (if null a then 0 else 
        case last a of (x,y) -> y) `seq` erat'' (c,False) a) [] [2..n]

And finally, a run for 1,000,000 to get a feel for the performance of this new version:

[screenshot: runtime statistics for primes'' 1000000]

fgv
  • Your analysis would be correct if Haskell was a strict language, but it's not. – Carl Sep 27 '14 at 14:53
  • I tried adding strictness using seq and deepseq, with similar results. I don't think non-strictness plays a part here, but I may be wrong. Would you care to elaborate? – fgv Sep 27 '14 at 14:58
  • This seems like a very poor use for `foldl'` at first glance. Did you try `foldr`? – John L Sep 27 '14 at 19:56
  • I can certainly try foldr. Ideally, I would prefer to parse the list from the left for clarity; it is what I did initially with explicit recursion, which I then converted to foldl' for succinctness. – fgv Sep 27 '14 at 20:37
  • @Will, I just used top in Linux (resource monitor in Windows). I didn't get too sophisticated there. – fgv Sep 28 '14 at 13:08
  • but how did you run it? in GHCi? – Will Ness Sep 28 '14 at 13:10
  • GHCi, initially. But then compiled the executable with ghc and ran it from the command line, since I wanted to take GHCi out of the picture. I used -O2 as the compiler optimization flag. – fgv Sep 28 '14 at 13:15
  • You are correct, again, from a benchmarking standpoint, but I wanted to show that the stack overflow didn't happen. I'll benchmark again with the same number and post the results. – fgv Sep 28 '14 at 16:49
  • I guess you tested with your `erat'`; with my second `g` I'd expect you got it to finish under 1.5 seconds for the 100k. -- so the problem really is algorithmic - you use the side list both for detection (so smaller primes must be at the start) and for output, so you are forced to add at the list's end - which gives it a very strong bias towards quadratic behaviour. your original code was verging on cubic though. :) interesting. With Map it should be faster. – Will Ness Sep 28 '14 at 18:11
  • @Will, yes, I did. Your change to the fold made even my version cope with larger numbers without a stack overflow, but the big O of your version of erat' was significantly better than mine. I'll see if I have time to try a Map and will post the results. In any case, the current version resorting to (mostly) prelude functions (with the exception of foldl') should be generic enough for anyone to understand. I posted the finished code above for everyone's reference. – fgv Sep 28 '14 at 18:22
  • I personally much prefer [Richard Bird's sieve](http://www.haskell.org/haskellwiki/Prime_numbers#Linear_merging) (and [variations](http://www.haskell.org/haskellwiki/Prime_numbers#Tree_merging_with_Wheel)). :) re Map, see [this answer](http://stackoverflow.com/a/1140100). – Will Ness Sep 28 '14 at 18:31
  • And I have to agree to that too. A great explanation here: http://www.cs.hmc.edu/~oneill/papers/Sieve-JFP.pdf – fgv Sep 28 '14 at 18:37
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/62077/discussion-between-will-ness-and-fgv). – Will Ness Sep 28 '14 at 18:47

2 Answers


Assumption a) is false in the presence of laziness. And in fact, your code consists almost entirely of generating cons cells pointed to by older cons cells. erat' consumes a list element, then produces a (:) constructor pointing to a tuple and an unevaluated thunk which will perform a recursive call to erat'. Only when that thunk is later evaluated will the (:) list constructor actually point to its tail as a data structure. So yes, nearly every (:) you allocate in erat' is in fact pointing forward in time. (The only exception is the last one - [foo] is going to point to the pre-existing [] constructor when its (:) constructor is allocated.)
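For illustration (a possible GHCi session, assuming the question's first erat' is loaded; the exact :sprint output may vary slightly between GHC versions):

ghci> let r = erat' (3,False) [(4,2)]
ghci> :sprint r
r = _                  -- nothing evaluated yet
ghci> head r
(4,2)
ghci> :sprint r
r = (4,2) : _          -- the (:) cell now exists, but its tail is still a thunk
                       -- holding references to the rest of the computation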

Assumption b) is nonsense in the presence of laziness. Scope determines visibility in Haskell, not lifetime. Lifetime depends on evaluation and reachability.

So what happens at runtime is that you build up a pipeline of erat' calls in erat. Each of them holds on to as much of its input as has been evaluated, slowly consuming it. The interesting part is that your code doesn't evaluate anything in advance - it looks like it should actually stream pretty well - except that the pipeline is too deep. The pipeline created is approximately n stages long - this is (inefficient!) trial division, not the sieve of Eratosthenes. You should only be adding prime numbers to the pipeline, not every number.

Carl
  • I'll try adding strictness again to see if it behaves better (it didn't seem to improve at all before). With respect to the comment about this being trial division, your assessment is incorrect. Numbers are only added in the base case of erat', and only if the second element of the pair is False (prime). The list only consists of primes and their largest multiple calculated so far, which is what the standard crib requires. – fgv Sep 27 '14 at 17:50
  • Adding strictness actually makes it worse, because it forces a ton more into memory at once. And this is definitely trial division. If it was the sieve, you'd never call `erat'` from the `foldl'` in `erat` with c == 4, for example. – Carl Sep 28 '14 at 00:37
  • erat' manages the crib, so it needs to be called for every number. However, with c==4, erat' would have a list with [(4,2),(3,3)]. Within erat', 4 would be flagged as a non-prime (setting the Bool to True), the list would change to [(6,2),(6,3)] (to cross the next multiples of 2 and 3), but 4 (correctly) would never be added to the crib. With c==5, the list would turn into [(6,2),(6,3),(5,5)]. Short of the optimization to start from the square of the new prime (since lower multiples will already be crossed by multiples of lower primes), do you propose that the crib should work differently? – fgv Sep 28 '14 at 02:55
  • in fact this _is_ the sieve of Eratosthenes as it enumerates a prime's multiples by addition. And it doesn't add non-primes, it's just that it decides not to do that in the very end, having scanned through and re-created the whole "crib" (?) list anew. – Will Ness Sep 28 '14 at 08:53
  • Folding across `[2..n]` is the underlying performance issue. Fold across just primes, and the logic will still work, but the memory use will reduce drastically. – Carl Sep 28 '14 at 16:03
  • well, it may be argued that the essence of the logic here is the double use of the side store both as input (for multiples detection) and output, and decoupling the two would be a *different* logic. still it'd have to comb through the candidates [2..n] (or 2:[3,5..n]). Searching through primes makes no sense, we'd just return them. [Using map to simulate priority queue](http://stackoverflow.com/a/1140100/849891) (instead of the list here) should also help, even if left-folding over [2..n] (25 secs to 20 mln, for *unoptimized* version ... though it does a *right* fold, essentially). – Will Ness Sep 28 '14 at 17:37
  • so *left* folding is the culprit, it forces us to add primes at the side list's end, giving it a strong bias toward quadratic behaviour. And if it were maintained in reversed order, the detection would slow down dramatically. With _right_ folding (i.e. guarded recursion) we'd just produce each found prime right away. – Will Ness Sep 28 '14 at 17:46

breaking update: You should use

map snd $ foldl' (\a c -> (if null a then 0 else 
        case last a of (x,y) -> y) `seq` erat' (c,False) a) [] [2..n]

to force the list fully on each iteration. It will consume less memory and run faster.

The reason for the above is that foldl' only forces the accumulator to weak head normal form, and even using last a isn't enough, as it would be forced just to a pair (_,_), without forcing its constituents.
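To illustrate the difference (a minimal sketch, not part of the code above; the session is approximate):

ghci> let p = (1+1, undefined) :: (Integer, Integer)
ghci> p `seq` "only the pair constructor was forced"
"only the pair constructor was forced"
ghci> snd p `seq` "forcing a component goes one level deeper"
*** Exception: Prelude.undefined

This is why the folding function above forces the second component of last a, not just last a itself.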

But when your erat' function is changed so that it stops scanning the interim list of primes and their multiples as soon as possible, and shares its tail whenever possible (as described below), it is faster without the forcing, even if using more memory.


Your (updated) code, edited a little for legibility:

g :: (Integer, Bool) -> [(Integer,Integer)] -> [(Integer,Integer)]
g (c,b) ((x,y):xs)
    | c < x  = (x,  y) : g (c,b)    xs  -- `c < x` forces the x already, 
    | c == x = (x+y,y) : g (c,True) xs  --              no need for `seq`
    | c > x  = (x+y,y) : g (c,b)    xs
g (c,True)  [] = []
g (c,False) [] = [(c*c,c)]

primes :: Integer -> [Integer]
primes n = map snd $ foldl' (\a c -> g (c,False) a) [] [2..n]

So, your primes n is actually a little like a right fold on the reversed [2..n] list. Writing h for flip $ foldl' (\a c -> g (c,False) a), it is

= map snd $ h [2..n] $ []
= map snd $ h [3..n] $ [(2*2,2)]
= map snd $ h [4..n] $ (4,2)  :(g (3,False) [])
= map snd $ h [5..n] $ (4+2,2):(g (4,True ) $ g (3,False) [])
= map snd $ h [6..n] $ (6,2)  :(g (5,False) $ g (4,True ) $ g (3,False) [])
....

The strictness of foldl' has limited effect here as the accumulator is forced only to the weak head normal form.

Folding with (\a c -> last a `seq` g (c,False) a) would give us

= map snd $ ... $ g (3,False) [(2*2,2)]
= map snd $ ... $ g (4,False) [(4,2),(3*3,3)]
= map snd $ ... $ g (5,False) [(4+2,2),(9,3)]
= map snd $ ... $ g (6,False) [(6,2),(9,3),(5*5,5)]
= map snd $ ... $ g (7,False) [(6+2,2),(9,3),(25,5)]
= map snd $ ... $ g (8,False) [(8,2),(9,3),(25,5),(7*7,7)]
= map snd $ ... $ g (9,False) [(8+2,2),(9,3),(25,5),(49,7)]
= map snd $ ... $ g (10,False) [(10,2),(9+3,3),(25,5),(49,7)]
= map snd $ ... $ g (11,False) [(10+2,2),(12,3),(25,5),(49,7)]
....
= map snd $ ... $ g (49,False) 
           [(48+2,2),(48+3,3),(50,5),(49,7),(121,11)...(2209,47)]
....

but all these changes will be pushed through to the list by the final print anyway, so the laziness is not the immediate problem here (it causes stack overflow for bigger inputs, but that's secondary here). The problem is that your erat' (renamed g above) eventually pushes each entry through the whole list needlessly, recreating the whole list for each candidate number. This is a very heavy memory usage pattern.

It should stop as early as possible, and share the list's tail whenever possible:

g :: (Integer, Bool) -> [(Integer,Integer)] -> [(Integer,Integer)]
g (c,b) ((x,y):xs)
    | c < x  = (x,  y) : if x==y*y then (if b then xs 
                                              else xs++[(c*c,c)])
                                   else g (c,b)    xs 
    | c == x = (x+y,y) : if x==y*y then            xs 
                                   else g (c,True) xs 
    | c > x  = (x+y,y) :                g (c,b)    xs 
g (c,True)  [] = []
g (c,False) [] = [(c*c,c)]

Compiled with -O2 and run standalone, it runs at under ~ N^1.9 vs your original function's ~ N^2.4..2.8 and rising, producing primes up to N.

(of course a "normal" sieve of Eratosthenes should ideally run at about ~ N^1.1, its theoretical time complexity being N log (log N)).
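(These ~ N^k figures are empirical orders of growth, read off from run times at two problem sizes. A small helper for estimating the exponent, assuming t ~ c*N^k; the name growthExponent is mine, just for illustration:)

-- local exponent k such that t2/t1 == (n2/n1)^k
growthExponent :: Double -> Double -> Double -> Double -> Double
growthExponent n1 t1 n2 t2 = logBase (n2 / n1) (t2 / t1)

-- e.g. doubling the problem size and seeing the run time grow ~3.7x:
--   growthExponent 1 1 2 3.7  ~= 1.89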

Will Ness
  • I'll change the foldl' and remove the seq from erat' as you suggested for good reasons. However the early exit from the list scanning in erat' has other consequences. Every pass through the list for the next number updates the list, crossing the next round of multiples that are higher than the candidate. Exiting early will leave multiples on the right uncrossed, so the next number will not be evaluated correctly. Did you compare your version with mine to see if you got the same results? What you propose could be done with a priority list, assuming ordering of the multiples, but not this list. – fgv Sep 28 '14 at 13:11
  • @fgv no, no, my edit keeps the semantics. yes, of course, same results. you *do* have ordering, not on multiples, but on primes (not on 1st, but 2nd field of a pair). – Will Ness Sep 28 '14 at 13:15
  • btw I tested it in GHCi only, and it gave me stack overflow for 100,000, and I had to change the folding function to force the last element to overcome that. I suspect it might also improve the memory situation. – Will Ness Sep 28 '14 at 13:38
  • you are right: your version still preserves the semantics, and runs with a better memory profile. Would you mind posting your code with respect to the forcing of the last element? – fgv Sep 28 '14 at 13:46
  • (sorry, was offline) it's at the top of the answer: the ``(if null a then 0 else case last a of (x,y) -> y) `seq` ...`` bit. – Will Ness Sep 28 '14 at 14:04
  • BTW, as I said I tested it in GHCi. Compiled with -O2, it might have quite a different behaviour, and *not using* `last`-forcing may actually be faster, even if using more memory overall. YMMV, depending on the compiler's version. – Will Ness Sep 28 '14 at 14:33