3

I have the following Haskell program for computing a maximum sum substring of a string of integers:

{-# LANGUAGE BangPatterns #-} {-# OPTIONS_GHC -O2 #-}
import Data.Functor
import Data.Maybe 
import Data.ByteString.Lazy.Char8 (getContents,lines,readInt,words)
import Prelude hiding (getContents,words,lines)

main = do
    cont <- words <$> getContents
    putStrLn $ show $ snd $ foldl opt (0,0) $ map (fst.fromJust.readInt) cont

opt (!c,!m) x = (max 0 (c+x),max m (c+x))

The problem with this program is that it reads the whole file into memory. The corresponding program without BytesString does not have this problem:

{-# LANGUAGE BangPatterns #-} {-# OPTIONS_GHC -O2 #-}
import Data.Functor
import Data.Maybe 

main = do
    cont <- words <$> getContents
    putStrLn $ show $ snd $ foldl opt (0,0) $ map read cont
opt (!c,!m) x = (max 0 (c+x),max m (c+x))

It only uses a small constant amount of memory, but of course it is excruciatingly slow (about 25x slower).

The problem only occurs for programs that read very large lines. If the input is spread over multiple small lines, ByteString performs as expected.

Is there any way around this?

user1531083
  • 750
  • 6
  • 15
  • I have to admit that I am not able to reproduce this anymore (as long as I leave optimization enabled)... – user1531083 Sep 02 '13 at 20:34
  • 1
    Did you always have the OPTIONS_GHC on the same line as the other pragma? That likely doesn't work to actually enable optimizations. – Don Stewart Sep 03 '13 at 08:23

2 Answers2

6

The use of lazy tuples there is sub-optimal. This is better rewritten as:

main = do
    cont <- words <$> getContents
    putStrLn $ show $ sndT $ foldl opt (T 0 0) $ map (fst.fromJust.readInt) cont

sndT :: T -> Int
sndT (T _ m) = m

opt (T c m) x = T (max 0 (c+x)) (max m (c+x))

data T = T {-# UNPACK #-} !Int  {-# UNPACK #-}!Int

So you get a strict, unboxed accumulator. However, you're better off writing this whole thing as an incremental left fold. that's why readInt returns the remaining input in its 2nd parameter. No need for the sum . map . words pipeline.


The version you submitted leaks space. Run on a large file, and it uses heap proportional to the file size (on 640k entries).

$ time ./A +RTS -p -s -K50M < input.txt.2
346882
     326,337,136 bytes allocated in the heap
     302,321,732 bytes copied during GC
      82,617,772 bytes maximum residency (8 sample(s))
       1,466,500 bytes maximum slop
             149 MB total memory in use (0 MB lost due to fragmentation)

  %GC     time      63.8%  (63.9% elapsed)

So it is retaining the file, as you say.

enter image description here

So what is retaining memory? One clue is the foldl with opt. You pass it a lazy tuple. And foldl is lazy in its accumulator.

So you're simply building up a long chain of unevaluated + operations. The bang patterns on opt make no difference, since foldl never evaluates its accumulator. Only when you finally inspect the result at the end does the whole thing collapse.

This is a classic space leak. So:

  • Use foldl' -- it is strict in the accumulator
  • Avoid intermediate lists entirely (use readInt + unfoldr).
  • If you must traverse a list, use a strict tuple on the accumulator for even better performance.
Don Stewart
  • 137,316
  • 36
  • 365
  • 468
  • Thank you that works quite nicely. However I still do not understand why my tuples lead to a space leak. Could you enlighten me there? – user1531083 Sep 02 '13 at 11:09
  • And what I absolutely don't get: Why do I only leak space with the ByteString version of my program? – user1531083 Sep 02 '13 at 12:14
  • Your explanation cannot be the reason for the observed behaviour, because 1. I tried foldl' and it made no difference at all 2. the String version never leaks space 3. The program can cope with 100M of data which would be a stackoverflow with unevaluated `+` – user1531083 Sep 02 '13 at 12:51
  • Now I am totally confused, I cannot reproduce my own problem...?? The ByteString program I gave just consumed a LARGE file without retaining it in memory on my computer. I don't think there was any software-update or anything. – user1531083 Sep 02 '13 at 12:58
1

map (fst.fromJust.readInt) cont

shouldn't this be

main = do
    cont <- getContents
    putStrLn $ show $ snd $ foldl opt (0,0) $
               unfoldr (readInt . dropWhile isSpace) cont
Sassa NF
  • 5,306
  • 15
  • 22
  • I also have to drop the spaces in between so more something like `unfoldr (readInt . dropWhile (==' '))`, but thank you, I did not know unfoldr. – user1531083 Sep 02 '13 at 11:07
  • yes, sorry, i realized that, but looks like the edit went through too late – Sassa NF Sep 02 '13 at 11:08