I'm trying to parse large log files in Haskell. I'm using System.IO.Streams, but it seems to eat a lot of memory when I fold over the input. Here are two (ugly) examples:
First, load 1M Ints into memory in a list:
    let l = foldl (\aux p -> p:aux) [] [1..1000000]
    return (sum l)
Memory consumption is beautiful. The Ints eat 3MB and the list needs another 6MB:
(heap profile: memory consumption of building a list of 1M Ints)
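For reference, the first test as a complete program looks roughly like this (the main wrapper and the print to force the result are just scaffolding I've added):

    main :: IO ()
    main = do
        -- Build the list with an explicit left fold, then force it.
        let l = foldl (\aux p -> p:aux) [] [1..1000000 :: Int]
        print (sum l)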
Then try the same with a stream of ByteStrings. We need an ugly back-and-forth conversion, but I don't think that makes any difference:
    let s = Streams.fromList $ map (B.pack . show) [1..1000000]
    l <- s >>=
         Streams.map bsToInt >>=
         Streams.fold (\aux p -> p:aux) []
    return (sum l)
(heap profile: memory consumption of building a list of Ints from a stream)
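bsToInt isn't shown above; as the EDIT below suggests, it's essentially just a wrapper around readInt (B being Data.ByteString.Char8 here):

    import qualified Data.ByteString.Char8 as B
    import           Data.Maybe            (fromJust)

    -- Parse the leading decimal digits; fromJust throws on non-numeric
    -- input, which is fine for this test data.
    bsToInt :: B.ByteString -> Int
    bsToInt = fst . fromJust . B.readInt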
Why does it need more memory? And it's even worse if I read it from a file; then it needs 90MB:
    result <- withFileAsInput file load
    putStrLn $ "loaded " ++ show result
      where
        load is = do
            l <- Streams.lines is >>=
                 Streams.map bsToInt >>=
                 Streams.fold (\aux p -> p:aux) []
            return (sum l)
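For completeness, these are the imports the snippet assumes (withFileAsInput comes from io-streams):

    import qualified Data.ByteString.Char8  as B
    import qualified System.IO.Streams      as Streams
    import           System.IO.Streams.File (withFileAsInput)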
My assumption is that Streams.fold has some issue, because the library's built-in countInput function doesn't use it. Any ideas?
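For comparison, countInput doesn't fold at all; it wraps the stream and exposes a separate counter action, used roughly like this (countDemo is just my test name):

    import qualified System.IO.Streams as Streams

    -- countInput :: InputStream a -> IO (InputStream a, IO Int64)
    countDemo :: IO ()
    countDemo = do
        s <- Streams.fromList [1..1000000 :: Int]
        (s', getCount) <- Streams.countInput s
        Streams.skipToEof s'   -- drain the wrapped stream
        getCount >>= print     -- prints 1000000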
EDIT
After some investigation I reduced the question to this: why does this code need an extra 50MB?
    do
        let l = map (Builder.toLazyByteString . intDec) [1..1000000]
        let l2 = map (fst . fromJust . B.readInt) l
        return (foldl' (\aux p -> p:aux) [] l2)
Without the conversions it needs only 30MB; with the conversions, 90MB.
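As a full program, with the imports this reduced case needs (note that B has to be the lazy Data.ByteString.Lazy.Char8 here, since toLazyByteString produces lazy ByteStrings; the print just forces the result):

    import qualified Data.ByteString.Builder    as Builder
    import           Data.ByteString.Builder    (intDec)
    import qualified Data.ByteString.Lazy.Char8 as B
    import           Data.List                  (foldl')
    import           Data.Maybe                 (fromJust)

    main :: IO ()
    main = do
        -- Render each Int to a lazy ByteString, then parse it right back.
        let l  = map (Builder.toLazyByteString . intDec) [1..1000000 :: Int]
            l2 = map (fst . fromJust . B.readInt) l
        print (sum (foldl' (\aux p -> p:aux) [] l2))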