2

I'm trying to parse large log files in Haskell. I'm using System.IO.Streams, but it seems to eat a lot of memory when I fold over the input. Here are two (ugly) examples:

First, load 1M Ints into memory in a list.

let l = foldl (\aux p -> p:aux) [] [1..1000000]
return (sum l)

Memory consumption is beautiful: the Ints eat 3 MB and the list needs another 6 MB:

[heap profile: memory consumption of building a list of 1M Ints]

Then I try the same with a Stream of ByteStrings. This needs an ugly back-and-forth conversion, but I don't think that makes any difference.

let s = Streams.fromList $ map (B.pack . show) [1..1000000]

l <-  s >>= 
      Streams.map bsToInt  >>=
      Streams.fold (\aux p -> p:aux) [] 

return (sum l)
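
(bsToInt isn't defined in the snippet; judging from the EDIT below, it is presumably something like the following, where B is Data.ByteString.Char8:)

import Data.Maybe (fromJust)

bsToInt :: B.ByteString -> Int
bsToInt = fst . fromJust . B.readInt  -- assumes every chunk parses as an Int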

[heap profile: memory consumption of building a list of Ints from a stream]

Why does it need more memory? And it's even worse if I read the numbers from a file; then it needs 90 MB:

result <- withFileAsInput file load
putStrLn $ "loaded " ++ show result
where load is = do
        l <- Streams.lines is >>= 
             Streams.map bsToInt  >>=
             Streams.fold (\aux p -> p:aux) [] 

        return (sum l)

My assumption is that Streams.fold has some issue, because the library's built-in countInput function doesn't use it. Any ideas?

EDIT

After investigation, I reduced the question to this: why does this code need an extra 50 MB?

do
  let l = map (Builder.toLazyByteString . intDec) [1..1000000]
  let l2 = map (fst . fromJust . B.readInt) l
  return (foldl' (\aux p -> p:aux) [] l2)

Without the conversions it only needs 30 MB; with the conversions, 90 MB.
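
For reference, a self-contained version of that reduced test (my reconstruction; it assumes lazy ByteStrings, which is what toLazyByteString produces, and prints the sum so the fold is actually forced):

import Data.ByteString.Builder (intDec, toLazyByteString)
import qualified Data.ByteString.Lazy.Char8 as B
import Data.List (foldl')
import Data.Maybe (fromJust)

main :: IO ()
main = do
  -- render 1M Ints to lazy ByteStrings, then parse them back
  let l  = map (toLazyByteString . intDec) [1 .. 1000000 :: Int]
  let l2 = map (fst . fromJust . B.readInt) l
  print (sum (foldl' (\aux p -> p : aux) [] l2))

Compiling with ghc -O2 -rtsopts and running with +RTS -s shows the allocation statistics.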

halacsy

2 Answers

3

In your first example, the `foldl (\aux p -> p:aux) []` is redundant. It constructs a list with the same elements as the list it takes as an argument (just reversed)! Without the redundancy, the example is equivalent to `sum [1..1000000]` or `foldl (+) 0 [1..1000000]`. Also, it would be better to use the strict left fold `foldl'` to avoid the accumulation of reducible expressions on the heap. See Foldr Foldl Foldl' on the Haskell wiki.
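
A minimal sketch of that last point:

import Data.List (foldl')

main :: IO ()
main = do
  -- foldl would build a million-deep chain of (+) thunks before forcing
  -- anything; foldl' forces the accumulator at each step, so the sum
  -- runs in constant space.
  print (foldl' (+) 0 [1 .. 1000000 :: Int])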

In your last example, you are using `System.IO.Streams.Combinators.fold` to build a list of all the integers read from the file, and then you try to sum the list like you did in your first example.

The problem is that, because of the sequencing of file read operations imposed by the IO monad, all the data in the file has been read before you start summing the list, and it is lurking on the heap, possibly still untransformed from the original ByteStrings and taking even more memory.

The solution is to perform the actual sum inside the fold as each new element arrives; that way you don't need to have the full list in memory at any time, only the current element (being able to do this while performing I/O is one of the aims of streaming libraries). And the fold provided by io-streams is strict, analogous to foldl'. So you don't accumulate reducible expressions on the heap, either.

Try something like `System.IO.Streams.Combinators.fold (+) 0`.
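
Putting it together, a minimal sketch of the whole pipeline (the `maybe 0 fst` guard is my addition, so that a line that doesn't parse as an Int counts as 0; the question's bsToInt used fromJust instead):

import qualified Data.ByteString.Char8 as B
import qualified System.IO.Streams as Streams
import System.IO.Streams.File (withFileAsInput)

sumFile :: FilePath -> IO Int
sumFile file = withFileAsInput file $ \is ->
  Streams.lines is >>=
  Streams.map (maybe 0 fst . B.readInt) >>=
  Streams.fold (+) 0  -- strict fold: only the running total is retained

This way each parsed Int becomes garbage as soon as it has been added to the total, so the heap stays flat regardless of file size.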

danidiaz
  • `foldl (\aux p -> p:aux) []` is not always redundant. Try `take 20 $ foldl (\aux p -> p:aux) [] [1..]` vs `take 20 $ [1..]`. – bennofs Aug 20 '13 at 20:01
  • @bennofs You are correct. There is a difference when dealing with infinite lists. http://stackoverflow.com/questions/3082324/foldl-versus-foldr-behavior-with-infinite-lists – danidiaz Aug 20 '13 at 20:30
  • Maybe I was just wrong and the problem comes from converting Int to ByteString and back. I compared the two solutions `foldl (\aux p -> p:aux) [] $ map (bsToInt . iToBs) [1..1000000]` and the same without the conversion. An extra 40 MB appears when I go through ByteString. – halacsy Aug 20 '13 at 21:05
-1

So the problem was the lazy creation of ByteStrings, and not the stream fold itself. See Why creating and disposing temporal ByteStrings eats up my memory in Haskell?

halacsy