Suppose I have several 200 MB+ files that I want to grep through. How would I do this in Haskell?
Here's my initial program:
import Data.List
import Control.Monad
import System.IO
import System.Environment
main = do
    filename <- liftM head getArgs
    contents <- liftM lines $ readFile filename
    putStrLn . unlines . filter (isPrefixOf "import") $ contents
This reads the whole file into memory before parsing through it. Then I went with this:
import Data.List
import Control.Monad
import System.IO
import System.Environment
main = do
    filename <- liftM head getArgs
    file <- openFile filename ReadMode
    contents <- liftM lines $ hGetContents file
    putStrLn . unlines . filter (isPrefixOf "import") $ contents
I thought that since hGetContents is lazy, it would avoid reading the whole file into memory. But running both scripts under valgrind showed similar memory usage for both. So either my script is wrong, or valgrind is wrong. I compile the scripts with
ghc --make test.hs -prof
What am I missing?
Bonus question: I see a lot of mentions on SO of how lazy IO in Haskell is actually a bad thing. How / why would I use strict IO?
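For context, here is a minimal sketch of what a strict read could look like, using the strict Data.ByteString.Char8 API (my own illustration, not one of the programs above). The whole file is pulled into memory up front, so memory use grows with file size, but the file handle is released at a predictable point, which is the usual argument for strict IO:

import qualified Data.ByteString.Char8 as B
import System.Environment

main :: IO ()
main = do
    [filename] <- getArgs
    -- B.readFile is strict: the entire file is read into one packed
    -- ByteString before any filtering happens.
    contents <- B.readFile filename
    B.putStr . B.unlines . filter (B.isPrefixOf (B.pack "import")) . B.lines $ contents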
Update:
So it looks like I was wrong in my reading of valgrind. Using +RTS -s, here's what I get:
  7,807,461,968 bytes allocated in the heap
  1,563,351,416 bytes copied during GC
        101,888 bytes maximum residency (1150 sample(s))
         45,576 bytes maximum slop
              2 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0: 13739 collections,     0 parallel,  2.91s,  2.95s elapsed
  Generation 1:  1150 collections,     0 parallel,  0.18s,  0.18s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    2.07s  (  2.28s elapsed)
  GC    time    3.09s  (  3.13s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    5.16s  (  5.41s elapsed)
The important line is 101,888 bytes maximum residency, which says that at any given point my script was using at most about 100 KB of memory. The file I was grepping through was 44 MB. So I think the verdict is: readFile and hGetContents are both lazy.
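That also squares with how readFile is defined in base, which (if I'm reading the source correctly) is essentially just hGetContents on a freshly opened handle:

readFile :: FilePath -> IO String
readFile name = openFile name ReadMode >>= hGetContents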
Follow-up question:
Why do I see 7 GB of memory allocated on the heap? That seems really high for a script that's reading a 44 MB file.
Update to follow-up question
Looks like a few GB allocated on the heap is not atypical for Haskell: that figure is the total allocated over the whole run, most of it garbage-collected almost immediately, not the amount resident at any one time, so no cause for concern. Using ByteStrings instead of Strings takes the allocation down a lot (a sketch of that version follows the stats below):
  81,617,024 bytes allocated in the heap
      35,072 bytes copied during GC
      78,832 bytes maximum residency (1 sample(s))
      26,960 bytes maximum slop
           2 MB total memory in use (0 MB lost due to fragmentation)
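For reference, the ByteString version looks roughly like this (a reconstructed sketch using Data.ByteString.Lazy.Char8, not the exact code behind the numbers above):

import qualified Data.ByteString.Lazy.Char8 as L
import System.Environment

main :: IO ()
main = do
    [filename] <- getArgs
    -- L.readFile is lazy: the file is read in chunks as the output is
    -- consumed, so only a small window is resident at once, and each
    -- chunk is a packed byte array rather than a linked list of Chars,
    -- which is where the big drop in allocation comes from.
    contents <- L.readFile filename
    L.putStr . L.unlines . filter (L.isPrefixOf (L.pack "import")) . L.lines $ contents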