5

I want to count the number of occurrences of each character in a large file. Although I know that counting should be implemented in a strict way in Haskell (which I tried to achieve using foldl'), I am still running out of memory. For comparison: the file size is about 2GB, while the computer has 100GB of memory. There are not many different characters in that file - maybe 20. What am I doing wrong?

ins :: [(Char,Int)] -> Char -> [(Char,Int)]
ins [] c = [(c,1)]
ins ((c,i):cs) d
    | c == d = (c,i+1):cs
    | otherwise = (c,i) : ins cs d

main = do
    [file] <- getArgs
    txt <- readFile file
    print $ foldl' ins [] txt
Dune
  • 293
  • 1
  • 10
  • I recommend you read about [What is Weak Head Normal Form?](http://stackoverflow.com/questions/6872898/haskell-what-is-weak-head-normal-form) – freestyle Dec 23 '16 at 14:41
  • This works fine with `ins m c = Map.insertWith' (+) c (1::Int) m` – Michael Dec 23 '16 at 16:29

1 Answers1

7

Your ins function is creating tons of thunks that cause a lot of memory leak. foldl' only evaluates to weak head normal form which is not enough here. What you need is deepseq from Control.DeepSeq in order to get to normal form.

Alternatively, instead of association list, use Data.Map.Strict for counting. Also, If your IO is on the order of 2GB, you better use lazy ByteString instead of plain strings.

Bellow code should perform in constant memory space regardless of size of input:

import System.Environment (getArgs)
import Data.Map.Strict (empty, alter)
import qualified Data.ByteString.Lazy.Char8 as B

main :: IO ()
main = getArgs >>= B.readFile . head >>= print . B.foldl' go empty
  where
  go = flip $ alter inc
  inc :: Maybe Int -> Maybe Int
  inc Nothing  = Just 1
  inc (Just i) = Just $ i + 1
behzad.nouri
  • 74,723
  • 18
  • 126
  • 124
  • I generally don't recommend `deepseq` except for parallel programming. While it will work okay in this case, it is overkill even here--it can be expected to force twice as much as necessary! In other contexts, it's much worse. Also, `Data.Bytestring.Char8` is not at all appropriate for counting *characters*, unless the input is known to be restricted to something ASCII-like. Finally, `Data.Map` is not ideal for maps with small keys. It's typically much better to use `IntMap` and convert `Char` to and from `Int`. – dfeuer Dec 24 '16 at 14:11
  • That really helped! Thanks a lot! – Dune Dec 24 '16 at 16:08