I'm trying to process a very large Unicode text file (6GB+) and count the frequency of each unique word. I use a strict `Data.Map` to keep track of the counts of each word as I traverse the file.
The process takes too much time and too much memory (20GB+). I suspect the Map is huge but I'm not sure it should reach 5x the size of the file!
The code is shown below. Please note that I tried the following:
- Using `Data.HashMap.Strict` instead of `Data.Map.Strict`. `Data.Map` seems to do better here: its memory consumption grows more slowly.
- Reading the file with lazy `ByteString` instead of lazy `Text`, then decoding it to `Text`, doing some processing, and encoding it back to `ByteString` for `IO`. A rough sketch of that variant is shown right after this list.
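For reference, the `ByteString` attempt looked roughly like this (a minimal sketch from memory, assuming UTF-8 input and `Data.Text.Lazy.Encoding` for decoding/encoding; the counting step is the same idea as in the real code):

```haskell
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text.Lazy as T
import qualified Data.Text.Lazy.Encoding as TE
import qualified Data.Map.Strict as M
import System.Environment (getArgs)

main :: IO ()
main = do
  [file, out] <- getArgs
  bytes <- BL.readFile file                    -- read lazily as raw bytes
  let ws     = T.words (TE.decodeUtf8 bytes)   -- decode to lazy Text, split into words
      counts = M.fromListWith (+) (zip ws (repeat (1 :: Int)))
      render (w, n) = w `T.append` T.pack ('\t' : show n)
      body   = T.unlines (map render (M.toList counts))
  BL.writeFile out (TE.encodeUtf8 body)        -- encode back to bytes for output
```

The actual code I'm asking about, which reads the input as lazy `Text` directly, is below.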
```haskell
import Data.Text.Lazy (Text(..), cons, pack, append)
import qualified Data.Text.Lazy as T
import qualified Data.Text.Lazy.IO as TI
import Data.Map.Strict hiding (foldr, map, foldl')
import System.Environment
import System.IO
import Data.Word

-- Build the frequency map: pair every word with 1 and sum up duplicates.
dictionate :: [Text] -> Map Text Word16
dictionate = fromListWith (+) . (`zip` [1,1..])

main :: IO ()
main = do
  [file,out] <- getArgs
  h  <- openFile file ReadMode
  hO <- openFile out WriteMode
  mapM_ (flip hSetEncoding utf8) [h,hO]
  txt <- TI.hGetContents h
  -- Render each entry as "word\tcount" and write the lines out.
  TI.hPutStr hO . T.unlines .
    map (uncurry ((. cons '\t' . pack . show) . append)) .
    toList . dictionate . T.words $ txt
  hFlush hO
  mapM_ hClose [h,hO]
  print "success"
```
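For what it's worth, `dictionate` itself behaves as I expect on tiny inputs (roughly, from GHCi):

```
ghci> dictionate (T.words (pack "a b a c a b"))
fromList [("a",3),("b",2),("c",1)]
```

so the problem seems to be one of scale rather than correctness.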
What's wrong with my approach? What's the best way to accomplish what I'm trying to do in terms of time and memory performance?