10

I'm trying to process a very large Unicode text file (6GB+). What I want is to count the frequency of each unique word. I use a strict Data.Map to keep track of the count of each word as I traverse the file. The process takes too much time and too much memory (20GB+). I suspect the Map is huge, but I'm not sure it should reach 5x the size of the file! The code is shown below. Please note that I tried the following:

  • Using Data.HashMap.Strict instead of Data.Map.Strict. Data.Map seems to perform better: its memory consumption grows more slowly.

  • Reading the file as a lazy ByteString instead of lazy Text, then decoding it to Text, doing some processing, and encoding it back to ByteString for IO.

    import Data.Text.Lazy (Text(..), cons, pack, append)
    import qualified Data.Text.Lazy as T
    import qualified Data.Text.Lazy.IO as TI
    import Data.Map.Strict hiding (foldr, map, foldl')
    import System.Environment
    import System.IO
    import Data.Word
    
    dictionate :: [Text] -> Map Text Word16
    dictionate = fromListWith (+) . (`zip` [1,1..])
    
    main = do
        [file,out] <- getArgs
        h <- openFile file ReadMode
        hO <- openFile out WriteMode
        mapM_ (flip hSetEncoding utf8) [h,hO]
        txt <- TI.hGetContents h
        TI.hPutStr hO . T.unlines . 
          map (uncurry ((. cons '\t' . pack . show) . append)) . 
          toList . dictionate . T.words $ txt
        hFlush hO
        mapM_ hClose [h,hO]
        print "success"
    

What's wrong with my approach? What's the best way to accomplish what I'm trying to do in terms of time and memory performance?

recursion.ninja
haskelline
  • How many distinct words are there, roughly, in the file? That should give quite a hint as to whether such high memory consumption is inevitable. – leftaroundabout Nov 05 '13 at 00:12
  • Are you reading the whole file into memory in order to process it? If so, that explains the high memory usage. Try reading in the file line by line. – acfrancis Nov 05 '13 at 00:12
  • @acfrancis: `Data.Text.Lazy.IO.hGetContents` should certainly get that point right. – leftaroundabout Nov 05 '13 at 00:14
  • @leftaroundabout let's assume the worst-case, all words in the file are unique. Should the Map size reach 30GB? – haskelline Nov 05 '13 at 00:17
  • That's surely possible. A tree-like structure such as `Map` has lots of overhead, I'd suspect (assuming x86-64) ~4 machine words per cell, extra two for each `Text`, and finally 1-2 to store the words' characters themselves. Add some total overhead for garbage collection, and 30GB seem not unlikely for hundreds of millions of words. – leftaroundabout Nov 05 '13 at 00:27
  • @leftaroundabout is there a way to keep parts of this data-structure on disk while still maintaining fast access for the parts currently in-use? – haskelline Nov 05 '13 at 00:29
  • I suppose it is, but probably quite difficult for a `Map`. If you don't expect many duplicates among the words, perhaps you should switch to something else entirely. [External merge sort](http://en.wikipedia.org/wiki/External_sorting#External_merge_sort) (with duplicate-counting & nubbing as a separate step) is relatively simple. Of course _anything_ external is bound to involve lots of dirty IO; I wager it's actually easier to code this up in C or C++, which also give you much better control over the data structures' overhead. – leftaroundabout Nov 05 '13 at 00:38
  • Anecdotal supporting evidence: I just ran your program on a 516MB text file, generated by concatenating 512 copies of the text of a book. Memory consumption remained constant at ~16MB; the final output listed 25180 words. – duplode Nov 05 '13 at 00:48
  • @duplode That's really weird. What version of GHC are you using and on which architecture? Also are you processing unicode text? I am processing a mix of arabic and english text. – haskelline Nov 05 '13 at 00:49
  • @leftaroundabout is there anything of this sort already written for Haskell somewhere? – haskelline Nov 05 '13 at 00:50
  • It's not weird at all, @duplode just used a text file with far fewer unique words (inevitable for a natural language and not helped by the use of many copies of the _same_ text) than you apparently have in your file. – leftaroundabout Nov 05 '13 at 00:53
  • 7.6.3, x86-64. Without knowing your input, I guess my results support what leftaroundabout said (my file was made from natural language text, so there are relatively few distinct words). – duplode Nov 05 '13 at 00:53
  • I wonder why Data.HashMap is so much slower than Data.Map, although the `unordered-containers` page says: "The containers have been optimized for performance critical use, both in terms of large data quantities and high speed". – haskelline Nov 05 '13 at 01:01
  • Why not try replacing Map then? It ought to be trivial to replace Map with HashMap and easy to replace it with Hashtable. – J. Abrahamson Nov 05 '13 at 01:01
  • @J.Abrahamson I did. That's how I got to know it's much slower :) – haskelline Nov 05 '13 at 01:04
  • @haskelline Ah, you commented as I wrote mine I think. – J. Abrahamson Nov 05 '13 at 01:06
  • Just to ask the stupid questions: you're compiling with `-O2`, right? – Daniel Wagner Nov 05 '13 at 01:11
  • @DanielWagner: yes, tried both `-O` and `-O2` :) – haskelline Nov 05 '13 at 01:11
  • You might profit from using something like [`bytestring-trie`](http://hackage.haskell.org/package/bytestring-trie-0.2.3/docs/Data-Trie.html) – J. Abrahamson Nov 05 '13 at 01:20
  • Try finding efficient word count in some other languages and compare. That may give you insight into the problem (i.e. Dictionary has high overhead? Making copies of strings?) – BraveNewCurrency Nov 05 '13 at 02:26
  • seconding @J.Abrahamson's suggestion; a trie is at least in theory a much better data structure for this. – jberryman Nov 05 '13 at 02:27
  • ...oh, and of course the obligatory "what did you learn from profiling?" comment – jberryman Nov 05 '13 at 02:31
  • Have you considered using a Trie for this? It's a very good tool for this kind of analysis: http://en.wikipedia.org/wiki/Trie http://hackage.haskell.org/package/TrieMap Tries have particular advantages over binary trees or hashmaps for your kind of problem. – itsbruce Nov 05 '13 at 09:28

2 Answers

7

This memory usage is expected. A Data.Map.Map consumes about 6N words of memory plus the size of its keys and values (figures taken from this excellent post by Johan Tibell). A lazy Text value takes up 7 words plus 2*L bytes for its L characters (rounded up to a multiple of the machine word size), and a Word16 takes up two words (header + payload). We will assume a 64-bit machine, so the word size is 8 bytes. We will also assume that the average word in the input is 8 characters long.

Taking all of this into account, for N distinct words the memory usage comes to 6*N + 7*N + 2*N + 2*N = 17*N words: 6*N for the Map itself, 7*N for the lazy Text headers, 2*N for the characters (8 characters at 2 bytes each is 16 bytes, i.e. 2 words, per key), and 2*N for the Word16 values.

In the worst case, all words will be different, and there will be about (6 * 1024^3)/8 ~= 800 * 10^6 of them. Plugging that into the formula above gives a worst-case map size of approx. 102 GiB, which seems to agree with the experimental results. Solving the equation in the reverse direction tells us that your file contains about 200*10^6 distinct words.
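
Spelling the arithmetic out (a rough check only, using the 8-byte word size and 8-character average word length assumed above):

    -- Rough restatement of the arithmetic above; loadable in GHCi.
    bytesPerEntry :: Double
    bytesPerEntry = (6 + 7 + 2 + 2) * 8        -- 17 words of 8 bytes = 136 bytes

    worstCaseBytes :: Double
    worstCaseBytes = 800e6 * bytesPerEntry     -- ~1.09e11 bytes, i.e. ~101 GiB

    -- Working backwards from an observed heap size:
    distinctWordsFor :: Double -> Double
    distinctWordsFor heapBytes = heapBytes / bytesPerEntry
    -- distinctWordsFor 30e9 ~ 2.2e8, roughly the 200 million figure above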

As for alternative approaches to this problem, consider using a trie (as suggested by J. Abrahamson in the comments) or an approximate method, such as a count-min sketch.
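
A rough sketch of the trie-based route, assuming the bytestring-trie package (Data.Trie plus insertWith from Data.Trie.Convenience; treat the exact API usage as an assumption and check the package docs). It keeps each word as its raw UTF-8 bytes, so there is no ByteString/Text round-trip; note that Char8.words only splits on ASCII whitespace:

    import qualified Data.ByteString.Char8      as B
    import qualified Data.ByteString.Lazy       as BL
    import qualified Data.ByteString.Lazy.Char8 as BLC
    import qualified Data.Trie                  as T
    import qualified Data.Trie.Convenience      as TC
    import Data.List (foldl')

    -- Fold every word of the (lazy) input into a trie keyed by its UTF-8 bytes.
    countWords :: BLC.ByteString -> T.Trie Int
    countWords = foldl' step T.empty . BLC.words
      where
        step t w = TC.insertWith (+) (BL.toStrict w) 1 t

    main :: IO ()
    main = do
        txt <- BLC.getContents            -- read the whole input lazily from stdin
        mapM_ report (T.toList (countWords txt))
      where
        report (w, n) = do
            B.putStr w                    -- the word, written as raw bytes
            putStrLn ('\t' : show n)      -- then a tab and the count

Since a trie shares common prefixes between keys, it can be considerably more compact than a Map full of separate Text keys when many words overlap.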

Mikhail Glushenkov
0

In the world of traditional data processing, this problem would have been solved by sorting (externally, on disk or magnetic tape if needed) and then scanning the sorted file to count the grouped-together runs of words. Of course, you could do some partial reduction during the early phases of the sort to save some space and time.
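
As an illustration of the counting pass (a minimal sketch, assuming the input has already been split into one word per line and sorted externally, e.g. with the system sort utility), something like this streams over the sorted input and prints a count for each run; it is essentially what `sort file | uniq -c` does:

    import qualified Data.List         as L
    import qualified Data.Text.Lazy    as T
    import qualified Data.Text.Lazy.IO as TI

    -- Count runs of identical adjacent lines in an already-sorted stream
    -- (one word per line on stdin), printing "word<TAB>count" for each run.
    main :: IO ()
    main = do
        txt <- TI.getContents
        TI.putStr
            . T.unlines
            . map (\run -> T.concat [head run, T.pack ('\t' : show (length run))])
            . L.group
            . T.lines
            $ txt

Because it only ever looks at adjacent lines, this pass runs in constant memory regardless of how many distinct words there are.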

none