10

I'm trying to process a very large Unicode text file (6GB+). What I want is to count the frequency of each unique word. I use a strict Data.Map to keep track of the count of each word as I traverse the file. The process takes too much time and too much memory (20GB+). I suspect the Map is huge, but I'm not sure it should reach 5x the size of the file! The code is shown below. Please note that I tried the following:

  • Using Data.HashMap.Strict instead of Data.Map.Strict. Data.Map seems to perform better: its memory consumption grows more slowly.

  • Reading the file as a lazy ByteString instead of lazy Text, then decoding it to Text, doing some processing, and encoding it back to ByteString for IO.

    import Data.Text.Lazy (Text(..), cons, pack, append)
    import qualified Data.Text.Lazy as T
    import qualified Data.Text.Lazy.IO as TI
    import Data.Map.Strict hiding (foldr, map, foldl')
    import System.Environment
    import System.IO
    import Data.Word
    
    dictionate :: [Text] -> Map Text Word16
    dictionate = fromListWith (+) . (`zip` [1,1..])
    
    main = do
        [file,out] <- getArgs
        h <- openFile file ReadMode
        hO <- openFile out WriteMode
        mapM_ (flip hSetEncoding utf8) [h,hO]
        txt <- TI.hGetContents h
        TI.hPutStr hO . T.unlines . 
          map (uncurry ((. cons '\t' . pack . show) . append)) . 
          toList . dictionate . T.words $ txt
        hFlush hO
        mapM_ hClose [h,hO]
        print "success"
    

What's wrong with my approach? What's the best way to accomplish what I'm trying to do in terms of time and memory performance?

recursion.ninja
haskelline
  • How many distinct words are there, roughly, in the file? That should give quite a hint as to whether such high memory consumption is inevitable. – leftaroundabout Nov 05 '13 at 00:12
  • Are you reading the whole file into memory in order to process it? If so, that explains the high memory usage. Try reading in the file line by line. – acfrancis Nov 05 '13 at 00:12
  • @acfrancis: `Data.Text.Lazy.IO.hGetContents` should certainly get that point right. – leftaroundabout Nov 05 '13 at 00:14
  • @leftaroundabout let's assume the worst-case, all words in the file are unique. Should the Map size reach 30GB? – haskelline Nov 05 '13 at 00:17
  • That's surely possible. A tree-like structure such as `Map` has lots of overhead, I'd suspect (assuming x86-64) ~4 machine words per cell, extra two for each `Text`, and finally 1-2 to store the words' characters themselves. Add some total overhead for garbage collection, and 30GB seem not unlikely for hundreds of millions of words. – leftaroundabout Nov 05 '13 at 00:27
  • @leftaroundabout is there a way to keep parts of this data-structure on disk while still maintaining fast access for the parts currently in-use? – haskelline Nov 05 '13 at 00:29
  • I suppose it is, but probably quite difficult for a `Map`. If you don't expect many duplicates among the words, perhaps you should switch to something else entirely. [External merge sort](http://en.wikipedia.org/wiki/External_sorting#External_merge_sort) (with duplicate-counting & nubbing as a separate step) is relatively simple. Of course _anything_ external is bound to involve lots of dirty IO; I wager it's actually easier to code this up in C or C++, which also give you much better control over the data structures' overhead. – leftaroundabout Nov 05 '13 at 00:38
  • Anecdotal supporting evidence: I just ran your program on a 516MB text file, generated by concatenating 512 copies of the text of a book. Memory consumption remained constant at ~16MB; the final output listed 25180 words. – duplode Nov 05 '13 at 00:48
  • @duplode That's really weird. What version of GHC are you using and on which architecture? Also are you processing unicode text? I am processing a mix of arabic and english text. – haskelline Nov 05 '13 at 00:49
  • @leftaroundabout is there anything of this sort already written for Haskell somewhere? – haskelline Nov 05 '13 at 00:50
  • It's not weird at all, @duplode just used a text file with far fewer unique words (inevitable for a natural language and not helped by the use of many copies of the _same_ text) than you apparently have in your file. – leftaroundabout Nov 05 '13 at 00:53
  • 7.6.3, x86-64. Without knowing your input, I guess my results support what leftaroundabout said (my file was made from natural language text, so there are relatively few distinct words). – duplode Nov 05 '13 at 00:53
  • I wonder why Data.HashMap is so much slower than Data.Map, although the `unordered-containers` page says: "The containers have been optimized for performance critical use, both in terms of large data quantities and high speed". – haskelline Nov 05 '13 at 01:01
  • Why not try replacing Map then? It ought to be trivial to replace Map with HashMap and easy to replace it with Hashtable. – J. Abrahamson Nov 05 '13 at 01:01
  • @J.Abrahamson I did. That's how I got to know it's much slower :) – haskelline Nov 05 '13 at 01:04
  • @haskelline Ah, you commented as I wrote mine I think. – J. Abrahamson Nov 05 '13 at 01:06
  • Just to ask the stupid questions: you're compiling with `-O2`, right? – Daniel Wagner Nov 05 '13 at 01:11
  • @DanielWagner: yes, tried both `-O` and `-O2` :) – haskelline Nov 05 '13 at 01:11
  • You might profit from using something like [`bytestring-trie`](http://hackage.haskell.org/package/bytestring-trie-0.2.3/docs/Data-Trie.html) – J. Abrahamson Nov 05 '13 at 01:20
  • Try finding efficient word count in some other languages and compare. That may give you insight into the problem (i.e. Dictionary has high overhead? Making copies of strings?) – BraveNewCurrency Nov 05 '13 at 02:26
  • seconding @J.Abrahamson's suggestion; a trie is at least in theory a much better data structure for this. – jberryman Nov 05 '13 at 02:27
  • ...oh, and of course the obligatory "what did you learn from profiling?" comment – jberryman Nov 05 '13 at 02:31
  • Have you considered using a Trie for this? It's a very good tool for this kind of analysis: http://en.wikipedia.org/wiki/Trie http://hackage.haskell.org/package/TrieMap Tries have particular advantages over binary trees or hashmaps for your kind of problem. – itsbruce Nov 05 '13 at 09:28

2 Answers

7

This memory usage is expected. A Data.Map.Map consumes about 6N words of memory plus the size of its keys and values (figures taken from this excellent post by Johan Tibell). A lazy Text value takes up 7 words plus 2*L bytes for its L characters (rounded up to a multiple of the machine word size), and a Word16 takes up two words (header + payload). We will assume a 64-bit machine, so the word size is 8 bytes. We will also assume that the average word in the input is 8 characters long.

Taking all of this into account, for N distinct words the memory usage comes to 6*N + 7*N + 2*N + 2*N = 17*N words: 6*N for the Map itself, 7*N for the lazy Text headers, 2*N for the characters (8 characters at 2 bytes each is 16 bytes, i.e. 2 words, per key), and 2*N for the Word16 values.

In the worst case, all words will be different, and there will be about (6 * 1024^3)/8 ~= 800 * 10^6 of them. Plugging that into the formula above gives a worst-case map size of approx. 102 GiB, which seems to agree with the experimental results. Solving the equation in the reverse direction tells us that your file contains about 200*10^6 distinct words.
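
Spelling the arithmetic out (a rough check only, using the 8-byte word size and 8-character average word length assumed above):

    -- Rough restatement of the arithmetic above; loadable in GHCi.
    bytesPerEntry :: Double
    bytesPerEntry = (6 + 7 + 2 + 2) * 8        -- 17 words of 8 bytes = 136 bytes

    worstCaseBytes :: Double
    worstCaseBytes = 800e6 * bytesPerEntry     -- ~1.09e11 bytes, i.e. ~101 GiB

    -- Working backwards from an observed heap size:
    distinctWordsFor :: Double -> Double
    distinctWordsFor heapBytes = heapBytes / bytesPerEntry
    -- distinctWordsFor 30e9 ~ 2.2e8, roughly the 200 million figure above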

As for alternative approaches to this problem, consider using a trie (as suggested by J. Abrahamson in the comments) or an approximate method, such as a count-min sketch.
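
A rough sketch of the trie-based route, assuming the bytestring-trie package (Data.Trie plus insertWith from Data.Trie.Convenience; treat the exact API usage as an assumption and check the package docs). It keeps each word as its raw UTF-8 bytes, so there is no ByteString/Text round-trip; note that Char8.words only splits on ASCII whitespace:

    import qualified Data.ByteString.Char8      as B
    import qualified Data.ByteString.Lazy       as BL
    import qualified Data.ByteString.Lazy.Char8 as BLC
    import qualified Data.Trie                  as T
    import qualified Data.Trie.Convenience      as TC
    import Data.List (foldl')

    -- Fold every word of the (lazy) input into a trie keyed by its UTF-8 bytes.
    countWords :: BLC.ByteString -> T.Trie Int
    countWords = foldl' step T.empty . BLC.words
      where
        step t w = TC.insertWith (+) (BL.toStrict w) 1 t

    main :: IO ()
    main = do
        txt <- BLC.getContents            -- read the whole input lazily from stdin
        mapM_ report (T.toList (countWords txt))
      where
        report (w, n) = do
            B.putStr w                    -- the word, written as raw bytes
            putStrLn ('\t' : show n)      -- then a tab and the count

Since a trie shares common prefixes between keys, it can be considerably more compact than a Map full of separate Text keys when many words overlap.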

Mikhail Glushenkov
0

In the world of traditional data processing, this problem would have been solved by sorting (externally, on disk or magnetic tape if needed) and then scanning the sorted file to count the grouped-together runs of words. Of course, you could do some partial reduction during the early phases of the sort to save some space and time.
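
As an illustration of the counting pass (a minimal sketch, assuming the input has already been split into one word per line and sorted externally, e.g. with the system sort utility), something like this streams over the sorted input and prints a count for each run; it is essentially what `sort file | uniq -c` does:

    import qualified Data.List         as L
    import qualified Data.Text.Lazy    as T
    import qualified Data.Text.Lazy.IO as TI

    -- Count runs of identical adjacent lines in an already-sorted stream
    -- (one word per line on stdin), printing "word<TAB>count" for each run.
    main :: IO ()
    main = do
        txt <- TI.getContents
        TI.putStr
            . T.unlines
            . map (\run -> T.concat [head run, T.pack ('\t' : show (length run))])
            . L.group
            . T.lines
            $ txt

Because it only ever looks at adjacent lines, this pass runs in constant memory regardless of how many distinct words there are.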

none