
I know word-count questions have been asked many times, and a map seems to be the unanimous choice for them.

But I felt that a map might take a lot of space if the text is huge and the number of unique words is very high. So why not use a trie? The node where each word ends would store that word's frequency.
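
Something like this rough sketch is what I have in mind (lowercase a-z only; the names are just illustrative):

    #include <memory>
    #include <string>

    // Sketch of the trie idea: one child per letter (a-z only) and a counter
    // on the node where a word ends (which need not be a leaf).
    struct TrieNode {
        std::unique_ptr<TrieNode> child[26];
        int count = 0;  // frequency of the word ending at this node
    };

    void addWord(TrieNode& root, const std::string& word) {
        TrieNode* node = &root;
        for (char c : word) {
            int i = c - 'a';
            if (!node->child[i])
                node->child[i] = std::make_unique<TrieNode>();
            node = node->child[i].get();
        }
        ++node->count;  // one more occurrence of this word
    }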

Or is the map a clear winner when compared to a trie?

Please help me understand.

P.S. This was asked in an SDE interview.

rajya vardhan
  • Why not implement it both ways, test and answer yourself? :) – unkulunkulu Jul 25 '11 at 09:19
  • 4
  • From here (http://www.languagemonitor.com/global-english/number-of-words-in-the-english-language-1008879/) we can estimate the number of words in the English language at about 1M. From here (http://stackoverflow.com/questions/720507/how-can-i-estimate-memory-usage-of-stlmap) we get the formula for the memory usage of a map. Now we can calculate that if your text contains every word in the language, your map will take about (average word length 6, so the strings stay short) (32 bytes per short string (Windows) + 4 bytes per int) * 1M = 36 MB, roughly 34 MiB of memory, plus negligible overhead. So I'd say unless you are on an embedded system you don't need to worry. – RedX Jul 25 '11 at 09:37
  • @RedX - That is the kind of explanation I was looking for. I wish I could have marked it as the accepted answer. Thanks!! – rajya vardhan Jul 25 '11 at 12:14
  • 1
  • @RedX: Please add this as an answer - it is clear and concise. – Elemental Jul 25 '11 at 14:36
  • I suggested using a Trie in this answer to a different word-counting question: http://stackoverflow.com/questions/6803874/how-do-i-count-repeated-words/6812935#6812935 . That question was asking about Java, where converting the bytes to a String so we could do the map lookup would ultimately lead to lots of GC pressure. Granted, in C++ you can avoid this issue. What I really like about using a Trie vs. a Map is that you could combine insertion into the trie and reading from the file into the same loop (see the sketch below). – eSniff Jul 26 '11 at 17:30
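
A rough illustration of that last point (assuming lowercase ASCII input; the node layout and names are a sketch of mine, not taken from the linked answer):

    #include <cctype>
    #include <istream>
    #include <memory>

    struct TrieNode {
        std::unique_ptr<TrieNode> child[26];
        int count = 0;  // frequency of the word ending at this node
    };

    // Walk the trie directly while reading, so no intermediate string is ever
    // built just to do a lookup - reading and inserting share one loop.
    void countFromStream(std::istream& in, TrieNode& root) {
        TrieNode* node = &root;
        char c;
        while (in.get(c)) {
            if (std::isalpha(static_cast<unsigned char>(c))) {
                int i = std::tolower(static_cast<unsigned char>(c)) - 'a';
                if (!node->child[i])
                    node->child[i] = std::make_unique<TrieNode>();
                node = node->child[i].get();
            } else if (node != &root) {
                ++node->count;  // a word just ended
                node = &root;
            }
        }
        if (node != &root) ++node->count;  // last word, if input has no trailing separator
    }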

2 Answers


A trie seems a very reasonable solution to me - it certainly has a smaller footprint for most large bodies of text. I also suspect that, depending on the data and the internal workings of the map, it might be quicker. Really the only objection is that it is a bit of overkill, since counting unique words is not terribly processor-intensive.
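
For what it's worth, the map version is only a few lines (a rough sketch; the names are mine), which is largely why it is the usual default:

    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>

    // Baseline map-based counter: one entry per unique word.
    std::map<std::string, int> countWords(std::istream& in) {
        std::map<std::string, int> counts;
        std::string word;
        while (in >> word)
            ++counts[word];  // value-initialized to 0 on first insertion, then incremented
        return counts;
    }

    int main() {
        std::istringstream text("the quick brown fox jumps over the lazy dog the end");
        for (const auto& entry : countWords(text))
            std::cout << entry.first << ": " << entry.second << '\n';
    }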

Elemental

From here (http://www.languagemonitor.com/global-english/number-of-words-in-the-english-language-1008879/) we can estimate the number of words in the English language at about 1M. From here (http://stackoverflow.com/questions/720507/how-can-i-estimate-memory-usage-of-stlmap) we get the formula for the memory usage of a map. Now we can calculate that if your text contains every word in the language, your map will take about (average word length 6 characters, so the strings stay short) (32 bytes per short string (Windows) + 4 bytes per int) * 1M entries + (negligible overhead) = 36 MB, roughly 34 MiB of memory.

So I'd say that unless you are on an embedded system you don't need to worry.
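
For anyone who wants to check the arithmetic, here is the same back-of-the-envelope estimate as a tiny program (the 32-byte and 4-byte figures are the assumptions from above, not measurements):

    #include <cstdio>

    int main() {
        // Assumptions from the estimate above: ~1M unique words, each key held in
        // the string's short-string buffer (~32 bytes on the Windows STL referenced),
        // plus a 4-byte int counter; tree-node overhead treated as negligible.
        const double uniqueWords  = 1e6;
        const double bytesPerWord = 32.0 + 4.0;
        const double totalBytes   = uniqueWords * bytesPerWord;
        std::printf("~%.0f bytes, i.e. about %.1f MiB\n",
                    totalBytes, totalBytes / (1024.0 * 1024.0));  // ~36,000,000 bytes, ~34.3 MiB
        return 0;
    }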

RedX
  • If we are on an embedded system with limited virtual memory and RAM, would you suggest using a trie? Suppose the page size is 4K and the available RAM is 2MB; how would we count the words in a 1GB text? – Alam Aug 06 '12 at 14:46