I can't help mentioning that not only is this a map-reduce problem, it's the map-reduce problem.
That aside, the reason you would use a trie implementation is efficiency in looking up each word to increment its count (or in adding a word that does not yet exist in the trie). In a basic trie, the lookup time per word is O(k), where k is the number of characters in the word. Over an entire document, then, with no parallel processing, you would be looking at O(n) time just for lookups, where n is the total number of characters in the document. Then you would (probably) run a depth-first search to retrieve all the words and extract the information you need. Worst-case performance of the depth-first search would be the same O(n), but the expected case would be better because of shared prefixes.
If you use a different structure, such as the standard System.Collections.Generic.Dictionary&lt;TKey, TValue&gt;, which involves a hash lookup, the cost depends on the hash implementation as well as the prevalence of hash collisions. Even that may not be the major part of the cost, though. Assume arguendo that the hash lookup is constant-time and trivial. Because equal hash codes do not guarantee equal strings, as the MSDN docs warn repeatedly, it is still necessary to compare the strings for equality, which is almost certainly O(k), where k is the number of characters (for simplicity). So, depending on the implementations of the trie and the hash-based dictionary, the hash-based dictionary is likely no better than the trie, and it may well be worse.
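For comparison, the hash-based version is only a few lines. A minimal sketch (the class and method names here are my own, not from any library):

```csharp
using System;
using System.Collections.Generic;

public static class WordCounts
{
    // Count word frequencies with the standard hash-based dictionary.
    // Each lookup hashes the word and, on a hash match, still compares
    // the full string for equality -- the per-character cost discussed above.
    public static Dictionary<string, int> Count(IEnumerable<string> words)
    {
        var counts = new Dictionary<string, int>();
        foreach (var word in words)
        {
            counts.TryGetValue(word, out int current); // current is 0 if absent
            counts[word] = current + 1;
        }
        return counts;
    }
}
```

Note that every word is hashed in full and compared in full on a hit, which is where the O(k)-per-word cost comes from.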
One valid criticism of my analysis is that the lookup at each node in the trie may not be constant-time; it depends on the collection used to hold the edges to the child nodes. However, a hash-based dictionary works well here if we don't care about sorting the keys later: hash collisions are unlikely when the key is a single character, and equality comparison is much cheaper than for full strings. Insert performance is likely reasonable as well, again depending on the implementation.
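As a sketch of the structure just described (all identifiers are my own, not from any particular library), a counting trie whose per-node edges are a Dictionary keyed by a single character might look like this:

```csharp
using System;
using System.Collections.Generic;

public class TrieNode
{
    // Per-node edges in a hash dictionary keyed by one char, so each
    // step down the trie is effectively constant-time.
    public Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
    public int Count; // > 0 marks the end of a stored word

    // O(k) insert-or-increment, where k is the word's length.
    public void Add(string word)
    {
        var node = this;
        foreach (char c in word)
        {
            if (!node.Children.TryGetValue(c, out var child))
            {
                child = new TrieNode();
                node.Children[c] = child;
            }
            node = child;
        }
        node.Count++;
    }

    // Depth-first traversal yielding every stored word and its count.
    public IEnumerable<KeyValuePair<string, int>> Walk(string prefix = "")
    {
        if (Count > 0)
            yield return new KeyValuePair<string, int>(prefix, Count);
        foreach (var kv in Children)
            foreach (var pair in kv.Value.Walk(prefix + kv.Key))
                yield return pair;
    }
}
```

Add is called once per word as you scan the document; Walk is the depth-first search that recovers the (word, count) pairs afterward.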
However, if you know you are going to report the top n words by word count, you likely want to keep a running record of the top n counts as you populate the trie. That way, you do not need to recompute the top n after the fact.
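One way to sketch that running bookkeeping (again, the class and its names are mine): keep a small candidate set of at most n words, and on every increment give the word a chance to (re)enter it. Because counts only ever increase, a word evicted as the minimum can only get back in by later exceeding the surviving counts, so the final candidates are a valid top n.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class TopNTracker
{
    private readonly int n;
    private readonly Dictionary<string, int> top = new Dictionary<string, int>();

    public TopNTracker(int n) { this.n = n; }

    // Call this every time a word's count is incremented to newCount.
    public void Update(string word, int newCount)
    {
        top[word] = newCount;
        if (top.Count > n)
        {
            // Evict the current minimum; with small n this linear scan
            // is cheap relative to the counting itself.
            string min = null;
            foreach (var kv in top)
                if (min == null || kv.Value < top[min]) min = kv.Key;
            top.Remove(min);
        }
    }

    public IEnumerable<KeyValuePair<string, int>> Current =>
        top.OrderByDescending(kv => kv.Value);
}
```

After each increment in the trie (or in a dictionary), call Update(word, newCount); Current then yields the top n without a second pass over the whole structure.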