
Suppose I need to count words in a very large file (words are separated by " ").

I would do the following:

  1. Not load the entire file into memory; instead, read the stream line by line.
  2. For each line, split it into words and add each distinct word to a "dictionary" (I mean the Dictionary class in .NET) along with its count.

Now, to retrieve the most frequent word, sort the dictionary and take the top entry.
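
Something like this rough sketch is what I have in mind (the file name and variable names are just placeholders):

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

var counts = new Dictionary<string, int>();

// 1. Don't load the entire file; read the stream line by line.
using (var reader = new StreamReader("huge.txt"))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        // 2. Split on " " and count each distinct word in the dictionary.
        foreach (var word in line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
        {
            int count;
            counts.TryGetValue(word, out count);
            counts[word] = count + 1;
        }
    }
}

// Sort by count and take the most frequent word.
var mostFrequent = counts.OrderByDescending(kv => kv.Value).First();
Console.WriteLine("{0}: {1}", mostFrequent.Key, mostFrequent.Value);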

But most solutions favor a trie data structure for this; please clarify why (also, it would be great if you could clarify why one wouldn't use a hash table instead of a dictionary).

Thanks.

Sunil Vurity
  • What does _very large_ mean, exactly? – Tim Schmelter Sep 01 '14 at 22:09
  • "why not hash table over dictionary": a `Dictionary` *is* a hashtable; it's basically the same as the `Hashtable` class, except that it's generic. – Thomas Levesque Sep 01 '14 at 22:20
  • Why don't you just try it for yourself? You'll get plenty of help from Google when you search for "c# trie class". When you compare how well it does against a LINQ query or a Dictionary, you'll discover something that's very, very important to know about the way modern computers work, and you'll be able to ask a good question about it. – Hans Passant Sep 01 '14 at 22:23

2 Answers


I can't help mentioning that not only is this a map-reduce problem, it's the map-reduce problem.

That aside, the reason you would use a trie implementation is for efficiency in looking up each word to increment its count (or for adding a word that does not yet exist in the trie). In a basic trie, the lookup time per word is O(n), where n is the number of characters in the word. Over an entire document, then, with no parallel processing, you would be looking at O(n) time just for lookups, where n is the number of characters in the document. Then, it would be (probably) a depth-first search to retrieve all the words so that you could extract the information you need. Worst-case performance of the depth-first search would be the same O(n), but the expected case would be better due to common prefixes.

If you use a different structure, such as the standard System.Collections.Generic.Dictionary<TKey, TValue>, that involves a hash lookup, the cost is related to the hash lookup and implementation as well as the prevalence of hash collisions. However, even that may not be the major part of the cost. Assume arguendo that the hash lookup is constant-time and trivial. Because equal hash codes do not guarantee equal strings, as the MSDN docs warn repeatedly, it is still necessary to compare strings for equality, which is almost certainly implemented as O(n), where n is the number of characters (for simplicity). So, depending on the implementations of the trie and some hash-lookup-based dictionary, the hash-lookup-based dictionary is likely no better than the trie, and it may well be worse.

One valid criticism of my analysis might be that the lookup at each node in the trie may not be constant-time; it would depend on the collection used to determine the edges to the succeeding nodes. However, a hash-lookup-based dictionary may work well here if we don't care about sorting the keys later. Hash collisions are unlikely when the input is one character, and equality comparisons would be much less involved than with full strings. The insert performance is likely reasonable as well, again depending on the implementation.
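
To make that concrete, here is a minimal sketch of a counting trie along those lines (the names are illustrative; children are kept in a per-node dictionary keyed by character):

using System.Collections.Generic;

class TrieNode
{
    // Edges to the succeeding nodes, keyed by the next character.
    public Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();

    // Greater than zero only for nodes that terminate a word.
    public int Count;
}

class WordTrie
{
    private readonly TrieNode _root = new TrieNode();

    // O(m) per word, where m is the word's length: walk (or create) one node
    // per character, then bump the count on the terminal node.
    public void Increment(string word)
    {
        var node = _root;
        foreach (var c in word)
        {
            TrieNode next;
            if (!node.Children.TryGetValue(c, out next))
            {
                next = new TrieNode();
                node.Children.Add(c, next);
            }
            node = next;
        }
        node.Count++;
    }
}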

However, if you know you are going to determine the top n words by word count, you likely need to keep track of the top n word counts as you go in addition to keeping track of them in the trie. That way, you do not need to recompute the top n after populating the trie.
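
For example, one simple way to do that (a sketch only; the class and method names are illustrative, and ties are broken arbitrarily) is to refresh a small candidate set of at most n words each time a word's count is incremented:

using System.Collections.Generic;

static class TopNTracker
{
    // Call this after every increment, passing the word's new running count.
    // Because counts only ever grow, the dictionary ends up holding the n most
    // frequent words once the whole input has been processed.
    public static void Update(Dictionary<string, long> top, string word, long newCount, int n)
    {
        top[word] = newCount;                 // add or refresh this word's entry
        if (top.Count <= n)
            return;

        // Evict the entry with the smallest count; for small n a linear scan is fine.
        string smallest = null;
        long smallestCount = long.MaxValue;
        foreach (var kv in top)
        {
            if (kv.Value < smallestCount)
            {
                smallest = kv.Key;
                smallestCount = kv.Value;
            }
        }
        top.Remove(smallest);
    }
}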

Andrew
  • And it's also about memory consumption. It takes much less memory to keep the words from the _very_ large file in a trie. In a hashmap you'll have another record for every variant of a word, but in a trie you reuse the already existing word parts. – Roman Pushkin Jun 27 '16 at 15:45
  • Your 'the map-reduce problem' link is broken. It's possible the one on this page is comparable but I can't really know, having never seen the original. https://hadoop.apache.org/docs/r2.9.1/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html – bielawski Jun 14 '18 at 13:12
  • @bielawski Yes, that link seems to be the equivalent, though the old one is on [archive.org](http://web.archive.org/web/20140913140934/http://hadoop.apache.org:80/docs/r0.18.3/mapred_tutorial.html). I'll edit. – Andrew Jun 14 '18 at 22:47

You can use File.ReadLines, which reads the file lazily, similar to a stream-reader:

var mostFrequent = File.ReadLines("Path")
    .SelectMany(l => l.Split()) // splits also by tabs
    .GroupBy(word => word)
    .OrderByDescending(g => g.Count())
    .First(); // or Take(10) if you want the top 10

Console.Write("Word:{0} Count:{1}", mostFrequent.Key, mostFrequent.Count());
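
As noted in the comments below, if you want the ten most frequent words rather than just one, the same query can end in Take(10), for example:

var top10 = File.ReadLines("Path")
    .SelectMany(l => l.Split())
    .GroupBy(word => word)
    .Select(g => new { Word = g.Key, Count = g.Count() })
    .OrderByDescending(x => x.Count)
    .Take(10);

foreach (var x in top10)
    Console.WriteLine("Word:{0} Count:{1}", x.Word, x.Count);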
Tim Schmelter
  • @ThomasLevesque: _"Now, to retrieve the most frequent word, sort the dictionary and take the top entry."_ I don't see why he needs a dictionary if he just wants to find the most frequent word+count. – Tim Schmelter Sep 01 '14 at 22:19
  • By very large, I mean a 1 TB file, or 10 TB, or more – Sunil Vurity Sep 01 '14 at 22:20
  • @TimSchmelter, I was referring to that part: "but most solutions are a favoring Trie Data structure for this , please clarify why" – Thomas Levesque Sep 01 '14 at 22:21
  • Not just a word; let's say I want the 10 most frequent words – Sunil Vurity Sep 01 '14 at 22:21
  • @ThomasLevesque: I think he is referring to solutions like this: http://stackoverflow.com/questions/12190326/parsing-one-terabyte-of-text-and-efficiently-counting-the-number-of-occurrences With files that large, my approach is not acceptable; however, I'll keep it for someone who needs a simple approach. – Tim Schmelter Sep 01 '14 at 22:22
  • @SunilVurity: If you need ten use `Take(10)` instead of `First` ;-) – Tim Schmelter Sep 01 '14 at 22:26