58

Recently I came across an interview question to create an algorithm in any language which should do the following:

  1. Read 1 terabyte of content
  2. Make a count for each recurring word in that content
  3. List the top 10 most frequently occurring words

Could you let me know the best possible way to create an algorithm for this?

Edit:

OK, let's say the content is in English. How can we find the top 10 words that occur most frequently in that content? My other doubt is: if they deliberately feed us data made of unique words, our buffer will blow up with a heap overflow. We need to handle that as well.

Peter O.
  • 32,158
  • 14
  • 82
  • 96
VIRA
  • 1,454
  • 1
  • 16
  • 25
  • The obvious answer is a hashtable where the key is the word and the value is the count of occurrences. But that is too easy so I assume there are some constraints on the solution. What other requirements are there? – David Thielen Aug 30 '12 at 05:23
  • 4
    Typically [Map Reduce](http://en.wikipedia.org/wiki/MapReduce) is a buzzword in this domain. I think such questions are not about any algorithm (details) in particular, but more about how to handle such an amount of data efficiently. – Christian.K Aug 30 '12 at 05:25
  • 1
    @Christian.K MapReduce (with the sort between the mapper and the reducer) is the right answer; you should probably do an answer instead of a comment. – Ray Toal Aug 30 '12 at 05:27
  • 1
    Guys, just a thought: it's a question about how you will read that much bulk data into RAM and parse it piece by piece. I may use a dictionary or hashtable, but if the input is made of distinct words (that's quite contrived, but say it's full of strings like aaaa aaa aaaaaa etc.), then you have to hold the 1TB of content in heap memory. We need an efficient parsing mechanism which reads the input and manages it while manipulating the strings. – VIRA Aug 30 '12 at 05:58
  • 2
    Jerry Coffin makes a good analysis of the problem. The big question here is what type of input we are talking about. Since it says words, you can normally assume a natural language, which reduces the vocabulary significantly. It depends on the position, but they probably want you to elaborate on all that. The question is vague, and if you don't spot those bits you fail just the same. The interviewer wants to see you analyse the question as well. Questions like "Is the input English?" are as important as blurting out hashtable or map-reduce. – rioki Sep 11 '12 at 10:01
  • 4
    Since you have stated the input data is English, running out of heap space shouldn't be a problem. [OED estimates the number of distinct words in English to be about 750000](http://oxforddictionaries.com/words/how-many-words-are-there-in-the-english-language), and holding 750000 key value pairs in memory should be no trouble at all for any relatively modern machine. I imagine the same is true for any other natural language as well. – verdesmarald Sep 13 '12 at 06:14
  • @Raj +1 A really interesting question. It's very good for seeing how a candidate thinks. Lots of good answers, but my favourite is Dominique Jacquel's for spotting how much easier it is if you can sort the data first. – TarkaDaal Sep 13 '12 at 08:58
  • 1
    If it were a race I'd stop after the first 1GB, very likely to be representative – William Entriken Jul 14 '14 at 16:49

16 Answers

66

Interview Answer

This task is interesting without being too complex, so it's a great way to start a good technical discussion. My plan to tackle it would be:

  1. Split the input data into words, using whitespace and punctuation as delimiters
  2. Feed every word found into a Trie structure, updating the counter in the node that represents the word's last letter
  3. Traverse the fully populated tree to find the nodes with the highest counts

In the context of an interview ... I would demonstrate the idea of the Trie by drawing the tree on a board or paper: start from an empty tree, then build it up from a single sentence containing at least one recurring word, say "the cat can catch the mouse". Finally, show how the tree can be traversed to find the highest counts. I would then justify how this tree provides good memory usage, good word lookup speed (especially in the case of natural language, where many words derive from each other), and is suitable for parallel processing.

Draw on the board

(image: the example trie drawn on a board)

Demo

The C# program below goes through 2GB of text in 75 seconds on a 4-core Xeon W3520, maxing out 8 threads. Performance is around 4.3 million words per second with less-than-optimal input parsing code. With the Trie structure storing the words, memory is not an issue when processing natural language input.

Notes:

  • test text obtained from the Gutenberg Project
  • the input parsing code assumes line breaks and is pretty sub-optimal
  • removal of punctuation and other non-word characters is not done very well
  • handling one large file instead of several smaller ones would require a small amount of code to start the reader threads at specified offsets within the file (a rough sketch of this follows)
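
As a rough sketch of that last note (not part of the timed program below; `ReadRange`, the range convention and the single-byte text assumption are illustrative only), each thread could be given a byte range of the one large file and resynchronise on whitespace at the boundaries:

// Requires: using System; using System.IO; using System.Text;
// Each thread processes the byte range [start, end) of the same file, assuming
// single-byte text so byte offsets line up with characters. Convention: a range
// starting mid-text skips forward to the next whitespace (the previous range
// reads through that word), and finishes any word still open past its end offset.
static void ReadRange(string path, long start, long end, Action<string> onWord)
{
    using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read))
    {
        fs.Seek(start, SeekOrigin.Begin);
        int b;
        if (start > 0)
        {
            // skip the (possibly partial) word at the boundary
            while ((b = fs.ReadByte()) != -1 && !char.IsWhiteSpace((char)b)) { }
        }
        var word = new StringBuilder();
        while (fs.Position <= end && (b = fs.ReadByte()) != -1)
        {
            char c = (char)b;
            if (char.IsWhiteSpace(c))
            {
                if (word.Length > 0) { onWord(word.ToString()); word.Clear(); }
            }
            else word.Append(c);
        }
        if (word.Length > 0) // finish the word straddling the end offset
        {
            while ((b = fs.ReadByte()) != -1 && !char.IsWhiteSpace((char)b)) word.Append((char)b);
            onWord(word.ToString());
        }
    }
}

Each reader thread would then call something like ReadRange(path, i * rangeSize, (i + 1) * rangeSize, w => root.AddWord(w)).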

using System;
using System.Collections.Generic;
using System.Collections.Concurrent;
using System.IO;
using System.Threading;

namespace WordCount
{
    class MainClass
    {
        public static void Main(string[] args)
        {
            Console.WriteLine("Counting words...");
            DateTime start_at = DateTime.Now;
            TrieNode root = new TrieNode(null, '?');
            Dictionary<DataReader, Thread> readers = new Dictionary<DataReader, Thread>();

            if (args.Length == 0)
            {
                args = new string[] { "war-and-peace.txt", "ulysees.txt", "les-miserables.txt", "the-republic.txt",
                                      "war-and-peace.txt", "ulysees.txt", "les-miserables.txt", "the-republic.txt" };
            }

            if (args.Length > 0)
            {
                foreach (string path in args)
                {
                    DataReader new_reader = new DataReader(path, ref root);
                    Thread new_thread = new Thread(new_reader.ThreadRun);
                    readers.Add(new_reader, new_thread);
                    new_thread.Start();
                }
            }

            foreach (Thread t in readers.Values) t.Join();

            DateTime stop_at = DateTime.Now;
            Console.WriteLine("Input data processed in {0} secs", new TimeSpan(stop_at.Ticks - start_at.Ticks).TotalSeconds);
            Console.WriteLine();
            Console.WriteLine("Most commonly found words:");

            List<TrieNode> top10_nodes = new List<TrieNode> { root, root, root, root, root, root, root, root, root, root };
            int distinct_word_count = 0;
            int total_word_count = 0;
            root.GetTopCounts(ref top10_nodes, ref distinct_word_count, ref total_word_count);
            top10_nodes.Reverse();
            foreach (TrieNode node in top10_nodes)
            {
                Console.WriteLine("{0} - {1} times", node.ToString(), node.m_word_count);
            }

            Console.WriteLine();
            Console.WriteLine("{0} words counted", total_word_count);
            Console.WriteLine("{0} distinct words found", distinct_word_count);
            Console.WriteLine();
            Console.WriteLine("done.");
        }
    }

    #region Input data reader

    public class DataReader
    {
        static int LOOP_COUNT = 1; // the timings quoted below re-read the input 100 times; raise this to reproduce them
        private TrieNode m_root;
        private string m_path;        

        public DataReader(string path, ref TrieNode root)
        {
            m_root = root;
            m_path = path;
        }

        public void ThreadRun()
        {
            for (int i = 0; i < LOOP_COUNT; i++) // fake a large data set by parsing a smaller file multiple times
            {
                using (FileStream fstream = new FileStream(m_path, FileMode.Open, FileAccess.Read))
                {
                    using (StreamReader sreader = new StreamReader(fstream))
                    {
                        string line;
                        while ((line = sreader.ReadLine()) != null)
                        {
                            string[] chunks = line.Split(null);
                            foreach (string chunk in chunks)
                            {
                                m_root.AddWord(chunk.Trim());
                            }
                        }
                    }
                }
            }
        }
    }

    #endregion

    #region TRIE implementation

    public class TrieNode : IComparable<TrieNode>
    {
        private char m_char;
        public int m_word_count;
        private TrieNode m_parent = null;
        private ConcurrentDictionary<char, TrieNode> m_children = null;

        public TrieNode(TrieNode parent, char c)
        {
            m_char = c;
            m_word_count = 0;
            m_parent = parent;
            m_children = new ConcurrentDictionary<char, TrieNode>();            
        }

        public void AddWord(string word, int index = 0)
        {
            if (index < word.Length)
            {
                char key = word[index];
                if (char.IsLetter(key)) // should do that during parsing but we're just playing here! right?
                {
                    if (!m_children.ContainsKey(key))
                    {
                        m_children.TryAdd(key, new TrieNode(this, key));
                    }
                    m_children[key].AddWord(word, index + 1);
                }
                else
                {
                    // not a letter! retry with next char
                    AddWord(word, index + 1);
                }
            }
            else
            {
                if (m_parent != null) // empty words should never be counted
                {
                    lock (this)
                    {
                        m_word_count++;                        
                    }
                }
            }
        }

        public int GetCount(string word, int index = 0)
        {
            if (index < word.Length)
            {
                char key = word[index];
                if (!m_children.ContainsKey(key))
                {
                    return -1;
                }
                return m_children[key].GetCount(word, index + 1);
            }
            else
            {
                return m_word_count;
            }
        }

        public void GetTopCounts(ref List<TrieNode> most_counted, ref int distinct_word_count, ref int total_word_count)
        {
            if (m_word_count > 0)
            {
                distinct_word_count++;
                total_word_count += m_word_count;
            }
            if (m_word_count > most_counted[0].m_word_count)
            {
                most_counted[0] = this;
                most_counted.Sort();
            }
            foreach (char key in m_children.Keys)
            {
                m_children[key].GetTopCounts(ref most_counted, ref distinct_word_count, ref total_word_count);
            }
        }

        public override string ToString()
        {
            if (m_parent == null) return "";
            else return m_parent.ToString() + m_char;
        }

        public int CompareTo(TrieNode other)
        {
            return this.m_word_count.CompareTo(other.m_word_count);
        }
    }

    #endregion
}

Here is the output from processing the same 20MB of text 100 times across 8 threads.

Counting words...
Input data processed in 75.2879952 secs

Most commonly found words:
the - 19364400 times
of - 10629600 times
and - 10057400 times
to - 8121200 times
a - 6673600 times
in - 5539000 times
he - 4113600 times
that - 3998000 times
was - 3715400 times
his - 3623200 times

323618000 words counted
60896 distinct words found
zeFrenchy
  • 6,541
  • 1
  • 27
  • 36
  • 1
    You could significantly improve the performance of this by joining on your various threads instead of waiting actively with your `while (!done)` loop, I think. – Clément Jun 25 '13 at 14:27
  • 1
    I edited the code to use thread joining as suggested. Performance is improved (only a little) and code is cleaner, so thank you for the suggestion. – zeFrenchy Jul 25 '13 at 08:51
  • 3
    @zeFrenchy How COOL is that! This is the first time that I am hearing about a Trie tree! Its a pretty cool way to store the sum of words! I also just read about a DAWG from one of the answers below, which brings the next question: what is the difference between a Trie tree and a DAWG? – Pavan Jan 17 '14 at 00:17
  • 1
    The Trie method will only let you 'share' word prefixes. DAWGs seem to add additional processing to allow 'sharing' of sequences at both ends. It looks interesting but I'm guessing word counting may become a little less simple to implement. If you need to find shared postfixes, you could always fill a Trie with all the words reversed ... not as memory efficient as a DAWG. – zeFrenchy Jan 17 '14 at 09:59
  • It'd be just as simple to have word count with a dawg as it uses an end of word marker. If the algorithm isn't case sensitive, using a TrieNode[26] to hold child nodes would probably be faster than a dictionary lookup. – Matt Searles Jan 18 '14 at 01:18
  • @zeFrenchy `sreader.ReadLine()` is a problem if the entire file consists of a single line. What other way would be the most efficient? I was thinking of reading into a buffer, but a single word could be split by buffers - dealing with this when I am guaranteed that no word is longer than a buffer is fairly trivial, but it is theoretically possible for a word to span multiple buffers... – nphx Mar 09 '15 at 23:59
  • You could read char by char into a buffer until you find a white space, which is pretty much what `string.split()` does anyway. This way you'd be doing your word search while reading the input, so it might not even be much slower. – zeFrenchy Mar 10 '15 at 10:04
  • @zeFrenchy Isn't reading through the entire stream one character at a time horribly inefficient though? – nphx Mar 11 '15 at 09:12
  • 1
    Possibly ... if so, you can read chunks into a buffer, then pop char by char from it. – zeFrenchy Mar 11 '15 at 09:39
  • @zeFrenchy Of course, why haven't I thought of that. Thank you. – nphx Mar 11 '15 at 09:43
  • Hello, great answer, just a question: what is the order of the size of a trie? I read somewhere that it is O(CW), C being the usual size (in characters) of a word, and W the number of words the trie will contain. Can you comment on that? Is it true or false? – Libathos Dec 22 '16 at 14:41
  • This is old stuff. Have you read the answer? I used 20MB of text and loaded it 100 times to simulate a bigger input file. Why do you think the entire text needs to be held in memory? The full code is there! Have you tried it? Of course not. – zeFrenchy Oct 03 '20 at 13:52
25

A great deal here depends on some things that haven't been specified. For example, are we trying to do this once, or are we trying to build a system that will do this on a regular and ongoing basis? Do we have any control over the input? Are we dealing with text that's all in a single language (e.g., English) or are many languages represented (and if so, how many)?

These matter because:

  1. If the data starts out on a single hard drive, parallel counting (e.g., map-reduce) isn't going to do any real good -- the bottleneck is going to be the transfer speed from the disk. Making copies to more disks so we can count faster will be slower than just counting directly from the one disk.
  2. If we're designing a system to do this on a regular basis, most of our emphasis is really on the hardware -- specifically, have lots of disks in parallel to increase our bandwidth and at least get a little closer to keeping up with the CPU.
  3. No matter how much text you're reading, there's a limit on the number of discrete words you need to deal with -- whether you have a terabyte or even a petabyte of English text, you're not going to see anything like billions of different words in English. Doing a quick check, the Oxford English Dictionary lists approximately 600,000 words in English.
  4. Although the actual words are obviously different between languages, the number of words per language is roughly constant, so the size of the map we build will depend heavily on the number of languages represented.

That mostly leaves the question of how many languages could be represented. For the moment, let's assume the worst case. ISO 639-2 has codes for 485 human languages. Let's assume an average of 700,000 words per language, and an average word length of, say, 10 bytes of UTF-8 per word.

Just stored as a simple linear list, that means we can store every word in every language on earth along with an 8-byte frequency count in a little less than 6 gigabytes. If we use something like a Patricia trie instead, we can probably plan on that shrinking at least somewhat -- quite possibly to 3 gigabytes or less, though I don't know enough about all those languages to be at all sure.
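
Worked out: 485 × 700,000 ≈ 340 million entries; at 10 bytes of UTF-8 plus an 8-byte count each, that is roughly 340 million × 18 bytes ≈ 6.1 billion bytes, i.e. a little under 6 GiB.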

Now, the reality is that we've almost certainly overestimated the numbers in a number of places there -- quite a few languages share a fair number of words, many (especially older) languages probably have fewer words than English, and glancing through the list, it looks like some are included that probably don't have written forms at all.

Summary: Almost any reasonably new desktop/server has enough memory to hold the map entirely in RAM -- and more data won't change that. For one (or a few) disks in parallel, we're going to be I/O-bound anyway, so parallel counting (and such) will probably be a net loss. We probably need tens of disks in parallel before any other optimization means much.

Jerry Coffin
  • 476,176
  • 80
  • 629
  • 1,111
  • 3
    +1 for doing the system analysis. The question with interview questions is always, what do they want to know? Can you program or can you design a system. – rioki Sep 11 '12 at 09:44
  • 1
    The [google 1-grams](http://books.google.com/ngrams/datasets) collection for *scanned books* in English alone weighs ~2GB (zipped, but there is overhead in the files for the #repeats; I believe net of 1.5GB after decompression is a reasonable assumption), so the conclusion is pretty much correct if we handle English alone (or any other language), but I am afraid the "every language on earth" is a bit optimistic for this purpose. (This also does not include bi-grams, which weigh ~20 GB zipped, or tri-grams, which are approximately double the bi-grams) – amit Sep 11 '12 at 10:50
  • Definitively a patricia/dawg structure should be used to index the words for most european languages – Kwariz Sep 12 '12 at 16:28
  • Minor point: The number of words in highly synthetic/agglutinative language (like Finnish for example) is potentially much higher than in English, but I would venture the number of words would still be nowhere near the memory limits of a decent desktop. – verdesmarald Sep 13 '12 at 06:40
  • Have fun `preg_split`ing this: 你好世界 :-) – William Entriken Jul 14 '14 at 16:46
15

You can try a map-reduce approach for this task. The advantage of map-reduce is scalability, so whether it is 1TB, 10TB or 1PB, the same approach will work and you will not need to do a lot of work to modify your algorithm for the new scale. The framework will also take care of distributing the work among all the machines (and cores) you have in your cluster.

First - create the (word, occurrences) pairs.
The pseudo code for this will be something like this:

map(document):
  for each word w:
     EmitIntermediate(w,"1")

reduce(word,list<val>):
   Emit(word,size(list))

Second, you can easily find the ones with the top-K highest occurrences with a single iteration over the pairs. This thread explains the concept. The main idea is to hold a min-heap of the top K elements, and while iterating make sure the heap always contains the top K elements seen so far. When you are done, the heap contains the top K elements.

A more scalable (though slower if you have few machines) alternative is to use the map-reduce sorting functionality: sort the data according to the occurrences and just grep the top K.
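
For the second step, here is a minimal sketch of the min-heap selection in C# (PriorityQueue<TElement, TPriority> needs .NET 6+; the (word, count) pairs are whatever the reduce step emitted):

// Requires: using System.Collections.Generic;
static List<(string Word, long Count)> TopK(IEnumerable<(string Word, long Count)> pairs, int k)
{
    // min-heap keyed on count: the root is always the smallest of the current top k
    var heap = new PriorityQueue<(string Word, long Count), long>();
    foreach (var p in pairs)
    {
        if (heap.Count < k)
            heap.Enqueue(p, p.Count);
        else if (p.Count > heap.Peek().Count)
        {
            heap.Dequeue();           // evict the smallest of the current top k
            heap.Enqueue(p, p.Count);
        }
    }
    var result = new List<(string Word, long Count)>();
    while (heap.Count > 0) result.Add(heap.Dequeue());
    result.Reverse();                 // most frequent first
    return result;
}

This is the O(n log k), single-pass version; the sort-based alternative trades that for a full parallel sort.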

Community
  • 1
  • 1
amit
  • 175,853
  • 27
  • 231
  • 333
  • 2
    Map reduce is such a newfangled approach and probably not really what they want to hear in an interview question. Unless you can also explain HOW map reduce actually works. This problem can be implemented with 3 lines of Perl code or 6 lines of C++. Map reduce actually is only really a great thing if used to scale over many processors. The 1 TB in this question is there so you can't put it all into memory... – rioki Sep 11 '12 at 09:36
  • @rioki: Given the fact that the concept of map-reduce is taken from the functional programming world, using these principles (and writing/reading to files when memory is starting to fill up) - The above pseudo-code can be implemented in any functional-programming language. – amit Sep 11 '12 at 10:26
  • But if you use a map reduce framework, the heavy lifting is done for you. There are big parts of "here happens magic" in those frameworks and these are exactly the pieces the question addresses. (At least I would want to hear about it.) – rioki Sep 12 '12 at 10:02
  • @rioki: This is true for the map-reduce framework. If you don't want to use it - this is a classical functional-programming method, which can be easily implemented in any functional language such as haskell or f#. – amit Sep 12 '12 at 10:05
  • 2
    I don't know why you insist on functional languages. I can also implement the exact algorithm it in C or any other language. The question is how to you process the data from the map to the reduce step. You then need to find an efficient way to do so. My naive implementation using a hashmap, marking loop and an evaluation loop in C is effectively sort of map reduce. But the answer refers to map reduce and if you are going to play that card in an interview, you better know how the bits work that are between the map and reduce steps. It gets interesting when you start to have multiple machines. – rioki Sep 12 '12 at 11:05
  • @rioki: (1) What's the problem with functional programming? Are you claiming C is better than Haskell or F#? Why? (2) IMHO, the interviewer assumed the map does not fit in memory (you need an average of ~500 repeats per word, of course it can change depending on the average length), which I don't know if it is realistic. (3) As a person who comes from the field of Information Retrieval and text analysis - I can assure you that map-reduce is the dominant way in the industry to find the number of occurrences of a term in a collection, which is later used to calculate a score of a document in a query. – amit Sep 12 '12 at 11:14
  • 1
    I am not claiming that C is "better" on an absolute scale. The problem is trivial to bleeding edge scientific research, depending on the boundary conditions. You said functional language, I said C as the most opposite counter example. The problem can be solved in myriad ways. If you read my comments you must come to the conclusion that I agree with you. My point, I hope it was clear, is that **in an interview** you need to know the details of at least one way to implement map-reduce, if you want to solve it with map reduce. – rioki Sep 17 '12 at 11:14
  • Especially for a high qualification position, that may do data analysis. My pet field is compiler construction. Parser generators are what you create a parser, everything else would be a waste of time. But if you want to work for me and want to use a parser generator, then you better know how it works (in principle). I also require from you that you can build a parser without one (for a very simple language). It it all about know your tools. As I said **in an interview**. – rioki Sep 17 '12 at 11:19
  • @amit can you further elaborate on your last paragraph ? – Geek Sep 17 '12 at 11:25
  • @Geek: The map-reduce framework has an inner parallel-sorting implementation; you can use it by mapping the data (in a second map-reduce step) so the key will be the #occurrences of each word, and the value will be the word itself. The reduce will be the identity function. The result will be a sorted file of tuples `(occurrences, word)`, from which you can simply grep the top k. It is more scalable because the sorting is done in parallel, but it is a full sort [`O(nlogn)`] and requires more IO, while the heap solution is `O(nlogk)` with a single IO iteration. – amit Sep 17 '12 at 16:48
9

Three things of note for this.

Specifically: the file is too large to hold in memory, the word list is (potentially) too large to hold in memory, and the word count can be too large for a 32-bit int.

Once you get past those caveats, it should be straightforward. The game is managing the potentially large word list.

If it's any easier (to keep your head from spinning).

"You're running a Z-80 8 bit machine, with 65K of RAM and have a 1MB file..."

Same exact problem.

Will Hartung
  • 115,893
  • 19
  • 128
  • 203
  • you are certainly correct; that is the hidden truth behind it. We need an efficient approach in order to get the result. – VIRA Aug 30 '12 at 06:00
5

It depends on the requirements, but if you can afford some error, streaming algorithms and probabilistic data structures can be interesting because they are very time and space efficient and quite simple to implement, for instance:

  • Heavy hitters (e.g., Space Saving), if you are interested only in the top n most frequent words
  • Count-min sketch, to get an estimated count for any word

Those data structures require very little constant space (the exact amount depends on the error you can tolerate).

See http://alex.smola.org/teaching/berkeley2012/streams.html for an excellent description of these algorithms.
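
For example, here is a bare-bones count-min sketch in C#. The width/depth defaults and the seeded hash are illustrative only, and for the top-10 question you would pair it with a heavy-hitters structure such as Space Saving to keep track of candidate words:

using System;

// Fixed-size grid of counters with one hash per row. Add is O(depth);
// Estimate returns the minimum over the rows and can only over-count.
public class CountMinSketch
{
    private readonly long[,] counts;
    private readonly int width;
    private readonly int depth;

    public CountMinSketch(int width = 1 << 20, int depth = 4)
    {
        this.width = width;
        this.depth = depth;
        counts = new long[depth, width];
    }

    private int Bucket(string item, int row)
    {
        // cheap per-row seeded hash; a real implementation would use pairwise-independent hashes
        unchecked
        {
            int h = 17 + row * 31;
            foreach (char c in item) h = h * 31 + c;
            return (int)((uint)h % (uint)width);
        }
    }

    public void Add(string item)
    {
        for (int row = 0; row < depth; row++)
            counts[row, Bucket(item, row)]++;
    }

    public long Estimate(string item)
    {
        long min = long.MaxValue;
        for (int row = 0; row < depth; row++)
            min = Math.Min(min, counts[row, Bucket(item, row)]);
        return min;
    }
}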

alienhard
  • 14,376
  • 9
  • 37
  • 28
3

I'd be quite tempted to use a DAWG (wikipedia, and a C# writeup with more details). It's simple enough to add a count field on the leaf nodes, it is memory-efficient, and it performs very well for lookups.

EDIT: Though have you tried simply using a Dictionary<string, int>, where the key is the word and the value is the count? Perhaps you're trying to optimize too early?
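
On that Dictionary<string, int> suggestion, a minimal sketch (ReadWords is a stand-in for whatever tokenizer you use; System.Linq is assumed):

var counts = new Dictionary<string, int>();
foreach (string word in ReadWords("input.txt"))   // ReadWords: hypothetical streaming tokenizer
{
    counts.TryGetValue(word, out int c);          // c is 0 for unseen words
    counts[word] = c + 1;
}

foreach (var pair in counts.OrderByDescending(p => p.Value).Take(10))
    Console.WriteLine("{0} - {1} times", pair.Key, pair.Value);

If the distinct-word count stays in the hundreds of thousands, as it does for natural language, this is all the "algorithm" you need.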

editor's note: This post originally linked to this wikipedia article, which appears to be about another meaning of the term DAWG: A way of storing all substrings of one word, for efficient approximate string-matching.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
Matt Searles
  • 2,656
  • 2
  • 16
  • 18
  • +1 I've just spent a happy few minutes reading about this - that's brilliant. – TarkaDaal Sep 13 '12 at 09:03
  • @MattSearles Hey, what is the difference between DAWG and a Trie tree? I just found out about a Trie tree when reading the answer up above! What a cool way of large quantities of text! it then becomes very easy to find words that share the same prefix, suffix, etc. – Pavan Jan 17 '14 at 00:12
  • @Pavan the most important difference is that nodes/vertices are shared in a DAWG, http://en.wikipedia.org/wiki/Deterministic_acyclic_finite_state_automaton has more info – Matt Searles Jan 18 '14 at 01:09
  • Hey @MattSearles thank you, I managed to found that out shortly after asking the question after having done a little more research. I thought you might have come back with more differences but it seems thats possibly the main one. I did have one question about DAWG sharing nodes, if they share nodes how do you know where the EOF for each word is ? For example we have `{TO, TOPS, TAPS, TAPSTER}` how do you know that there is a word `TAPS` if its shared with `TAPSTER` and same with the word `TO` shared with `TOPS`? Someone looking at this DAWG would think there are only two words present? – Pavan Jan 18 '14 at 01:28
  • @MattSearles continued. On top of that, lets take `TOPSTITCHED`, because DAWG shares the same nodes, the letter `S` would be shared between `TOPS` and `TAPS` but the word extends with `TITCHED` yet how would you tell at first glance that `TAPS` cannot extend onwards adding `..TITCHED` making `TAPSTITCHED` which is not a word but `TOPSTITCHED` is and therefor can continue. In the Trie tree you can clearly see distinctions between words as each word that is connected has a # count next to each node, with seperated branches but not with DAWG. Or am I missing something? *uh oh* I'm confused again! – Pavan Jan 18 '14 at 01:37
  • @Pavan if you check the wiki link I posted above and take a look at the image it'll be much clearer. In a dawg, you put a bool flag on each node called IsEndOfWord :) so if you added TO and TOPS, the O node would be marked IsEndOfWord and so would the S node. You simply work down a branch until you hit an IsEndOfWord with the word length you're looking for, or at every IsEndOfWord if you want the whole list. I've not heard of combining suffixes though. The nodes only overlap for common prefixes in a DAWG – Matt Searles Jan 19 '14 at 19:21
  • @Pavan and Matt: If the nodes only overlap for common prefixes, it's a Trie. The DAFSA wikipedia article covers this. I guess a DAWG can only merge substrings in limited cases? So `TAPS` would need its own branch for the 3rd-`P` and 4th-`S`? Also note the last paragraph which proposes a non-trivial way to use a DAWG as a map from strings to associated data (e.g. a count of duplicates). Multiple words can have the same leaf node in a DAWG, so storing counts is only as simple as you hoped for Tries. I've been thinking about DAWGs for http://stackoverflow.com/q/32535222/224132. – Peter Cordes Sep 14 '15 at 13:34
  • Is it just me, or does that dotnetperls C# writeup look pretty terrible? It's not very well organized, and skips over some important details. The quality of one of the conclusions "This can yield C# programs that are a thousand times faster than C programs." seems about average for the quality of all the articles I looked at on the site. Because C programs can only linear search? Anyway, future readers, I'd advise looking for other DAWG resources, like http://xlinux.nist.gov/dads//HTML/directedAcyclicWordGraph.html – Peter Cordes Sep 14 '15 at 16:58
2

A different solution could be using an SQL table and letting the system handle the data as well as it can. First, create a table with a single field, word, with one row for each word in the collection.

Then use this query (sorry for any syntax issues, my SQL is rusty - this is actually pseudo-code):

SELECT word, COUNT(*) AS c FROM myTable GROUP BY word ORDER BY c DESC

The general idea is to first generate a table (which is stored on disk) with all the words, and then use a query to count and sort the (word, occurrences) pairs for you. You can then just take the top K from the retrieved list.


To all: If I indeed have any syntax or other issues in the SQL statement: feel free to edit

amit
  • 175,853
  • 27
  • 231
  • 333
  • @O.D: At least as an interview question (and not an exam one): I see no reason why not. The interviewer at least will be impressed by the idea and "thinking out of the box", and if he is looking for a different approach he will just ask for one. I can see no harm from giving this approach as a solution in an interview - I'd give credits for the candidate. – amit Aug 30 '12 at 07:01
  • It's a good idea, no question, and it might count in an interview, but presenting SQL tables as an algorithm ... in an exam I would have gotten minus points, if any at all. – CloudyMarble Aug 30 '12 at 07:07
  • @O.D: But SO is not about *general cases* - it is about answering the OP question, and the OP said he encountered it in an interview question. `Recently I came across an interview question...` – amit Aug 30 '12 at 07:10
  • No problem man, just wanted to mention this; I'm not downvoting the answer or anything, it's just the idea. – CloudyMarble Aug 30 '12 at 07:14
2

First, I only recently "discovered" the Trie data structure and zeFrenchy's answer was great for getting me up to speed on it.

I did see in the comments several people making suggestions on how to improve its performance, but these were only minor tweaks, so I thought I'd share what I found to be the real bottleneck... the ConcurrentDictionary.

I'd wanted to play around with thread-local storage, and your sample gave me a great opportunity to do that. After some minor changes to use a Dictionary per thread and then combine the dictionaries after the Join(), I saw performance improve by ~30% (processing 20MB 100 times across 8 threads went from ~48 sec to ~33 sec on my box).

The code is pasted below and you'll notice not much has changed from the accepted answer.

P.S. I don't have more than 50 reputation points so I could not put this in a comment.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading;

namespace WordCount
{
    class MainClass
    {
        public static void Main(string[] args)
        {
            Console.WriteLine("Counting words...");
            DateTime start_at = DateTime.Now;
            Dictionary<DataReader, Thread> readers = new Dictionary<DataReader, Thread>();
            if (args.Length == 0)
            {
                args = new string[] { "war-and-peace.txt", "ulysees.txt", "les-miserables.txt", "the-republic.txt",
                                      "war-and-peace.txt", "ulysees.txt", "les-miserables.txt", "the-republic.txt" };
            }

            List<ThreadLocal<TrieNode>> roots;
            if (args.Length == 0)
            {
                roots = new List<ThreadLocal<TrieNode>>(1);
            }
            else
            {
                roots = new List<ThreadLocal<TrieNode>>(args.Length);

                foreach (string path in args)
                {
                    ThreadLocal<TrieNode> root = new  ThreadLocal<TrieNode>(() =>
                    {
                        return new TrieNode(null, '?');
                    });

                    roots.Add(root);

                    DataReader new_reader = new DataReader(path, root);
                    Thread new_thread = new Thread(new_reader.ThreadRun);
                    readers.Add(new_reader, new_thread);
                    new_thread.Start();
                }
            }

            foreach (Thread t in readers.Values) t.Join();

            foreach(ThreadLocal<TrieNode> root in roots.Skip(1))
            {
                roots[0].Value.CombineNode(root.Value);
                root.Dispose();
            }

            DateTime stop_at = DateTime.Now;
            Console.WriteLine("Input data processed in {0} secs", new TimeSpan(stop_at.Ticks - start_at.Ticks).TotalSeconds);
            Console.WriteLine();
            Console.WriteLine("Most commonly found words:");

            List<TrieNode> top10_nodes = new List<TrieNode> { roots[0].Value, roots[0].Value, roots[0].Value, roots[0].Value, roots[0].Value, roots[0].Value, roots[0].Value, roots[0].Value, roots[0].Value, roots[0].Value };
            int distinct_word_count = 0;
            int total_word_count = 0;
            roots[0].Value.GetTopCounts(top10_nodes, ref distinct_word_count, ref total_word_count);

            top10_nodes.Reverse();
            foreach (TrieNode node in top10_nodes)
            {
                Console.WriteLine("{0} - {1} times", node.ToString(), node.m_word_count);
            }

            roots[0].Dispose();

            Console.WriteLine();
            Console.WriteLine("{0} words counted", total_word_count);
            Console.WriteLine("{0} distinct words found", distinct_word_count);
            Console.WriteLine();
            Console.WriteLine("done.");
            Console.ReadLine();
        }
    }

    #region Input data reader

    public class DataReader
    {
        static int LOOP_COUNT = 100;
        private TrieNode m_root;
        private string m_path;

        public DataReader(string path, ThreadLocal<TrieNode> root)
        {
            m_root = root.Value;
            m_path = path;
        }

        public void ThreadRun()
        {
            for (int i = 0; i < LOOP_COUNT; i++) // fake a large data set by parsing a smaller file multiple times
            {
                using (FileStream fstream = new FileStream(m_path, FileMode.Open, FileAccess.Read))
                using (StreamReader sreader = new StreamReader(fstream))
                {
                    string line;
                    while ((line = sreader.ReadLine()) != null)
                    {
                        string[] chunks = line.Split(null);
                        foreach (string chunk in chunks)
                        {
                            m_root.AddWord(chunk.Trim());
                        }
                    }
                }
            }
        }
    }

    #endregion

    #region TRIE implementation

    public class TrieNode : IComparable<TrieNode>
    {
        private char m_char;
        public int m_word_count;
        private TrieNode m_parent = null;
        private Dictionary<char, TrieNode> m_children = null;

        public TrieNode(TrieNode parent, char c)
        {
            m_char = c;
            m_word_count = 0;
            m_parent = parent;
            m_children = new Dictionary<char, TrieNode>();            
        }

        public void CombineNode(TrieNode from)
        {
            foreach(KeyValuePair<char, TrieNode> fromChild in from.m_children)
            {
                char keyChar = fromChild.Key;
                if (!m_children.ContainsKey(keyChar))
                {
                    m_children.Add(keyChar, new TrieNode(this, keyChar));
                }
                m_children[keyChar].m_word_count += fromChild.Value.m_word_count;
                m_children[keyChar].CombineNode(fromChild.Value);
            }
        }

        public void AddWord(string word, int index = 0)
        {
            if (index < word.Length)
            {
                char key = word[index];
                if (char.IsLetter(key)) // should do that during parsing but we're just playing here! right?
                {
                    if (!m_children.ContainsKey(key))
                    {
                        m_children.Add(key, new TrieNode(this, key));
                    }
                    m_children[key].AddWord(word, index + 1);
                }
                else
                {
                    // not a letter! retry with next char
                    AddWord(word, index + 1);
                }
            }
            else
            {
                if (m_parent != null) // empty words should never be counted
                {
                    m_word_count++;                        
                }
            }
        }

        public int GetCount(string word, int index = 0)
        {
            if (index < word.Length)
            {
                char key = word[index];
                if (!m_children.ContainsKey(key))
                {
                    return -1;
                }
                return m_children[key].GetCount(word, index + 1);
            }
            else
            {
                return m_word_count;
            }
        }

        public void GetTopCounts(List<TrieNode> most_counted, ref int distinct_word_count, ref int total_word_count)
        {
            if (m_word_count > 0)
            {
                distinct_word_count++;
                total_word_count += m_word_count;
            }
            if (m_word_count > most_counted[0].m_word_count)
            {
                most_counted[0] = this;
                most_counted.Sort();
            }
            foreach (char key in m_children.Keys)
            {
                m_children[key].GetTopCounts(most_counted, ref distinct_word_count, ref total_word_count);
            }
        }

        public override string ToString()
        {
            return BuildString(new StringBuilder()).ToString();
        }

        private StringBuilder BuildString(StringBuilder builder)
        {
            if (m_parent == null)
            {
                return builder;
            }
            else
            {
                return m_parent.BuildString(builder).Append(m_char);
            }
        }

        public int CompareTo(TrieNode other)
        {
            return this.m_word_count.CompareTo(other.m_word_count);
        }
    }

    #endregion
}
Clay Ver Valen
  • 1,033
  • 6
  • 10
1

As a quick general algorithm I would do this.

Create a map with entries being the count for a specific word and the key being the actual string.  

for each string in content:
   if string is a valid key for the map:
      increment the value associated with that key
   else
      add a new key/value pair to the map with the key being the word and the count being one
done

Then you could just find the 10 largest values in the map:


create an array size 10 with data pairs of (word, count) 

for each value in the map
    if current pair has a count larger than the smallest count in the array
        replace that pair with the current one

print all pairs in array
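
A direct C# rendering of the two steps above (ReadWords is a placeholder for your tokenizer; the second loop keeps a fixed array of the 10 best pairs and replaces the smallest whenever a larger count shows up):

var counts = new Dictionary<string, int>();
foreach (string word in ReadWords(content))       // ReadWords: hypothetical tokenizer
{
    if (counts.ContainsKey(word))
        counts[word]++;
    else
        counts[word] = 1;
}

var top = new KeyValuePair<string, int>[10];      // counts default to 0
foreach (var pair in counts)
{
    int smallest = 0;
    for (int i = 1; i < top.Length; i++)
        if (top[i].Value < top[smallest].Value) smallest = i;
    if (pair.Value > top[smallest].Value)
        top[smallest] = pair;
}

foreach (var pair in top)
    Console.WriteLine("{0}: {1}", pair.Key, pair.Value);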
zanegray
  • 768
  • 7
  • 13
  • 3
    The purpose of the "1 terabyte" part of the question is to make the straightforward approach of using a map (if by map you do mean something you expect to fit in memory) untenable. – Ray Toal Aug 30 '12 at 05:35
  • @RayToal: By my analysis (see my answer) that's generally untrue -- and more data won't change that. – Jerry Coffin Aug 31 '12 at 18:11
  • 1
    You can't hold the file in memory, but the map you can. There are some fuzzy bits, but we probably can assume that it is text on only one language. In the case of English this is around 600.000 words; probably fewer are used commonly. All you need to do is read the file word by word and you are good. – rioki Sep 11 '12 at 09:48
  • @rioki: For this to happen (the map fits in memory), you need each word to repeat ~500 times on average (could be more or less depending on the average word's length). I don't know if this assumption is realistic. – amit Sep 11 '12 at 10:28
  • In my opinion, the algorithm should be abstracted from the hardware and data structure implementations optimized for that hardware. – zanegray Sep 15 '12 at 06:44
0

Well, the first thought is to maintain a database in the form of a hashtable/array or whatever to save each word's occurrence count, but given the data size I would rather:

  • Get the first 10 words found whose occurrence count is >= 2
  • Count how many times these words occur in the entire string, and delete them while counting
  • Start again; once you have two sets of 10 words each, you keep the 10 most frequent words of both sets
  • Do the same for the rest of the string (which doesn't contain these words anymore).

You can even try to be more efficient and start with the first 10 words found whose occurrence count is >= 5, for example, or more; if none are found, reduce this value until 10 words are found. This way you have a good chance of avoiding the memory-intensive storage of all words' occurrence counts, which is a huge amount of data, and you can save scanning rounds (in a good case).

But in the worst case you may have more rounds than in a conventional algorithm.

By the way, it's a problem I would try to solve with a functional programming language rather than OOP.

CloudyMarble
  • 36,908
  • 70
  • 97
  • 130
  • How would you handle deleting words from the string? Wouldn't that be a very slow process? – comecme Sep 10 '12 at 18:44
  • This could surely be an issue; it may even be more performant not to delete them, I'm not sure, since this would take longer in the next rounds. – CloudyMarble Sep 11 '12 at 04:57
0

Well, personally, I'd split the file into chunks of, say, 128MB, keeping two in memory at all times while scanning. Any discovered word is added to a hash list along with a list of counts, and then I'd iterate the list of lists at the end to find the top 10...

Chibueze Opata
  • 9,856
  • 7
  • 42
  • 65
  • Reading in blocks bigger than CPU cache size is a bad idea. Memory access to the dictionary is the main bottleneck in adding each word to the set. (well, really disk I/O will the bottleneck.) – Peter Cordes Sep 14 '15 at 01:51
  • I very much agree, if you're using an SSD though, you could try splitting to the number of processors and loading the huge chunks into virtual memory. Otherwise, you could try a single 16 x 32kbit read to take maximum advantage of cache. – Chibueze Opata Sep 14 '15 at 08:08
0

The method below will only read your data once and can be tuned for memory sizes.

  • Read the file in chunks of say 1GB
  • For each chunk, make a list of, say, the 5000 most frequently occurring words with their frequencies
  • Merge the lists based on frequency (1000 lists with 5000 words each)
  • Return the top 10 of the merged list

Theoretically you might miss words, although I think that chance is very, very small.
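
A minimal sketch of this chunk-and-merge approach (ReadChunks and ReadWords are placeholders for the chunking and tokenizing code; the per-chunk truncation to 5000 words is exactly what makes the result approximate, as noted above):

// Requires: using System; using System.Collections.Generic; using System.Linq;
const int N = 5000;
var merged = new Dictionary<string, long>();

foreach (string chunk in ReadChunks("input.txt"))      // hypothetical ~1GB chunks
{
    var local = new Dictionary<string, long>();
    foreach (string word in ReadWords(chunk))           // hypothetical tokenizer
    {
        local.TryGetValue(word, out long c);
        local[word] = c + 1;
    }
    // keep only this chunk's heaviest N words and fold them into the running totals
    foreach (var kv in local.OrderByDescending(p => p.Value).Take(N))
    {
        merged.TryGetValue(kv.Key, out long total);
        merged[kv.Key] = total + kv.Value;
    }
}

foreach (var kv in merged.OrderByDescending(p => p.Value).Take(10))
    Console.WriteLine("{0} - {1}", kv.Key, kv.Value);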

IvoTops
  • 3,463
  • 17
  • 18
  • 1
    You would probably fail my interview with this answer. All we want to know are the frequencies of words. You can read the file word by word and store the words in a hash table with their frequencies. The 1TB refers to the input, not the table size. – rioki Sep 11 '12 at 09:54
  • That is assuming that the number of words fits into memory. My method will work without that requirement. – IvoTops Sep 11 '12 at 13:47
  • True, it took me a while to fully grasp the extent of the question. As I wrote somewhere else in the comments, you probably will fail the interview if you just blurt out an answer without asking about more details. – rioki Sep 12 '12 at 09:58
0

Storm is the technology to look at. It separates the role of data input (spouts) from processing (bolts). Chapter 2 of the Storm book solves this exact problem and describes the system architecture very well - http://www.amazon.com/Getting-Started-Storm-Jonathan-Leibiusky/dp/1449324010

Storm is more real-time processing, as opposed to batch processing with Hadoop. If your data is pre-existing then you can distribute the load to different spouts and spread it for processing to different bolts.

This approach will also support data growing beyond terabytes, as the data will be analysed as it arrives in real time.

NiladriBose
  • 1,849
  • 2
  • 16
  • 31
0

Very interesting question. It relates more to logic analysis than coding. With the assumption of English and valid sentences it becomes easier.

You don't have to count all words, just the ones with a length less than or equal to the average word length of the given language (for English that is 5.1). Therefore you will not have problems with memory.

As for reading the file, you should choose a parallel mode, reading chunks (of a size of your choice) by aligning file positions to white space. If you decide to read chunks of 1MB, for example, all chunks except the first one should be a bit wider (+22 bytes on the left and +22 bytes on the right, where 22 represents the longest English word - if I'm right). For parallel processing you will need a concurrent dictionary, or local collections that you will merge.

Keep in mind that normally you will end up with a top ten that is part of a valid stop word list (this is probably another, reverse approach, which is also valid as long as the content of the file is ordinary).

0

Try to think of a special data structure to approach this kind of problem. In this case a special kind of tree, such as a trie, stores strings in a specific, very efficient way. An alternative is to build your own solution around plain word counting. I guess this TB of data would be in English, and we have around 600,000 words in general, so it would be possible to store only those words and count how often each string is repeated; this solution will need a regex to eliminate some special characters. The first solution will be faster, I'm pretty sure.

http://en.wikipedia.org/wiki/Trie

Here is an implementation of a trie in Java:
http://algs4.cs.princeton.edu/52trie/TrieST.java.html

blueberry0xff
  • 3,707
  • 30
  • 18
0

MapReduce
WordCount can be achieved efficiently through MapReduce using Hadoop: https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#Example%3A+WordCount+v1.0. Large files can be parsed through it; it uses multiple nodes in the cluster to perform this operation.

public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}

public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
Saurabh
  • 7,525
  • 4
  • 45
  • 46