What is the best data structure to store words found in a document and a counter with their occurences?

Question

Let's say I have a corpus of documents which I want to read one by one and store them in a data structure. The structure will probably be a list of something. That something class will define a single document. Inside that class I'll have to use a data structure to store the contents from each document, what that should be? Also, if I want to count occurrences of words and retrieve the most frequent words in each document, will I have to use a data structure that will allow me to do this in time < O(n) that would take to examine all the contents sequentially?

score 2 · Accepted Answer · answered Jan 27 '11 at 09:11

2

Use an associative array, also called map or dictionary since different programming languages use different terms for the same data structure.

Every entry key would be a word and the counter would be the value of the entry. For example

{
  'on' -> 15,
  'and' -> 43,
  'I' -> 157,
  'confluence' -> 1,
  'dear' -> 2
}

answered Jan 27 '11 at 09:11

Boris Pavlović

63,078
28
122
148

That's what I thought too but in this case, after reading the document if I want to find the most frequent words, won't I have to iterate the whole structure to find them? If I use this structure, is there a better way to find the most frequent words without having to iterate again the whole structure? – user579674 Jan 27 '11 at 09:29
1

There's no way of avoiding iteration. You may implement an associative array where keys would be frequencies and values sets of words. In this case you'd be having to iterate on frequency increment. Let's suppose that word 'and' has a frequency of 155. Once the algorithm encounters it again it will have to remove it from the set of words with frequency of 155 which is impossible without iteration and add it to a set of words with frequency of 156. Needless to mention that you'll have to maintain integrity by hand because a word can be located in two sets with different frequencies... – Boris Pavlović Jan 27 '11 at 10:01
@Boris Pavlovic Yes you are right. So, can I have a separate structure for holding the most frequent words? In this case which would be the most appropriate? Something like a max heap wouldn't be possible? – user579674 Jan 27 '11 at 11:25
1

You may flatten the map before picking up the most frequent words. Iterate through the map and build another one which will be having frequencies as keys and sets of words as values. For this intermediately map use tree map implementation which maintains order of keys. At the end query this map for most frequent words. – Boris Pavlović Jan 27 '11 at 12:05
@Boris Pavlovic Do you know the equivalent structure in C# for the tree map? – user579674 Jan 27 '11 at 18:16
here's one: http://stackoverflow.com/questions/477954/java-treemap-equivalent-in-c – Boris Pavlović Jan 27 '11 at 21:57

What is the best data structure to store words found in a document and a counter with their occurences?

1 Answers1