
Question: Which data structure is more efficient for finding the n most frequent words in a text file: hash tables or priority queues?

I've previously asked a question related to this subject; however, after the creative responses I got confused, and I've settled on two data structures that I can actually implement easily: hash tables vs. priority queues.

Priority queue confusion: To be honest, I've watched a lecture on YouTube about priority queues and understood every component of them, but when it comes to applying them, I get confused. Using a binary heap I can easily implement the priority queue; my challenge is matching its operations to the frequency problem.

My hash table idea: Since deciding on the hash table's size was a bit uncertain, I went with what made the most sense to me: 26, for the number of letters in the alphabet. In addition, with a good hash function it would be efficient. However, reaching out again and again to the linked lists (using separate chaining for collisions) and incrementing each word's integer count by 1 wouldn't, in my opinion, be efficient.
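In case it helps, here is a rough sketch of the 26-bucket idea I have in mind (bucket by first letter, separate chaining with linked lists of word/count pairs):

```
#include <cctype>
#include <list>
#include <string>
#include <utility>

// Rough sketch: 26 buckets keyed by the first letter, each bucket a
// linked list of (word, count) pairs (separate chaining).
struct WordCounter {
    std::list<std::pair<std::string, int>> buckets[26];

    void add(const std::string& word) {
        if (word.empty()) return;
        int b = std::tolower(static_cast<unsigned char>(word[0])) - 'a';
        if (b < 0 || b >= 26) return;          // skip non-alphabetic starts
        for (auto& entry : buckets[b])         // walk the chain...
            if (entry.first == word) { ++entry.second; return; }
        buckets[b].emplace_back(word, 1);      // ...or append a new word
    }
};
```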

Sorry for the long post, but as fellow programmers, which one would you recommend? If the priority queue, can you give me some ideas on how to relate it to my problem? If the hash table, could anything be done to make it even more efficient?

Ali
  • I think a hash table is as good as it gets. – usr Apr 20 '12 at 23:37
  • @usr Thanks for the comment! Can you also give me an idea/critique/advice on whether my plan of implementing the hash table with size 26, because of the alphabet size, is a good one? – Ali Apr 20 '12 at 23:38
  • 2
    @rolandbishop I agree on using a hash. Because the distribution of the first letter of words in the English language is not uniform picking a hash function that puts words into 26 bins would be a poor choice. Even if the distribution was uniform this is far too few bins. Check out this previous discussion from SO on '[What's a good hash function for English words](http://stackoverflow.com/questions/7700400/whats-a-good-hash-function-for-english-words)'. – vpiTriumph Apr 20 '12 at 23:49

2 Answers


A hash table would be the faster of the two choices offered, besides making more sense. Rather than choosing size 26, if you have an estimate of the total number of unique words (most people's vocabularies, outside of specialized technical terms, aren't much bigger than 10,000; 20,000 is really big, and 30,000 is for people who make a hobby of collecting words), make the size big enough that you don't expect to ever fill it, so the probability of a collision stays low, say not more than 25%. If you want to be more conservative, implement a function that rehashes the contents of the table into a table of twice the original size (and make the new size a prime, so only approximately twice the original size).
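For instance (just a sketch, assuming a C++11 std::unordered_map is acceptable rather than a hand-rolled table), you can pre-size for the estimated vocabulary so the load factor, and with it the chance of collisions, stays low:

```
#include <string>
#include <unordered_map>

// Sketch: pre-size the table for an estimated vocabulary (20,000 is
// just that - an estimate) so the load factor stays around 25%.
std::unordered_map<std::string, int> make_counter(std::size_t expected_words = 20000) {
    std::unordered_map<std::string, int> counts;
    counts.max_load_factor(0.25f);   // keep the table sparsely filled
    counts.reserve(expected_words);  // allocate enough buckets up front
    return counts;
}
```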

Now since this is tagged C++, you might ask yourself why you aren't just using a multiset straight out of the standard template library. It will keep a count of how many of each word you enter into it.
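For instance, a minimal sketch of the multiset approach (reading whitespace-separated words from standard input):

```
#include <iostream>
#include <set>
#include <string>

int main() {
    std::multiset<std::string> words;
    std::string w;
    while (std::cin >> w)
        words.insert(w);                      // duplicates are kept

    std::cout << words.count("the") << '\n';  // frequency of a given word
    return 0;
}
```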

In either case you'll need to make a separate pass to find which of the words are the n most frequent, as you only have the frequencies, not the rank order of the frequencies.
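A rough sketch of that second pass, assuming the counts ended up in a std::unordered_map (any word-to-count container works) and using a size-n min-heap, which is also where a priority queue fits this problem naturally:

```
#include <functional>
#include <queue>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Sketch: keep a min-heap of at most n (count, word) entries; whatever
// survives the pass over the counts is the n most frequent words.
std::vector<std::pair<int, std::string>>
top_n(const std::unordered_map<std::string, int>& counts, std::size_t n) {
    using Entry = std::pair<int, std::string>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;

    for (const auto& kv : counts) {
        heap.emplace(kv.second, kv.first);
        if (heap.size() > n) heap.pop();      // evict the current minimum
    }

    std::vector<Entry> result;
    while (!heap.empty()) { result.push_back(heap.top()); heap.pop(); }
    return result;                            // ascending by count
}
```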

DRVic
  • I've got a few questions. 1) Using the 'multiset' you suggested would still take O(n) time to count the frequencies, right? 2) Are there any differences between 'multiset' and 'std::map' when the map is used as a word-to-count map? – Ali Apr 21 '12 at 00:50
  • 1
    +1 Right. There are N words, and each one needs to be searched for in the table. Since a hash table is O(1), the whole thing is O(N). Then he just needs to store in the hash table a struct-per-word, so he has a place to increment the count. – Mike Dunlavey Apr 21 '12 at 01:38
  • With a hash table, the second pass to find the n most frequent has to spend time examining all the entries that are still empty, which costs something - probably not as much as N log N, where N is the size of the input, but possibly, depending on how well the table is sized. In terms of asymptotic cost the hash table may be faster than the multiset (or even map<word, count>), depending on how well the table size is optimized. In reality it isn't clear, since we don't even know the size of the text file. And the multiset gives the answer most directly. – DRVic Apr 23 '12 at 10:37
  • With regard to a difference between multiset and std::map<word, int>, as far as I can tell, one reasonable implementation of multiset would be as a std::map<type, int>. I'm not aware of any difference other than the interface. – DRVic Apr 23 '12 at 10:55

Why don't you use a generic/universal string hashing function? After all, you don't want to count by the first letter; you want to count over all possible words. I'd also keep the bucket count dynamic; if not, you will need to do insane amounts of linked-list traversal.
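For example (just a sketch, with FNV-1a picked as one simple general-purpose option), you hash the whole word and take the result modulo whatever the current bucket count happens to be:

```
#include <cstdint>
#include <string>

// Sketch: FNV-1a, a simple general-purpose string hash. Use the result
// modulo the current (dynamically grown) bucket count to pick a bucket.
std::uint64_t fnv1a(const std::string& s) {
    std::uint64_t h = 14695981039346656037ULL;  // FNV offset basis
    for (unsigned char c : s) {
        h ^= c;
        h *= 1099511628211ULL;                  // FNV prime
    }
    return h;
}
```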

usr
  • Can you please tell me, in terms of counting frequencies, would the efficiency of a hash table (even if it was well implemented) be the same as a simple 'std::map'? – Ali Apr 21 '12 at 00:52
  • This needs to be measured. My guess is that a normal hash table with automatic resizing and internal linking will be optimal. AFAIK this is the gold standard of hash tables. – usr Apr 21 '12 at 13:19