74

What would be the best data structure to store all the words of a dictionary? The best I could think of was to use a HashMap, which will map to a HashTable. Basically, depending upon the first character, we will get the associated HashTable and then using this, we can add the words starting from that character. We'll then pick a good hash function based on the string.

Is there a better approach?

templatetypedef
  • 362,284
  • 104
  • 897
  • 1,065
Jatin
  • 31,116
  • 15
  • 98
  • 163

1 Answers1

154

Depending on what you want to do, there are many good data structures.

If you just want to store the words and ask "is this word here or not?", a standard hash table with no other fancy machinery is a reasonable approach. If that word is list fixed in advance, consider using a perfect hash table to get excellent performance and space usage.

If you want to be able to check if a given prefix exists while supporting fast lookups, a trie is a good option, though it can be a bit space-inefficient. It also supports fast insertions or deletions. It also allows for iteration in alphabetical order, which hashing doesn't offer. This is essentially the structure you've described in your answer, but depending on the use case other representations of tries might be better.

If in addition to the above, you know for a fact that the word list is fixed, consider using a DAWG (directed acyclic word graph), which is essentially a minimum-state DFA for the language. It's substantially more compact than the trie, but supports many of the same operations.

If you want trie-like behavior but don't want to pay a huge space penalty, the ternary search tree is another viable option, as is the radix tree. These are very different structures, but can be much better than the trie in different circumstances.

If space is a concern but you want a trie, look into the succinct trie representation, which has slower lookups but just about theoretically optimal space usage. The link discusses how it's being used in JavaScript as an easy way to transmit a huge amount of data. An alternative compact representation is the double-array trie, though admittedly I know very little about it.

If you want to use the dictionary for operations like spell-checking where you need to find words similar to other words, the BK-tree is an excellent data structure to consider.

starball
  • 20,030
  • 7
  • 43
  • 238
templatetypedef
  • 362,284
  • 104
  • 897
  • 1,065
  • 3
    +1 A comment: _though it can be a bit space-efficient_ ... inefficient, right? – Gert Arnold Apr 04 '12 at 19:36
  • @GertArnold- Whoops! Thanks for spotting that. Fixed. – templatetypedef Apr 04 '12 at 19:38
  • Perfect in every sense. Thanks :) – Jatin Apr 04 '12 at 20:20
  • @templatetypedef Sir, this can be modeled using a trie as you have mentioned. But I have a question here. TRIES are generally fed the key-data,so in this case, lets assume the name of the word. So, obviously we need an extra mapping of the word names with the word meaning. So, what kind of datastructure should one be using here? I mean one possible option is to go ahead with the hashtable approach. With that, we can identify the word name. But isnt having hashtable and trie at the same time an overkill? Im very curious. Please suggest. Thanks, – Pavan Dittakavi Jun 10 '12 at 18:18
  • 2
    @Pavan- Each node in the Trie already stores a bit representing whether or not the node is a word. You can replace that bit with a pointer to a string containing the definition of the word (if one exists) or null (if it's not a word). – templatetypedef Jun 10 '12 at 18:53
  • 1
    @templatetypedef What about if i need to find synonyms?? – Vivek Vardhan Mar 31 '15 at 07:26
  • @templatetypedef Correct if the Trie stores where the word to be searched is found and then can have a pointer to either list or some data which holds meanings or synonyms. – Krishna Oza Jul 16 '15 at 12:00
  • 1
    Depending on the needs, a set of bloom filters would allow extremely fast lookups (with a small chance of a false positive) and also be very space efficient. – Adrian McCarthy Jun 21 '16 at 22:14
  • @AdrianMcCarthy, re "small chance"; Even one is one too much. Bloom filters are useless for pretty much everything except for caches since computers are expected to give **definite** results, thus with or without a bloom filter, you still need whatever you need. – Pacerier Apr 09 '23 at 03:21
  • @Pacerier: As I said, it depends on the needs. Bloom filters were the only viable solution for early spelling checkers, where speed and memory efficiency were far more important than 0 false positives. A complete, literal word list also causes false positives in this context, since a user's misspelling could match a different but obscure word or an archaic spelling. If the odds of a Bloom filter false positive can be driven down to the same order of magnitude as a wrong-word match, the speed and memory efficiency make it a huge win. – Adrian McCarthy Apr 09 '23 at 18:36
  • @AdrianMcCarthy, It's not about odds at all. If your spelling checker puts a red curvy line for a word like "apple", "bottle", "chair", "drag", "elephant", then it has failed. What your probability for highlighting words like "john" "denmark" "pan troglodytes" are is completely irrelevant. – Pacerier Apr 11 '23 at 00:34
  • @templatetypedef, With all due respect to steve hanov, his article is very badly written. Pretty much unreadable. I wonder why you don't link to another article. – Pacerier Apr 11 '23 at 00:36
  • @Pacerier: A Bloom filter can give a false positive but never a false negative. Query a Bloom filter with a word that's in its dictionary, it will confirm that it's in its dictionary. It will _never_ flag "apple" as a misspelling. Given a misspelled word, however, there's a tiny chance it'll report that it is in the dictionary. Thus the spelling checker _might_ overlook a misspelling, though the odds of that are small and can be tuned to be very close to zero. – Adrian McCarthy Apr 11 '23 at 17:26