How can a DAWG be created? I have found that there are two ways; one is converting a trie into a dawg and the other being creating a new DAWG straight away? Which one is the easiest? Can you please elaborate on the two and provide some links?
1 Answers
One way to think about the DAWG is as a minimum-state DFA for all of the words in your word list. As a result, the traditional algorithm for constructing a DAWG is the following:
- Start off by constructing a trie for the collection of words.
- Add a new node to the trie with edges from itself to itself on all inputs.
- For each missing letter transition in the trie, add a transition from the start node to this new dead node.
- (At this point, you now have a (probably non-minimum) DFA for the set of words.)
- Minimize the DFA using the standard algorithm for DFA state minimization.
Once you have done this, you will be left with a DAWG for the set of words you are interested in.
The runtime of this algorithm is as follows. Constructing the initial DFA can be done by constructing a trie for all the original words (which takes time O(n), where n is the total number of characters in all the input strings), then filling in the missing transitions (which takes time O(n|Σ|), where |Σ| is the number of different characters in your alphabet). From there, the minimization algorithm runs in time O(n2 |Σ|). This means that the overall runtime for the algorithm is O(n2 |Σ|).
To the best of my knowledge, there is no straightforward algorithm for incrementally constructing DAWGs. Typically, you would build a DAWG for a set of words only if you already had all the words in advance. Intuitively, this is true because inserting a new word that has some suffixes already present in the DAWG might require a lot of restructuring of the DAWG to make certain old accepting states not accepting and vice-versa. Theoretically speaking, this results because inserting a new word might dramatically change the equivalence classes of the DFA's distinguishability relation, which might require substantial changes to the DFA's structure.
Hope this helps!

- 362,284
- 104
- 897
- 1,065
-
I am constructing a dawg for a dictionary? So point being i should basically build a trie first? If thats the case it would get easier because i have some references for creating a trie in a book? And i didn't completely understand the algorithm. Some visual help would be appreciated or a link maybe. Thankyou in advance. – Usman Amjed Dec 24 '12 at 21:19
-
@UsmanAmjed- I don't have any visualizations for the DFA minimization algorithm, since the algorithm is somewhat complicated. However, it's honestly not all that hard to code up. As for constructing tries - I would **strongly** suggest not trying to code up a DAWG until you feel comfortable working with tries. Tries are much simpler than DAWGs and give the same performance guarantees, but with a much higher memory overhead. I would suggest reading up on them first. – templatetypedef Dec 24 '12 at 21:22
-
Yeah i would. I am constructing a dictionary of around 500000 words; would the memory overhead be huge? Should i even go for dawgs? Are the very different from tries as far as traversing etc are concerned? – Usman Amjed Dec 24 '12 at 21:27
-
1@UsmanAmjed- The trie is definitely going to have a huge memory overhead compared to the DAWG in this case, so compressing the words would be useful. Depending on your application, though, you might be better off using a hash table or some other structure. – templatetypedef Dec 24 '12 at 21:28
-
It is a text predictor with auto complete; auto correct and spell checker fucnctions – Usman Amjed Dec 24 '12 at 21:31
-
What kind of system are you running this on? My "small" computer has 4GB of ram, 500000 [ten times more than an average oxford english dictionary!] words with an average length of 10 characters and about 30 bytes overhead in the data structure still only makes 15MB - not like you need the 16GB my large computer has to cope with that. – Mats Petersson Dec 24 '12 at 21:46
-
@MatsPetersson- Oh whoops, I read that as five million, not five hundred thousand. I was thinking five million strings x ten chars per string x 26 pointers per node x 4-8 bytes per pointer, which is pushing it. But if I'm off by a factor of ten, that looks a lot more reasonable. – templatetypedef Dec 24 '12 at 21:47
-
1@templatetypedef You can absolutely construct a dawg directly from a list of words, this is the standard goto paper, you can do it easier if the list is initially sorted, however it is possible if the list is unsorted also: http://www.aclweb.org/anthology/J00-1002.pdf – rossb83 Jul 24 '17 at 16:18
-
@rossb83 I stand corrected! Thanks for sharing! – templatetypedef Jul 24 '17 at 16:58