0

How can a DAWG be created? I have found that there are two ways; one is converting a trie into a dawg and the other being creating a new DAWG straight away? Which one is the easiest? Can you please elaborate on the two and provide some links?

templatetypedef
  • 362,284
  • 104
  • 897
  • 1,065
Usman Amjed
  • 89
  • 1
  • 7

1 Answers1

5

One way to think about the DAWG is as a minimum-state DFA for all of the words in your word list. As a result, the traditional algorithm for constructing a DAWG is the following:

  1. Start off by constructing a trie for the collection of words.
  2. Add a new node to the trie with edges from itself to itself on all inputs.
  3. For each missing letter transition in the trie, add a transition from the start node to this new dead node.
  4. (At this point, you now have a (probably non-minimum) DFA for the set of words.)
  5. Minimize the DFA using the standard algorithm for DFA state minimization.

Once you have done this, you will be left with a DAWG for the set of words you are interested in.

The runtime of this algorithm is as follows. Constructing the initial DFA can be done by constructing a trie for all the original words (which takes time O(n), where n is the total number of characters in all the input strings), then filling in the missing transitions (which takes time O(n|Σ|), where |Σ| is the number of different characters in your alphabet). From there, the minimization algorithm runs in time O(n2 |Σ|). This means that the overall runtime for the algorithm is O(n2 |Σ|).

To the best of my knowledge, there is no straightforward algorithm for incrementally constructing DAWGs. Typically, you would build a DAWG for a set of words only if you already had all the words in advance. Intuitively, this is true because inserting a new word that has some suffixes already present in the DAWG might require a lot of restructuring of the DAWG to make certain old accepting states not accepting and vice-versa. Theoretically speaking, this results because inserting a new word might dramatically change the equivalence classes of the DFA's distinguishability relation, which might require substantial changes to the DFA's structure.

Hope this helps!

templatetypedef
  • 362,284
  • 104
  • 897
  • 1,065
  • I am constructing a dawg for a dictionary? So point being i should basically build a trie first? If thats the case it would get easier because i have some references for creating a trie in a book? And i didn't completely understand the algorithm. Some visual help would be appreciated or a link maybe. Thankyou in advance. – Usman Amjed Dec 24 '12 at 21:19
  • @UsmanAmjed- I don't have any visualizations for the DFA minimization algorithm, since the algorithm is somewhat complicated. However, it's honestly not all that hard to code up. As for constructing tries - I would **strongly** suggest not trying to code up a DAWG until you feel comfortable working with tries. Tries are much simpler than DAWGs and give the same performance guarantees, but with a much higher memory overhead. I would suggest reading up on them first. – templatetypedef Dec 24 '12 at 21:22
  • Yeah i would. I am constructing a dictionary of around 500000 words; would the memory overhead be huge? Should i even go for dawgs? Are the very different from tries as far as traversing etc are concerned? – Usman Amjed Dec 24 '12 at 21:27
  • 1
    @UsmanAmjed- The trie is definitely going to have a huge memory overhead compared to the DAWG in this case, so compressing the words would be useful. Depending on your application, though, you might be better off using a hash table or some other structure. – templatetypedef Dec 24 '12 at 21:28
  • It is a text predictor with auto complete; auto correct and spell checker fucnctions – Usman Amjed Dec 24 '12 at 21:31
  • What kind of system are you running this on? My "small" computer has 4GB of ram, 500000 [ten times more than an average oxford english dictionary!] words with an average length of 10 characters and about 30 bytes overhead in the data structure still only makes 15MB - not like you need the 16GB my large computer has to cope with that. – Mats Petersson Dec 24 '12 at 21:46
  • @MatsPetersson- Oh whoops, I read that as five million, not five hundred thousand. I was thinking five million strings x ten chars per string x 26 pointers per node x 4-8 bytes per pointer, which is pushing it. But if I'm off by a factor of ten, that looks a lot more reasonable. – templatetypedef Dec 24 '12 at 21:47
  • 1
    @templatetypedef You can absolutely construct a dawg directly from a list of words, this is the standard goto paper, you can do it easier if the list is initially sorted, however it is possible if the list is unsorted also: http://www.aclweb.org/anthology/J00-1002.pdf – rossb83 Jul 24 '17 at 16:18
  • @rossb83 I stand corrected! Thanks for sharing! – templatetypedef Jul 24 '17 at 16:58