Can a DAWG be used to store auxiliary information relating to each path, e.g. a word's frequency in the English language? If yes, then how can i do that?
3 Answers
Yes, in general, a directed acyclic weighted graph (DAWG) can be annotated, either by node, edge, or more complicated structure, such as a given path, which I will take a sequence of nodes and edges. You can subclass an existing structure to include this information, or if this is not possible, you could hash from the structure to the annotation.

- 4,113
- 21
- 32
-
2Can you elaborate on this? I think this is a bit more subtle than you might think it is. You can't necessarily store per-node or per-edge information for a word uniquely, because if you did then you might accidentally have multiple different words share that information. If you were to instead store all the information per-node or per-path separately, you've effectively expanded the DAWG out into a trie, completely eliminating the efficiency gains. – templatetypedef Dec 24 '12 at 20:55
-
^^ I have the same question. Can you provide an example? – Usman Amjed Dec 24 '12 at 21:08
-
To be fair, I answered the question as literally given: "can one annotate paths in a directed weighted acyclic graph". Yes, one can. "Can one efficiently annotate paths in a directed weighted acyclic graph, particularly compared this other implementation?" Likely not. Overall, it may be good to expand the question to give others more context about the intent. – John with waffle Jan 04 '13 at 17:07
Typically, you cannot store per-word information in a DAWG the same way that you could in a trie or other data structure. The reason for this is that multiple different words in a DAWG might all share nodes, so there is a risk that information for one word "leaks" into information for other words.
As a simple example, suppose that we have a DAWG for the words "is", "as", "i", and "a". In that case, the DAWG would look like this:
START
a / \ i
ACC ACC
s \ / s
ACC
Note that the node representing the words "as" and "is" are exactly the same node. Therefore, if you were to try to annotate the word "as" with information, the node holding that information would also be the same as the node for "is," meaning that "as" and 'is" would both pick up the same set of information.
You could try to get around this by storing a map in the node for "as" and "is" that maps from the word ending at that node to the extra information about that word, but this dramatically increases the memory usage of the DAWG. You're now storing each character in the word, so your memory usage will go way up (remember that the whole point of the DAWG is to reduce the memory usage required to store a set of words). You'd be better off just storing a hash table that maps from words to information.
Another option you might try to store this information would be to expand out each path through the DAWG into its own branch, so that the nodes for different words are always different. The problem with this approach, though, is that you're effectively converting the DAWG back into a trie, which dramatically increases the memory usage involved.
In short, there is no straightforward way to annotate words in a DAWG with metainformation without greatly increasing the memory usage. If you have to do this, you're better off using a different data structure.
Hope this helps!

- 362,284
- 104
- 897
- 1,065
-
Which data structure would you suggest? Should i go for a trie? People talk about the memory usage of the trie? But couldn't i like build the trie and then save it; not building it again and again. Plus is the memory problem that huge? – Usman Amjed Dec 24 '12 at 21:42
-
An online reference suggest using this method for a DAWG: "Because the terminal nodes of a DAWG can be reached by multiple paths, a DAWG cannot directly store auxiliary information relating to each path, e.g. a word's frequency in the English language. However, if at each node we store a count of the number of unique paths through the structure from that point, we can use it to retrieve the index of a word, or a word given its index.[1] The auxiliary information can then be stored in an array." – Usman Amjed Dec 24 '12 at 21:43
Yes you can. Each path from start of dawg to end of word is unique, and that path can be indexed as an integer. That index number can then be mapped to auxiliary information.
See the paper here: http://www.ic.unicamp.br/~reltech/1992/92-01.pdf See the a good implementation here: https://github.com/WojciechMula/pyDAWG/blob/master/dawg_mph.c#L37

- 1,694
- 4
- 21
- 40