Meta-information in DAWG/DAFSA

Question

I would like to implement a string look-up data structure, for dynamic strings, that will support efficient search and insertion. Currently, I am using a trie but I would like to reduce the memory footprint if possible. This Wikipedia article describes a DAWG/DAFSA, which will obviously save a lot of space over a trie by compressing suffixes. However, while it will clearly test whether a string is legal, it is not obvious to me if there is any way to exclude illegal strings. For example, using the words "cite" and "cat" where the "t" and "e" are terminal states, a DAWG/DAFSA would look like this:

and "cit" and "cate" will be incorrectly recognized as legal strings without some meta-information.

Questions:

1) Is there a preferred way to store meta-information about strings/paths (such as legality) in a DAWG/DAFSA?

2) If a DAWG/DAFSA is incompatible with the requirements (efficient search/insertion and storing meta-information) what's the best data structure to use? A minimal memory footprint would be nice, but perhaps not absolutely necessary.

in DAWG-DAFSA symbols are edges, not nodes. – Jasen Jul 02 '19 at 04:38 — Jasen, Jul 02 '19 at 04:38

score 3 · Answer 1 · answered Dec 17 '14 at 20:13

In a DAWG, you only compress states together if they're completely indistinguishable from one another. This means that you actually wouldn't combine the T nodes for CAT and CITE together for precisely the reason you've noted - that gives you either a false positive on CIT or a false negative on CAT.

DAWGs are typically most effective for static dictionaries when you have a huge number of words with common suffixes. A DAWG for all of English, for example, could save a lot of space by combining all the suffix "s"'s at the end of plural words and most of the "ING" suffixes from gerunds. If you're going to be doing a lot of insertions or deletions, DAWGs are almost certainly the wrong data structure for the job because adding or removing a single word from a DAWG can cause ripple effects that require lots of branches that were previously combined to be split or vice-versa.

Quite honestly, for reasonably-sized data sets, a trie isn't a bad call. A trie for all of English would only use up something like 26MB, which isn't very much. I would only go with the DAWG if space usage really is at a premium and you aren't doing many insertions or deletions.

Hope this helps!

Definitely helps. Many thanks. The Wikipedia article didn't address the issues of false positives/negatives at all. My strings are not limited to English words so I am concerned about the long-term consequences of implementing a trie without planning for some pruning or other maintenance, which led me to wonder about implementing some compression. — Schemer, Dec 17 '14 at 20:18
26 MB might be affordable, but while reducing the memory footprint, we might be also improving look-up times because a larger proportion of the dictionary can fit into the caches. — Evgeni Sergeev, Nov 25 '16 at 10:05

Meta-information in DAWG/DAFSA

1 Answers1