0

I am currently using a trie implementation from this stack overflow post:

Getting a list of words from a Trie

to return a list of words which match a given prefix. I'm then using regex to filter out the words which don't meet the entire specified pattern.

EX: if the pattern I'm searching for is: CH??S? and this is a subset of the dictionary which matches my initial prefix: {CHABAD, CHACHA, CHARIOT, CHATTED, CHEATER, CHOMSKY, CHANNEL CHAFED, CHAFER, CHAINS, CHAIRS, CHEESE, CHEESY CHRONO, CHUTES, CHISEL}

I would search the trie with 'CH' prefix and then filter out words which match my desired pattern of CH??S? (CHEESY, CHEESE, CHISEL) and return those.

I am wondering if there is a faster way to do this to avoid using the regex in the final step. I thought I could use a suffix tree (Ukkonen's suffix tree algorithm in plain English )or the boyer-moore algorithm but neither work because they search on suffixes not on patterns.

templatetypedef
  • 362,284
  • 104
  • 897
  • 1,065
219CID
  • 340
  • 5
  • 15
  • Can you turn your prefix search into a suffix search? For `CH??S?` search for `CH??` and then search for all of those plus `S?`. – markspace Apr 22 '20 at 17:26

1 Answers1

1

Here's a nice recursive algorithm you can use that eliminates the need to use a final regex pass. It works by matching a pattern P against a tree T:

FindMatches(pattern P, tree T) {
    if (tree is empty) {
        return that there are no matches;
    }

    if (pattern is empty) {
        report a match if T is a word;
    } else if (pattern[0] is a letter) {
        FindMatches(P[1:], T.childFor(pattern[0]));
    } else if (pattern[0] is ?) {
        for (each child C of T) {
             gather matches from FindMatches(P[1:], C);
        }
        report all matches found this way;
    } else {
        something is wrong with your pattern;
    }
}

Your approach of just matching CH follows along with this approach. However, when you get to the point where you're matching ? characters, your approach just does a DFS to find all words below the current point, whereas this approach instead just branches on one level and gathers up matches. Stated differently, this approach is a hybrid between "follow one character at a time if there's a clear character to follow" and "explore the whole subtree looking for matches," focusing the branching strategically.

templatetypedef
  • 362,284
  • 104
  • 897
  • 1,065