Wildcard match reverse lookup

Question

I have a list of patterns that contain zero or more wildcard chars (*) anywhere in their body:

bleurgh
p0*
p*w
p*w*
*01
*.nowsich.* (dots here are meaningless)

Only wildcards are allowed (this is not full regex of any sort) and they can appear anywhere in the pattern. A wildcard-only pattern is not allowed, and the double-wildcard (**) is nonsensical since it's identical to * (but guaranteed someone will try it.) There are on the order of a 100,000 to a million of these.

The code will see new target strings that it might match:

p01w01
pod01whiskey02
ppp.nowsich.com
aZL8u4qXfg!LooksLikeRandomGarbageToMe!kx961giRVV
callmeishmael

These strings are unbounded but likely to be less than, say, 32 chars, and the code will see them on the order of twice a second (this is a low rate sort of thing.)

I'm looking for a way to reverse lookup the patterns to which the strings might match: (inputs to outputs, here)

m01z01 -> *01
p03w01 -> p0*, p*w*, *01
bleurgh -> bleurgh
www.nowsich.org -> *.nowsich.*
wut -> [the empty list]

I'm not reasonably constrained by memory; faster lookup is definitely better.

The best I can come up with so far is to build an directed graph of pattern components, where each component is the output of split("*", pattern) and trailing wildcards are bound to their leaf:

        ___   {implied root node} 
       /      /    |           \
[bleurgh]  [p0]   [p]          [] // leading wildcard?
           /      /           /  \
         [w]    [w*]       [01]  [.nowsich.*]

...and do a DFS on the tree, selecting those subtrees whose root nodes regex match the pattern recompiled from the subtree's parent nodes (all the way up to the root.) I don't like the idea of what, O(log n)? regexes but the dataset isn't huge. Also, I believe I have to always search the "leading wildcard" branch of the tree, so maybe searching a pair of these trees (one with and one without the leading wildcard) is the way to go.

There is some previous art, which doesn't seem to apply for various reasons:

Good algorithm and data structure for looking up words with missing letters?

Efficient string matching algorithm

Both of those have their own constraints that don't jibe with mine.

The question is, a) does the approach I outlined above make sense; and b) do you have a better approach?

score 1 · Answer 1 · answered Sep 21 '18 at 15:58

Did some actual writing to get this done, and here's what I learned:

This idea does work, but instead of using a tree, use a trie. When testing strings against the patterns it'll let you skip over nodes that have no value and only one child, which reduces the number of regexes you need to do.
Trailing and leading wildcards need special treatment, at least in my case (recall that only the wildcard -- * -- char has special meaning in my case.) If there's no leading or trailing wildcard, you need to include an anchor on the first or last key stored for the pattern. That means that the pattern *p0* and p0* are represented by two nodes in the tree. (Similarly for the leaf nodes -- i.e., the final token in the pattern.)
Another note specific to my situation: I needed to escape everything that's meaningful to regex but not meaningful to me -- e.g., chars ., [, ], {, }, and so on. It'd be possible to include these in the pattern matching trie but you'd have to tokenize differently.
Storing a compiled regex against which to match is a speed boost. In my case, this is being written in Go, so I could store the result of regexp.Compile([interstitial-pattern]) for every node in the tree.

Wildcard match reverse lookup

1 Answers1