I have a list of patterns that contain zero or more wildcard chars (*
) anywhere in their body:
bleurgh
p0*
p*w
p*w*
*01
*.nowsich.* (dots here are meaningless)
Only wildcards are allowed (this is not full regex of any sort) and they can appear anywhere in the pattern. A wildcard-only pattern is not allowed, and the double-wildcard (**
) is nonsensical since it's identical to *
(but guaranteed someone will try it.) There are on the order of a 100,000 to a million of these.
The code will see new target strings that it might match:
p01w01
pod01whiskey02
ppp.nowsich.com
aZL8u4qXfg!LooksLikeRandomGarbageToMe!kx961giRVV
callmeishmael
These strings are unbounded but likely to be less than, say, 32 chars, and the code will see them on the order of twice a second (this is a low rate sort of thing.)
I'm looking for a way to reverse lookup the patterns to which the strings might match: (inputs to outputs, here)
m01z01 -> *01
p03w01 -> p0*, p*w*, *01
bleurgh -> bleurgh
www.nowsich.org -> *.nowsich.*
wut -> [the empty list]
I'm not reasonably constrained by memory; faster lookup is definitely better.
The best I can come up with so far is to build an directed graph of pattern components, where each component is the output of split("*", pattern)
and trailing wildcards are bound to their leaf:
___ {implied root node}
/ / | \
[bleurgh] [p0] [p] [] // leading wildcard?
/ / / \
[w] [w*] [01] [.nowsich.*]
...and do a DFS on the tree, selecting those subtrees whose root nodes regex match the pattern recompiled from the subtree's parent nodes (all the way up to the root.) I don't like the idea of what, O(log n)
? regexes but the dataset isn't huge. Also, I believe I have to always search the "leading wildcard" branch of the tree, so maybe searching a pair of these trees (one with and one without the leading wildcard) is the way to go.
There is some previous art, which doesn't seem to apply for various reasons:
Good algorithm and data structure for looking up words with missing letters?
Efficient string matching algorithm
Both of those have their own constraints that don't jibe with mine.
The question is, a) does the approach I outlined above make sense; and b) do you have a better approach?