It's not entirely clear what all your desired behaviours and constraints are. For example, in left-to-right order as you've listed the example filters, could a filter contain "" (your match-all "wildcard"), then another non-"" string?
Anyway, assuming you don't allow filters to have a "" wildcard followed by a non-"" term, you could create an index of your filters with this type:
std::unordered_map<std::string,
std::variant<filter*,
std::unordered_map<std::string,
std::variant<filter*,
std::unordered_map<std::string,
std::variant<filter*,
std::unordered_map<std::string,
std::variant<filter*,
std::unordered_map<std::string, filter*>
> > > > > > > > index;
This effectively describes a tree where each level may hold either an exact string match or a "" wildcard filter match, yielding either a filter* to the matched filter (if that's the last non-"" term in the filter) or the next level of the index.
Given your candidate for filtering, v1, and assuming you've done the obvious work to populate the index, you can then search like this:
std::vector<filter*> match(const std::array<std::string, 5>& v,
                           const auto& index) {
    std::vector<filter*> matches;
    auto it0 = index.find("");
    if (it0 != index.end())
        matches.push_back(std::get<filter*>(it0->second)); // match-everything filter
    if ((it0 = index.find(v[0])) == index.end())
        return matches; // no non-wildcard match
    const auto& i1 = std::get<1>(it0->second); // next level in the index
    auto it1 = i1.find("");
    // ...same kind of logic, working your way through the 5 levels...
}
This involves at most 10 hash-table lookups (two per level: one for the "" wildcard, one for the exact term), so it's O(1) with respect to the filter-set size.
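To flesh out the elided levels, here's one compilable sketch of both the population step and the full five-level search, using compile-time recursion so the five levels don't have to be written out by hand. Note that `Filter`, the `LevelN` aliases, and `insert_at`/`match_at` are names I've made up for illustration; it assumes candidate terms are never empty, and that a filter whose remaining terms are all "" is stored under the "" key at the level where its last non-"" term ends:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>
#include <unordered_map>
#include <variant>
#include <vector>

struct Filter { std::string name; };  // hypothetical filter payload

// The nested index type from above, spelled out with aliases.
using Level4 = std::unordered_map<std::string, Filter*>;
using Level3 = std::unordered_map<std::string, std::variant<Filter*, Level4>>;
using Level2 = std::unordered_map<std::string, std::variant<Filter*, Level3>>;
using Level1 = std::unordered_map<std::string, std::variant<Filter*, Level2>>;
using Index  = std::unordered_map<std::string, std::variant<Filter*, Level1>>;

// Store `f` under its terms; a "" term means "match anything from here on",
// so the filter is stored under the "" key at that level.
template <typename Map>
void insert_at(Map& level, const std::array<std::string, 5>& terms,
               std::size_t i, Filter* f) {
    using Mapped = typename Map::mapped_type;
    if constexpr (std::is_same_v<Mapped, Filter*>) {
        level[terms[i]] = f;               // innermost level holds Filter* directly
    } else if (terms[i].empty()) {
        level[""] = f;                     // wildcard tail: filter ends here
    } else {
        auto& node = level[terms[i]];
        using Next = std::variant_alternative_t<1, Mapped>;
        if (!std::holds_alternative<Next>(node))
            node = Next{};                 // first filter passing through this key
        insert_at(std::get<Next>(node), terms, i + 1, f);
    }
}

// Collect every filter matching candidate `v` (all terms assumed non-empty):
// at each level, check the "" wildcard key, then descend on the exact key.
template <typename Map>
void match_at(const Map& level, const std::array<std::string, 5>& v,
              std::size_t i, std::vector<Filter*>& out) {
    using Mapped = typename Map::mapped_type;
    if constexpr (std::is_same_v<Mapped, Filter*>) {
        if (auto it = level.find(""); it != level.end())
            out.push_back(it->second);     // filter ending in a "" wildcard
        if (auto it = level.find(v[i]); it != level.end())
            out.push_back(it->second);     // fully exact filter
    } else {
        if (auto it = level.find(""); it != level.end())
            out.push_back(std::get<Filter*>(it->second));  // wildcard from here on
        if (auto it = level.find(v[i]); it != level.end())
            match_at(std::get<std::variant_alternative_t<1, Mapped>>(it->second),
                     v, i + 1, out);
    }
}

std::vector<Filter*> match(const Index& index,
                           const std::array<std::string, 5>& v) {
    std::vector<Filter*> out;
    match_at(index, v, 0, out);
    return out;
}
```

The `if constexpr` branch on the innermost level terminates the recursion, so the compiler generates one instantiation per level rather than you writing the lookups out five times.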
That doesn't necessarily mean it'll be the best option performance-wise for any given data set: knowledge of the typical string lengths, of patterns in the strings (such as being likely to start with the same characters and only differ in the last few), and of the size of the filter set may suggest an actually-faster approach.
How well the hash table works will also depend on the hash function used. For longish strings, GCC's string hashing (a MurmurHash variant) is of massively higher quality than Visual C++'s (which, for speed, incorporates only 10 characters spaced along the string), and GCC uses prime bucket counts, which can be much less collision-prone than Visual C++'s power-of-2 choice.

Other optimisations may be suggested by knowledge of the data, e.g. if you know the strings are always < 8 characters, you may choose to put them into a zero-padded uint64_t and do direct equality comparisons, and use an open addressing / closed hashing hash table implementation to reduce CPU cache misses and improve speed.
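The short-string trick might look something like this (a sketch only; `pack8` is a made-up name, and it assumes the strings really are at most 8 characters):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

// Pack a string of up to 8 characters into a zero-padded uint64_t, so that
// equality tests become a single integer comparison and the value can be
// hashed and stored directly in an open-addressing table. The packed value
// is endianness-dependent, but equality comparisons are consistent either way.
uint64_t pack8(const std::string& s) {
    uint64_t key = 0;  // zero padding for strings shorter than 8 characters
    std::memcpy(&key, s.data(), s.size() < 8 ? s.size() : 8);
    return key;
}
```

Using the packed value as the key also means the table never chases a pointer to heap-allocated string data during probing, which is where much of the cache-miss saving comes from.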