
So I have a vector V1 of exactly 5 strings.

And I have a set of vectors (called filters), each with 5 strings.

I need a fast algorithm to see whether the given vector matches any of the vectors in the set.

The vector V1 matches a vector from the set (for example filter1) if

V1[i] = filter1[i] OR filter1[i] = "" (empty string)

Example:

Filters:

filter1: "abc" "bcd" "bb" "" ""

filter2: "abc" "bcd" "bb" "ee" "ff"

filter3: "abc" "ddd" "bb" "j" ""

Searched vector:

V1: "abc", "bcd", "bb", "ee", "ff"

V1 matches two vectors in the filters: filter1 and filter2.

So I was thinking of storing the filters in an unordered_set. But I don't know how to write the hash function so that it could find a match: it would give different hash values for different vectors, even though they can match. The other idea I had was to construct a regular expression, but again the search would be O(n). Any tips on how I can check for a match in O(log(n)) or O(1) (where n is the size of the filter set)?
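For concreteness, this is the straightforward O(n) check I'm trying to beat (names like `find_matches` are just for illustration):

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

using Filter = std::array<std::string, 5>;

// A filter matches v when every term is either equal to v's term or "".
bool matches(const Filter& v, const Filter& f) {
    for (std::size_t i = 0; i < 5; ++i)
        if (!f[i].empty() && f[i] != v[i])
            return false;
    return true;
}

// The O(n) baseline: test v against every filter in turn.
std::vector<const Filter*> find_matches(const Filter& v,
                                        const std::vector<Filter>& filters) {
    std::vector<const Filter*> out;
    for (const auto& f : filters)
        if (matches(v, f))
            out.push_back(&f);
    return out;
}
```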

1 Answer


It's not entirely clear what all your desired behaviours and constraints are. For example, in left-to-right order as you've listed the example filters, could a filter contain "" (your match-all "wildcard"), then another non-"" string?

Anyway, assuming you don't allow filters to have a "" wildcard followed by a non-"" term, you could create an index of your filters with type...

// one nesting level per term: each level maps a term either to the
// matched filter (when all remaining terms are "") or to the next level
std::unordered_map<std::string,
  std::variant<filter*,
    std::unordered_map<std::string,
      std::variant<filter*,
        std::unordered_map<std::string,
          std::variant<filter*,
            std::unordered_map<std::string,
              std::variant<filter*,
                std::unordered_map<std::string, filter*>
> > > > > > > > index;

This effectively describes a tree where each level may have either an exact string match or a "" wildcard filter match, yielding either a filter* to the matched filter if that's the last non-"" term in the filter, or the next level of the index.

Given your candidate for filtering, v1, and assuming you've done the obvious work to populate the index, you can then search like this:

std::vector<filter*>
match(const std::array<std::string, 5>& v,
      const auto& index) {
    std::vector<filter*> matches;

    auto it0 = index.find("");
    if (it0 != index.end())
        matches.push_back(std::get<filter*>(it0->second)); // match-everything filter
    if ((it0 = index.find(v[0])) == index.end())
        return matches; // no non-wildcard match

    const auto& i1 = std::get<1>(it0->second); // next level in index
    auto it1 = i1.find("");
    ...same kind of logic, working your way through the 5 levels...

This involves at most 10 hash table lookups, so it's O(1) w.r.t. the filter-set size.
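If the deeply nested `std::variant` type gets unwieldy, here's a sketch of the same idea using a small recursive node type instead (`Node`, `insert` and `match` are my own names, and duplicate filters simply overwrite each other's leaf); it keeps the same at-most-a-handful-of-lookups, O(1) behaviour:

```cpp
#include <array>
#include <cassert>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

using Filter = std::array<std::string, 5>;

// Each node either terminates a filter (via `leaf`) or continues to the
// next term (via `next`) - or both, when filters share a prefix.
struct Node {
    const Filter* leaf = nullptr;
    std::unordered_map<std::string, std::unique_ptr<Node>> next;
};

// Insert a filter, descending one level per non-"" term; per the answer's
// assumption, the first "" term implies the rest are "" too.
void insert(Node& root, const Filter& f) {
    Node* n = &root;
    for (const auto& term : f) {
        if (term.empty()) break;        // wildcard tail: filter ends here
        auto& child = n->next[term];
        if (!child) child = std::make_unique<Node>();
        n = child.get();
    }
    n->leaf = &f;
}

// One exact hash lookup per level; wildcard-terminated filters are
// collected via each node's `leaf` on the way down.
std::vector<const Filter*> match(const Node& root, const Filter& v) {
    std::vector<const Filter*> out;
    const Node* n = &root;
    for (const auto& term : v) {
        if (n->leaf) out.push_back(n->leaf); // filter with wildcard tail here
        auto it = n->next.find(term);
        if (it == n->next.end()) return out;
        n = it->second.get();
    }
    if (n->leaf) out.push_back(n->leaf);     // fully specified filter
    return out;
}
```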

That doesn't necessarily mean it'll be the best option performance-wise for any given data: the typical length of the strings, patterns in them (such as being likely to start with the same characters and only differ in the last few), and the size of the filter set - knowledge of these things may suggest an actually-faster approach.

How well the hash table works will also depend on the hash function used. For longish strings, GCC's string hash function (MURMUR32) is massively higher quality than Visual C++'s (which, for speed, only incorporates 10 characters spaced along the string), and GCC uses prime bucket counts, which can be much less collision-prone than Visual C++'s power-of-2 choice. Other optimisations may be suggested by knowledge of the data: e.g. if you know the strings are always < 8 characters, you may choose to put them into a zero-padded uint64_t and do direct equality comparisons, and use an open addressing / closed hashing hash table implementation to reduce CPU cache misses and improve speed.
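That last suggestion might look like this (a sketch under the < 8 character assumption; `pack8` is an illustrative name, not a standard function):

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// Pack a string of fewer than 8 characters into a zero-padded uint64_t,
// so equality tests become a single integer comparison (endianness
// doesn't matter for equality).
std::uint64_t pack8(const std::string& s) {
    std::uint64_t v = 0;                  // zero padding for unused bytes
    std::memcpy(&v, s.data(), s.size());  // caller guarantees s.size() < 8
    return v;
}
```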

Tony Delroy
  • If I understand it right, then you need a perfect hash function to guarantee a lookup of O(1) in a hash map - or am I overlooking something here? – Surt Jan 17 '21 at 23:02
  • @Surt: yes, you're overlooking something - namely, what O(1) means. Check [this answer](https://stackoverflow.com/a/30567466/410767) under "Linked list lengths relate to load factor, not the number of values", and if that doesn't help I'd welcome another question here or there. – Tony Delroy Jan 18 '21 at 02:23
  • If I look at [hash table](https://en.wikipedia.org/wiki/Hash_table) under worst case, I see O(N). – Surt Jan 18 '21 at 10:10
  • @Surt: yes, but if you use a strong hash function, once you get a few hundred elements or more the probability of that happening is far less than the chance of your computer getting hit by a meteor before it can finish the lookup, so it can very reasonably be ignored. You do need to know how to write or select an appropriate hash function though, and deal with deliberate adversarial key generation if that's relevant to your scenario (e.g. a website with open source code where someone could see your hash function and engineer colliding keys). – Tony Delroy Jan 18 '21 at 20:47