
I'm trying to build an efficient string matching algorithm. This will execute in a high-volume environment, so performance is critical.

Here are my requirements:

  • Given a domain name, e.g. www.example.com, determine if it "matches" one in a list of entries.
  • Entries may be absolute matches, e.g. www.example.com.
  • Entries may include wildcards, e.g. *.example.com.
  • Wildcard entries match from the most-defined level and up. For example, *.example.com would match www.example.com, example.com, and sub.www.example.com.
  • Wildcard entries are not embedded, i.e. sub.*.example.com will not be an entry.

Language/environment: C# (.Net Framework 3.5)

I've considered splitting the entries (and domain lookup) into arrays, reversing the order, then iterating through the arrays. While accurate, it feels slow.

I've considered Regex, but am concerned about accurately representing the list of entries as regular expressions.

My question: what's an efficient way of finding if a string, in the form of a domain name, matches any one in a list of strings, given the description listed above?

jro
  • the question is? / I would use Regex btw, just make sure the expression is compiled once (instead of it being recalculated again and again). – eglasius Mar 05 '09 at 00:31
  • What do you mean by "feels slow"? Have you actually measured anything? – Ed S. Mar 05 '09 at 00:48
  • How many items do you expect to be in your search list? Will all of these items be in memory? Have you considered using a database? – overslacked Mar 05 '09 at 01:18
  • @Ed: slow is relative. I'm trying to determine if there's a more efficient way using string algorithms. – jro Mar 05 '09 at 01:31
  • Search list will be loaded in memory. It's relatively insubstantial -- thousands of entries. – jro Mar 05 '09 at 01:32

14 Answers


If you're looking to roll your own, I would store the entries in a tree structure. See my answer to another SO question about spell checkers to see what I mean.

Rather than tokenize the structure by "." characters, I would just treat each entry as a full string. Any tokenized implementation would still have to do string matching on the full set of characters anyway, so you may as well do it all in one shot.

The only differences between this and a regular spell-checking tree are:

  1. The matching needs to be done in reverse
  2. You have to take into account the wildcards

To address point #2, you would simply check for the "*" character at the end of a test.

A quick example:

Entries:

*.fark.com
www.cnn.com

Tree:

m -> o -> c -> . -> k -> r -> a -> f -> . -> *
                \
                 -> n -> n -> c -> . -> w -> w -> w

Checking www.blog.fark.com would involve tracing through the tree up to the first "*". Because the traversal ended on a "*", there is a match.

Checking www.cern.com would fail on the second "n" of n,n,c,...

Checking dev.www.cnn.com would also fail, since the traversal ends on a character other than "*".
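A minimal sketch of this reversed character tree in Java (the question asks for C#, but the translation is direct; the class and method names here are my own, and I assume entries are either literal domains or carry a "*." prefix, per the question):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the reversed character tree described above.
// Names are illustrative, not from any real library.
public class ReversedTrie {
    private final Map<Character, ReversedTrie> children =
            new HashMap<Character, ReversedTrie>();
    private boolean exactEnd;  // a literal entry ends at this node
    private boolean wildcard;  // the "*" of a wildcard entry sits here

    public void add(String entry) {
        boolean isWildcard = entry.startsWith("*.");
        String s = isWildcard ? entry.substring(2) : entry;
        ReversedTrie node = this;
        for (int i = s.length() - 1; i >= 0; i--) {  // insert characters in reverse
            char c = s.charAt(i);
            ReversedTrie child = node.children.get(c);
            if (child == null) {
                child = new ReversedTrie();
                node.children.put(c, child);
            }
            node = child;
        }
        if (isWildcard) node.wildcard = true;
        else node.exactEnd = true;
    }

    public boolean matches(String domain) {
        ReversedTrie node = this;
        for (int i = domain.length() - 1; i >= 0; i--) {
            char c = domain.charAt(i);
            // Hitting a wildcard at a label boundary means a match,
            // e.g. *.fark.com against www.blog.fark.com.
            if (node.wildcard && c == '.') return true;
            ReversedTrie child = node.children.get(c);
            if (child == null) return false;
            node = child;
        }
        // A wildcard entry also matches the bare domain (*.fark.com vs fark.com).
        return node.exactEnd || node.wildcard;
    }
}
```

Requiring a "." (or end of input) at the wildcard node is what keeps xfark.com from matching *.fark.com.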

e.James
  • This is a great way to do it, assuming that a "*" can only appear at the end (which seems reasonable here). You might get a speedup if you flatten the trie to an array of strings at some depth (e.g. 8 chars) since at this depth the trie is quite sparse and following per-char pointers is slow. – j_random_hacker Mar 05 '09 at 02:41
  • Good suggestion. The pointer following overhead will slow things down a touch. I wonder what the optimal array depth is? As usual, we're going to need some profiling to see if our solutions are actually helping :) – e.James Mar 05 '09 at 03:23
  • This is the biggest memory hog out there (I've actually implemented this once). With virtual memory, I'm not too sure about performance either. – vrdhn Mar 05 '09 at 15:41
  • There is no doubt that it will take up a lot of memory. The simplest approach would use around 50 to 100 times the memory of the original list. I imagine you'd be looking at about 15-20 MB for the first thousand entries. – e.James Mar 05 '09 at 18:21
  • As with any design decision, the memory cost will have to be weighed against the performance improvement, and both will have to be measured. I described the fastest method I could think of here, because the OP was asking about performance. – e.James Mar 05 '09 at 18:24

I would use Regex, just make sure the expression is compiled once (instead of it being recalculated again and again).
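One hedged sketch of the idea, in Java for consistency with the other answers (the class name and the wildcard-to-regex translation rule are my own, following the matching rules in the question):

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Build one anchored pattern for the whole entry list and compile it
// once up front. Per the question's rules, "*.example.com" must also
// match "example.com" itself, hence the optional prefix group.
public class DomainRegex {
    public static Pattern compile(List<String> entries) {
        StringBuilder alternation = new StringBuilder();
        for (String entry : entries) {
            if (alternation.length() > 0) alternation.append('|');
            if (entry.startsWith("*.")) {
                // optional "anything." prefix, then the literal remainder
                alternation.append("(?:.+\\.)?")
                           .append(Pattern.quote(entry.substring(2)));
            } else {
                alternation.append(Pattern.quote(entry)); // absolute entry
            }
        }
        return Pattern.compile("^(?:" + alternation + ")$");
    }
}
```

Quoting each entry literally sidesteps the OP's worry about accurately representing entries as regular expressions (the "." in domains would otherwise match any character).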

eglasius
  • I think that regular expressions, even with precompilation, would be very much overkill and inefficient for matching of this simple nature. Then again, I think most naive approaches would also be very inefficient, so maybe this is the best easy solution. +1 – Sparr Mar 05 '09 at 00:47

You don't need a regexp. Just reverse all the strings, strip the '*', and set a flag on each entry to mark that a partial match up to that point counts as a match.

Somehow, a trie or suffix trie looks most appropriate.

If the list of domains is known at compile time, you may look at tokenizing at '.' and using multiple gperf-generated machines.

Links: Google for "trie"; http://marknelson.us/1996/08/01/suffix-trees/

vrdhn

I would use a tree structure to store the rules, where each tree node is/contains a Dictionary.

Construct the tree such that "com", "net", etc are the top level entries, "example" is in the next level, and so on. You'll want a special flag to note that the node is a wildcard.

To perform the lookup, split the string by period, and iterate backwards, navigating the tree based on the input.

This seems similar to what you say you considered, but assuming the rules don't change each run, using a cached Dictionary-based tree would be faster than a list of arrays.

Additionally, I would have to bet that this approach would be faster than RegEx.
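A rough sketch of such a tree in Java, using a HashMap where the answer says Dictionary (the class name and wildcard-flag layout are my own assumptions):

```java
import java.util.HashMap;
import java.util.Map;

// One node per domain label, built from the TLD downwards:
// "com" at the top level, "example" under it, and so on.
public class LabelTree {
    private final Map<String, LabelTree> children =
            new HashMap<String, LabelTree>();
    private boolean wildcard;  // a "*" entry was found under this node
    private boolean terminal;  // an absolute entry ends at this node

    public void add(String entry) {
        String[] labels = entry.split("\\.");
        LabelTree node = this;
        for (int i = labels.length - 1; i >= 0; i--) {  // com, example, www
            if (labels[i].equals("*")) { node.wildcard = true; return; }
            LabelTree child = node.children.get(labels[i]);
            if (child == null) {
                child = new LabelTree();
                node.children.put(labels[i], child);
            }
            node = child;
        }
        node.terminal = true;
    }

    public boolean matches(String domain) {
        String[] labels = domain.split("\\.");
        LabelTree node = this;
        for (int i = labels.length - 1; i >= 0; i--) {
            if (node.wildcard) return true;  // *.example.com matches anything below
            LabelTree child = node.children.get(labels[i]);
            if (child == null) return false;
            node = child;
        }
        // A wildcard entry also matches the bare domain, per the question.
        return node.terminal || node.wildcard;
    }
}
```

Lookup cost is then bounded by the number of labels in the input, not the number of entries.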

Ecton

You seem to have a well-defined set of rules regarding what you consider to be valid input - you might consider using a hand-written LL parser for this. Such parsers are relatively easy to write and optimize. Usually you'd have the parser output a tree structure describing the input - I would use this tree as input to a matching routine that performs the work of matching the tree against the list of entries, using the rules you described above.

Here's an article on recursive descent parsers.

Erik Forbes

Assuming the rules are as you said: literal or start with a *.

Java:

public static boolean matches(String candidate, List<String> rules) {
    for (String rule : rules) {
        if (rule.startsWith("*.")) {
            // "*.example.com" matches "example.com" and any subdomain of it
            String suffix = rule.substring(2);
            if (candidate.equals(suffix) || candidate.endsWith("." + suffix)) {
                return true;
            }
        } else if (candidate.equals(rule)) {
            // absolute entries must match exactly
            return true;
        }
    }
    return false;
}

This scales linearly with the number of rules you have.

EDIT:

Just to be clear here.

When I say "sort the rules", I really mean create a tree out of the rule characters.

Then you use the match string to try and walk the tree (i.e. if I have a string of xyz, I start with the x character, and see if it has a y branch, and then a z child).

For the "wildcards" I'd use the same concept, but populate it "backwards", and walk it with the back of the match candidate.

If you have a LOT (LOT LOT) of rules I would sort the rules.

For non wildcard matches, you iterate for each character to narrow the possible rules (i.e. if it starts with "w", then you work with the "w" rules, etc.)

If it IS a wildcard match, you do the exact same thing, but you work against a list of "backwards rules", and simply match from the end of the string against the end of the rule.

Will Hartung

I'd try a combination of tries with longest-prefix matching (which is used in routing for IP networking). Directed Acyclic Word Graphs may be more appropriate than tries if space is a concern.

Hank Gay

I'm going to suggest an alternative to the tree structure approach. Create a compressed index of your domain list using a Burrows-Wheeler transform. See http://www.ddj.com/architect/184405504?pgno=1 for a full explanation of the technique.

Kenneth Cochran

Have a look at RegExLib

Mitch Wheat

Not sure what your ideas were for splitting and iterating, but it seems like it wouldn't be slow:

Split the domains up and reverse, like you said. Storage could essentially be a tree. Use a hashtable to store the TLDs. The key would be, for example, "com", and the values would be a hashtable of subdomains under that TLD, repeated ad nauseam.

Ed Marty

Given your requirements, I think you're on-track in thinking about working from the end of the string (TLD) towards the hostname. You could use regular expressions, but since you're not really using any of the power of a regexp, I don't see why you'd want to incur their cost. If you reverse the strings, it becomes more apparent that you're really just looking for prefix-matching ('*.example.com' becomes: "does my reversed input string start with 'moc.elpmaxe'?"), which certainly doesn't require something as heavy-handed as regexps.

What structure you use to store your list of entries depends a lot on how big the list is and how often it changes... for a huge stable list, a tree/trie may be the most performant; an often-changing list needs a structure that is easy to initialize/update, and so on. Without more information, I'd be reluctant to suggest any one structure.
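As a rough illustration of the reversed prefix-matching idea in Java (a plain linear scan over the entries; for a huge, stable list the loops would be replaced by a trie or a binary search over a sorted list; the names here are my own):

```java
import java.util.ArrayList;
import java.util.List;

// Each entry becomes a reversed string to test against the reversed input:
// literals demand equality, wildcard stems demand a prefix match that
// ends exactly at a "." label boundary (or at the end of the input).
public class ReversedPrefixMatcher {
    private final List<String> exact = new ArrayList<String>();     // reversed literals
    private final List<String> prefixes = new ArrayList<String>();  // reversed wildcard stems

    public void add(String entry) {
        if (entry.startsWith("*.")) {
            prefixes.add(reverse(entry.substring(2)));
        } else {
            exact.add(reverse(entry));
        }
    }

    public boolean matches(String domain) {
        String r = reverse(domain);
        for (String e : exact) {
            if (r.equals(e)) return true;
        }
        for (String p : prefixes) {
            // boundary check keeps "badexample.com" from matching "*.example.com"
            if (r.equals(p) || (r.startsWith(p) && r.charAt(p.length()) == '.')) {
                return true;
            }
        }
        return false;
    }

    private static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }
}
```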

JaredReisinger

I guess I am tempted to answer your question with another one: what are you doing that makes you believe your bottleneck is some string matching above and beyond a simple string compare? Surely something else is listed higher up in your performance profiling?

I would use the obvious string-compare tests first, which will be right 90% of the time, and if they fail then fall back to a regex.

Scott Evernden

If it was just matching strings, then you should look at trie data structures and algorithms. An earlier answer suggests that, if all your wildcards are a single wildcard at the beginning, there are some specific algorithms you can use. However, a requirement to handle general wildcards means that, for fast execution, you're going to need to generate a state machine.

That's what a regex library does for you: "precompiling" the regex == generating the state machine; this allows the actual match at runtime to be fast. You're unlikely to get significantly better performance than that without extraordinary optimization efforts.

If you want to roll your own, I can say that writing your own state machine generator specifically for multiple wildcards should be educational. In that case, you'll need to read up on the kind of algorithms they use in regex libraries...


Investigate the KMP (Knuth-Morris-Pratt) or BM (Boyer-Moore) algorithms. These let you search a string faster than a naive character-by-character comparison, at the cost of a little pre-processing. Dropping the leading asterisk is of course crucial, as others have noted.

One source of information for these is:

KMP: http://www-igm.univ-mlv.fr/~lecroq/string/node8.html

BM: http://www-igm.univ-mlv.fr/~lecroq/string/node14.html

Jonathan Leffler