
I'm trying to build an efficient string matching algorithm. This will execute in a high-volume environment, so performance is critical.

Here are my requirements:

  • Given a domain name, e.g. www.example.com, determine if it "matches" one in a list of entries.
  • Entries may be absolute matches, e.g. www.example.com.
  • Entries may include wildcards, e.g. *.example.com.
  • Wildcard entries match from the most-defined level and up. For example, *.example.com would match www.example.com, example.com, and sub.www.example.com.
  • Wildcard entries are not embedded, i.e. sub.*.example.com will not be an entry.

Language/environment: C# (.Net Framework 3.5)

I've considered splitting the entries (and domain lookup) into arrays, reversing the order, then iterating through the arrays. While accurate, it feels slow.

I've considered Regex, but am concerned about accurately representing the list of entries as regular expressions.

My question: what's an efficient way of finding if a string, in the form of a domain name, matches any one in a list of strings, given the description listed above?

jro
  • the question is? / I would use Regex btw, just make sure the expression is compiled once (instead of it being recalculated again and again). – eglasius Mar 05 '09 at 00:31
  • What do you mean by "feels slow"? Have you actually measured anything? – Ed S. Mar 05 '09 at 00:48
  • How many items do you expect to be in your search list? Will all of these items be in memory? Have you considered using a database? – overslacked Mar 05 '09 at 01:18
  • @Ed: slow is relative. I'm trying to determine if there's a more efficient way using string algorithms. – jro Mar 05 '09 at 01:31
  • Search list will be loaded in memory. It's relatively insubstantial -- thousands of entries. – jro Mar 05 '09 at 01:32

14 Answers


If you're looking to roll your own, I would store the entries in a tree structure. See my answer to another SO question about spell checkers to see what I mean.

Rather than tokenize the structure by "." characters, I would just treat each entry as a full string. Any tokenized implementation would still have to do string matching on the full set of characters anyway, so you may as well do it all in one shot.

The only differences between this and a regular spell-checking tree are:

  1. The matching needs to be done in reverse
  2. You have to take into account the wildcards

To address point #2, you would simply check for the "*" character at the end of a test.

A quick example:

Entries:

*.fark.com
www.cnn.com

Tree:

m -> o -> c -> . -> k -> r -> a -> f -> . -> *
                \
                 -> n -> n -> c -> . -> w -> w -> w

Checking www.blog.fark.com would involve tracing through the tree up to the first "*". Because the traversal ended on a "*", there is a match.

Checking www.cern.com would fail on the second "n" of n,n,c,...

Checking dev.www.cnn.com would also fail, since the traversal ends on a character other than "*".
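A minimal sketch of this reversed character tree in Java (the question asks for C#, but the translation is direct; the class and method names here are my own, and I assume entries are either literal domains or carry a "*." prefix, per the question):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the reversed character tree described above.
// Names are illustrative, not from any real library.
public class ReversedTrie {
    private final Map<Character, ReversedTrie> children =
            new HashMap<Character, ReversedTrie>();
    private boolean exactEnd;  // a literal entry ends at this node
    private boolean wildcard;  // the "*" of a wildcard entry sits here

    public void add(String entry) {
        boolean isWildcard = entry.startsWith("*.");
        String s = isWildcard ? entry.substring(2) : entry;
        ReversedTrie node = this;
        for (int i = s.length() - 1; i >= 0; i--) {  // insert characters in reverse
            char c = s.charAt(i);
            ReversedTrie child = node.children.get(c);
            if (child == null) {
                child = new ReversedTrie();
                node.children.put(c, child);
            }
            node = child;
        }
        if (isWildcard) node.wildcard = true;
        else node.exactEnd = true;
    }

    public boolean matches(String domain) {
        ReversedTrie node = this;
        for (int i = domain.length() - 1; i >= 0; i--) {
            char c = domain.charAt(i);
            // Hitting a wildcard at a label boundary means a match,
            // e.g. *.fark.com against www.blog.fark.com.
            if (node.wildcard && c == '.') return true;
            ReversedTrie child = node.children.get(c);
            if (child == null) return false;
            node = child;
        }
        // A wildcard entry also matches the bare domain (*.fark.com vs fark.com).
        return node.exactEnd || node.wildcard;
    }
}
```

Requiring a "." (or end of input) at the wildcard node is what keeps xfark.com from matching *.fark.com.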

e.James
  • This is a great way to do it, assuming that a "*" can only appear at the end (which seems reasonable here). You might get a speedup if you flatten the trie to an array of strings at some depth (e.g. 8 chars) since at this depth the trie is quite sparse and following per-char pointers is slow. – j_random_hacker Mar 05 '09 at 02:41
  • Good suggestion. The pointer following overhead will slow things down a touch. I wonder what the optimal array depth is? As usual, we're going to need some profiling to see if our solutions are actually helping :) – e.James Mar 05 '09 at 03:23
  • This is the biggest memory hog out there (I've actually implemented this once). With virtual memory, I'm not too sure about performance either. – vrdhn Mar 05 '09 at 15:41
  • There is no doubt that it will take up a lot of memory. The simplest approach would use around 50 to 100 times the memory of the original list. I imagine you'd be looking at about 15-20 MB for the first thousand entries. – e.James Mar 05 '09 at 18:21
  • As with any design decision, the memory cost will have to be weighed against the performance improvement, and both will have to be measured. I described the fastest method I could think of here, because the OP was asking about performance. – e.James Mar 05 '09 at 18:24

I would use Regex, just make sure the expression is compiled once (instead of it being recalculated again and again).
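One hedged sketch of the idea, in Java for consistency with the other answers (the class name and the wildcard-to-regex translation rule are my own, following the matching rules in the question):

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Build one anchored pattern for the whole entry list and compile it
// once up front. Per the question's rules, "*.example.com" must also
// match "example.com" itself, hence the optional prefix group.
public class DomainRegex {
    public static Pattern compile(List<String> entries) {
        StringBuilder alternation = new StringBuilder();
        for (String entry : entries) {
            if (alternation.length() > 0) alternation.append('|');
            if (entry.startsWith("*.")) {
                // optional "anything." prefix, then the literal remainder
                alternation.append("(?:.+\\.)?")
                           .append(Pattern.quote(entry.substring(2)));
            } else {
                alternation.append(Pattern.quote(entry)); // absolute entry
            }
        }
        return Pattern.compile("^(?:" + alternation + ")$");
    }
}
```

Quoting each entry literally sidesteps the OP's worry about accurately representing entries as regular expressions (the "." in domains would otherwise match any character).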

eglasius
  • I think that regular expressions, even with precompilation, would be very much overkill and inefficient for matching of this simple nature. Then again, I think most naive approaches would also be very inefficient, so maybe this is the best easy solution. +1 – Sparr Mar 05 '09 at 00:47

You don't need a regexp. Just reverse all the strings, strip the '*', and set a flag on each entry to mark that a partial match up to that point counts as a match.

Somehow, a trie or suffix trie looks most appropriate.

If the list of domains is known at compile time, you may look at tokenizing at '.' and using multiple gperf-generated machines.

Links: Google for "trie"; http://marknelson.us/1996/08/01/suffix-trees/

vrdhn

I would use a tree structure to store the rules, where each tree node is/contains a Dictionary.

Construct the tree such that "com", "net", etc are the top level entries, "example" is in the next level, and so on. You'll want a special flag to note that the node is a wildcard.

To perform the lookup, split the string by period, and iterate backwards, navigating the tree based on the input.

This seems similar to what you say you considered, but assuming the rules don't change each run, using a cached Dictionary-based tree would be faster than a list of arrays.

Additionally, I would have to bet that this approach would be faster than RegEx.
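A rough sketch of such a tree in Java, using a HashMap where the answer says Dictionary (the class name and wildcard-flag layout are my own assumptions):

```java
import java.util.HashMap;
import java.util.Map;

// One node per domain label, built from the TLD downwards:
// "com" at the top level, "example" under it, and so on.
public class LabelTree {
    private final Map<String, LabelTree> children =
            new HashMap<String, LabelTree>();
    private boolean wildcard;  // a "*" entry was found under this node
    private boolean terminal;  // an absolute entry ends at this node

    public void add(String entry) {
        String[] labels = entry.split("\\.");
        LabelTree node = this;
        for (int i = labels.length - 1; i >= 0; i--) {  // com, example, www
            if (labels[i].equals("*")) { node.wildcard = true; return; }
            LabelTree child = node.children.get(labels[i]);
            if (child == null) {
                child = new LabelTree();
                node.children.put(labels[i], child);
            }
            node = child;
        }
        node.terminal = true;
    }

    public boolean matches(String domain) {
        String[] labels = domain.split("\\.");
        LabelTree node = this;
        for (int i = labels.length - 1; i >= 0; i--) {
            if (node.wildcard) return true;  // *.example.com matches anything below
            LabelTree child = node.children.get(labels[i]);
            if (child == null) return false;
            node = child;
        }
        // A wildcard entry also matches the bare domain, per the question.
        return node.terminal || node.wildcard;
    }
}
```

Lookup cost is then bounded by the number of labels in the input, not the number of entries.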

Ecton

You seem to have a well-defined set of rules regarding what you consider to be valid input - you might consider using a hand-written LL parser for this. Such parsers are relatively easy to write and optimize. Usually you'd have the parser output a tree structure describing the input - I would use this tree as input to a matching routine that performs the work of matching the tree against the list of entries, using the rules you described above.

Here's an article on recursive descent parsers.

Erik Forbes

Assuming the rules are as you said: literal or start with a *.

Java:

public static boolean matches(String candidate, List<String> rules) {
    for (String rule : rules) {
        if (rule.startsWith("*.")) {
            // "*.example.com" matches "example.com" and any subdomain of it
            String suffix = rule.substring(2);
            if (candidate.equals(suffix) || candidate.endsWith("." + suffix)) {
                return true;
            }
        } else if (candidate.equals(rule)) {
            // absolute entries must match exactly
            return true;
        }
    }
    return false;
}

This scales linearly with the number of rules you have.

EDIT:

Just to be clear here.

When I say "sort the rules", I really mean create a tree out of the rule characters.

Then you use the match string to try and walk the tree (i.e. if I have a string of xyz, I start with the x character, and see if it has a y branch, and then a z child).

For the "wildcards" I'd use the same concept, but populate it "backwards", and walk it with the back of the match candidate.

If you have a LOT (LOT LOT) of rules I would sort the rules.

For non wildcard matches, you iterate for each character to narrow the possible rules (i.e. if it starts with "w", then you work with the "w" rules, etc.)

If it IS a wildcard match, you do the exact same thing, but you work against a list of "backwards rules", and simply match from the end of the string against the end of the rule.

Will Hartung

I'd try a combination of tries with longest-prefix matching (which is used in routing for IP networking). Directed Acyclic Word Graphs may be more appropriate than tries if space is a concern.

Hank Gay

I'm going to suggest an alternative to the tree structure approach. Create a compressed index of your domain list using a Burrows-Wheeler transform. See http://www.ddj.com/architect/184405504?pgno=1 for a full explanation of the technique.

Kenneth Cochran

Have a look at RegExLib

Mitch Wheat

Not sure what your ideas were for splitting and iterating, but it seems like it wouldn't be slow:

Split the domains up and reverse, like you said. Storage could essentially be a tree. Use a hashtable to store the TLDs. The key would be, for example, "com", and the values would be a hashtable of subdomains under that TLD, repeated ad nauseam.

Ed Marty

Given your requirements, I think you're on-track in thinking about working from the end of the string (TLD) towards the hostname. You could use regular expressions, but since you're not really using any of the power of a regexp, I don't see why you'd want to incur their cost. If you reverse the strings, it becomes more apparent that you're really just looking for prefix-matching ('*.example.com' becomes: "does my reversed input string start with 'moc.elpmaxe'?"), which certainly doesn't require something as heavy-handed as regexps.

What structure you use to store your list of entries depends a lot on how big the list is and how often it changes... for a huge stable list, a tree/trie may be the most performant; an often-changing list needs a structure that is easy to initialize/update, and so on. Without more information, I'd be reluctant to suggest any one structure.
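As a rough illustration of the reversed prefix-matching idea in Java (a plain linear scan over the entries; for a huge, stable list the loops would be replaced by a trie or a binary search over a sorted list; the names here are my own):

```java
import java.util.ArrayList;
import java.util.List;

// Each entry becomes a reversed string to test against the reversed input:
// literals demand equality, wildcard stems demand a prefix match that
// ends exactly at a "." label boundary (or at the end of the input).
public class ReversedPrefixMatcher {
    private final List<String> exact = new ArrayList<String>();     // reversed literals
    private final List<String> prefixes = new ArrayList<String>();  // reversed wildcard stems

    public void add(String entry) {
        if (entry.startsWith("*.")) {
            prefixes.add(reverse(entry.substring(2)));
        } else {
            exact.add(reverse(entry));
        }
    }

    public boolean matches(String domain) {
        String r = reverse(domain);
        for (String e : exact) {
            if (r.equals(e)) return true;
        }
        for (String p : prefixes) {
            // boundary check keeps "badexample.com" from matching "*.example.com"
            if (r.equals(p) || (r.startsWith(p) && r.charAt(p.length()) == '.')) {
                return true;
            }
        }
        return false;
    }

    private static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }
}
```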

JaredReisinger

I guess I am tempted to answer your question with another one: what are you doing that makes you believe your bottleneck is some string matching above and beyond a simple string compare? Surely something else is listed higher up in your performance profiling?

I would use the obvious string-compare tests first, which will be right 90% of the time, and if they fail then fall back to a regex.

Scott Evernden

If it was just matching strings, then you should look at trie data structures and algorithms. An earlier answer suggests that, if all your wildcards are a single wildcard at the beginning, there are some specific algorithms you can use. However, a requirement to handle general wildcards means that, for fast execution, you're going to need to generate a state machine.

That's what a regex library does for you: "precompiling" the regex == generating the state machine; this allows the actual match at runtime to be fast. You're unlikely to get significantly better performance than that without extraordinary optimization efforts.

If you want to roll your own, I can say that writing your own state machine generator specifically for multiple wildcards should be educational. In that case, you'll need to read up on the kind of algorithms they use in regex libraries...


Investigate the KMP (Knuth-Morris-Pratt) or BM (Boyer-Moore) algorithms. These let you search a string faster than a naive character-by-character comparison, at the cost of a little pre-processing. Dropping the leading asterisk is of course crucial, as others have noted.

One source of information for these is:

KMP: http://www-igm.univ-mlv.fr/~lecroq/string/node8.html

BM: http://www-igm.univ-mlv.fr/~lecroq/string/node14.html

Jonathan Leffler