How do I efficiently tell which of a sequence of regular expressions match a given string?

Question

I'm working on a Ruby project in which the user can configure a list of regular expressions and associate each expression with some action to be performed. This list of expressions is then tested against some given input string, and the handler(s) of the matching expression(s) will be executed. The code basically looks like this:

@mapping = {
    /^[A-Za-z]+$/ => Proc.new { puts "Got word" },
    /^\d+$/ => Proc.new { puts "Got number" },
    /^.$/ => Proc.new { puts "Got single character" }
}

def run_handlers(s)
    @mapping.each { |regexp, handler| handler.call if regexp =~ s.to_s }
end

# Prints 'Got word'
run_handlers("StackOverflow")

# Prints 'Got number' as well as 'Got single character'
run_handlers(4)

Alas, this performs rather poorly when there are many (a couple thousand) handlers. Performing a linear search over all handlers seems a bit wasteful, but since the expressions are user-configurable I cannot exploit some specific traits of the expressions easily. I wonder whether there's a more efficient way to do this.

One idea I was toying with is to merge multiple regular expressions (i.e. multiple state machines) into one big state machine which is able to run more efficiently. I'm aware of the union method to merge regular expressions, but since I need to execute a different set of handlers depending on which expression match, that does not seem to be of any use.

I was thinking that maybe the expressions can internally be managed like a tree structure (much like a Trie such that the program does not actually need to try all the regular expressions but can quickly drop entire sets of expressions.

Maybe there is some existing code for this kind of problem already? I suppose various web frameworks deal with the same kind of issue when getting a request and having to decide which route to select.

@ReggieB Alas, not really. The handlers typically do a lot of I/O. — Frerich Raabe, Dec 22 '16 at 13:07

score 1 · Answer 1 · edited May 23 '17 at 12:33

You can use parenthesis to capture parts of a Regular Expression match. Have a look at this:

['1','a','Z'].each {|x| p /((\d)|([a-z])|([A-Z]))/.match(x).to_a }

Where I am passing three strings to the regex /((\d)|([a-z])|([A-Z]))/ via its match method. I'm then converting the output to an array so that it is easy to compare.

It contains three regex elements:

(\d)    # any number
([a-z]) # lowercase letters
([A-Z]) # uppercase letters

Note that each is enclosed within parenthesis.

They are all then enclosed within an OR block: (a|b|c)

It outputs:

["1", "1", "1", nil, nil]
["a", "a", nil, "a", nil]
["Z", "Z", nil, nil, "Z"]

The first match is the match for the complete regex, and the second is for the outer OR block.

So it is the third to fifth matches that are of interest. They match the inner elements.

The main issue I think you may have with this, is if the inner elements themselves have OR blocks and capture blocks. For example, have a look at this variant of my code:

['1','a','Z'].each {|x| p /((\d)|(([a-g])|([h-z]))|([A-Z]))/.match(x).to_a }

In this I've split the second element into two (([a-g])|([h-z])).

As you can see from the output, that throws out the results:

["1", "1", "1", nil, nil, nil, nil]
["a", "a", nil, "a", "a", nil, nil]
["Z", "Z", nil, nil, nil, nil, "Z"]

So if you have no control on the structure of each regex element, I don't think you'll be able to predict the structure of the output.

Using groups is a nice idea! Do you know if this will still work if I have a few thousand regexps, i.e. is there some practical limit to the number of choices in a group? As for the inner elements capturing data: I guess once I know which exact regexp matches, I could match that one again in isolation to extract the data. — Frerich Raabe, Dec 22 '16 at 09:08
Hm, this approach does not seem to work well for overlapping groups, e.g. `/(\d)|(.)/.match('9').to_a` yields `["9", "9", nil]`. — Frerich Raabe, Dec 22 '16 at 11:28
Yes - I don't think this is the final solution. I wonder if you could identify a way to identify expressions where this grouping would work, and then apply it just to them. If it handled only a fairly small proportion of the expressions it might speed things up a significantly. — ReggieB, Dec 22 '16 at 11:56

How do I efficiently tell which of a sequence of regular expressions match a given string?

1 Answers1