Given a string, how to iterate through many regexes to find a match?

Question

Given a string, and a mapping of regexes to integers, I want to find out which integer the string maps to (assume the string will match exactly one of those regexes)

Is it inefficient to just iterate through the hash of regexes, trying each regex against the string, then outputting the value? Certainly I cannot explicitly enumerate all possible string => integer mappings, but it seems bad to try matching each regex in bunch of regexes.

"Is it inefficient to just iterate through the hash of regexes ...?" - No, because you also say, that this is what you want to achieve. Maybe describe, how you would imagine a shortcut, or what your *big picture* really is ... — miku, Jan 19 '11 at 01:28
There isn't a way to test a string against a regular expression without doing it. It might be helpful if you share a little about the bigger problem you're trying to solve and the purpose of the regex to integer mapping. — Jimmy, Jan 19 '11 at 01:28
I think you need to show some examples of what you mean, or are looking for. Or, better yet, show sample code so we get an idea where you are headed. — the Tin Man, Jan 19 '11 at 02:05
The question shouldn't be "is it inefficient." It should be, "is it efficient enough?" Most likely it is. Benchmark it and find out, and only if it is indeed too slow should you write anything other than the clearest, simplest possible code. — Wayne Conrad, Jan 19 '11 at 09:38

score 1 · Answer 1 · edited May 23 '17 at 12:02

RegexpTrie, which wasn't around when I last looked for something like it, helps with this sort of problem:

require 'regexp_trie'

sentence = 'life on the mississippi'
words_ary = %w[the sip life]

words_regex = /\b(?:#{RegexpTrie.union(words_ary, option: Regexp::IGNORECASE).source})\b/i 
# => /\b(?:(?:the|sip|life))\b/i

words_to_ints = words_ary.each_with_index.to_h
# => {"the"=>0, "sip"=>1, "life"=>2}

sentence_words = sentence.split
# => ["life", "on", "the", "mississippi"]

word_hits = sentence_words.map { |w| w[words_regex] }
# => ["life", nil, "the", nil]

nil means there was no match of that word in the regular expression.

words_to_ints.values_at(*word_hits)
# => [2, nil, 0, nil]

Again, nil means there was no match. nil values could be ignored using:

word_hits = sentence_words.map { |w| w[words_regex] }.compact
# => ["life", "the"]

words_to_ints.values_at(*word_hits)
# => [2, 0]

Similarly, if you want to scan a sentence for word matches instead of individual words:

require 'regexp_trie'

sentence = 'life on the mississippi'
words = %w[the sip life]

words_regex = /\b(?:#{RegexpTrie.union(words, option: Regexp::IGNORECASE).source})\b/i 
# => /\b(?:(?:the|sip|life))\b/i

words_to_ints = words.each_with_index.to_h
# => {"the"=>0, "sip"=>1, "life"=>2}

word_hits = sentence.scan(words_regex)
# => ["life", "the"]

words_to_ints.values_at(*word_hits)
# => [2, 0]

Perl has a really useful module for this sort of thing called Regexp::Assemble, which lets you combine regexes into one big one, then search a string, returning the hits. You can ask it to tell which pattern was used if you want to know.

Ruby doesn't have such a module, but this gets kinda close:

patterns = {
  /(foo)/ => 1,
  /(bar)/ => 2
}

pattern_union = Regexp.union(patterns.keys)

pattern_union # => /(?-mix:(foo))|(?-mix:(bar))/

str = 'foo some text'

if (pattern_union =~ str)

  # these show what are being processed...
  pattern_union.match(str).captures # => ["foo", nil]
  pattern_union.match(str).captures.zip(patterns.keys).find_all{ |c| c[0] }.map{ |c| c[1] } # => [/(foo)/]

  # process it...
  matched_pattern_values = patterns.values_at(*pattern_union.match(str).captures.zip(patterns.keys).find_all{ |c| c[0] }.map{ |c| c[1] })

  # here's what we got
  matched_pattern_values # => [1]

end

There's probably a way to do it in one line, but this works.

I think its important to avoid having to iterate over patterns to look for hits in strings if at all possible, because they can slow down badly as the size of the text or number of patterns increase.

See "Is there an efficient way to perform hundreds of text substitutions in Ruby?" for more about using Regexp::Assemble from Ruby.

score 1 · Accepted Answer · answered Jan 19 '11 at 06:22

Just do as you suggest, loop over the hash of regex/numbers and return the first that matches a string:

def find_match(regex_mapping, str)
  regex_mapping.each do |regex, n|
    return n if str =~ regex
  end
  return nil
end

The only thing to say about efficiency is this: it probably doesn't matter anyway. Just write your code as clearly and simply as you can, and then, in the end, if it is to slow, run it through a profiler (for example the absolutely awesome perftools.rb) and see what the hotspots are. Optimize those. Don't optimize before you've written any code.

That said, a simple optimization you can do in this case, which doesn't cost you anything, is to put the regexes into the mapping hash in an order such that the most likely to match comes first, that way fewer comparisons have to be made (but this is a probabilistic optimization, the worst case running time remains the same). This only applies to Ruby 1.9 though, since hashes don't retain their insertion order in 1.8.

score 0 · Answer 3 · answered Jan 19 '11 at 03:05

It depends on how complicated your regexes are. If you can put them in capture blocks and have the capture blocks map back to the integers that you need then you should be ok.

For instance:

(is).*(test)

has two capture blocks. This will match:

This is a test

The captures will be 1:is and 2:test

You can quickly try it on http://www.rubular.com/

score -1 · Answer 4 · answered Jan 19 '11 at 06:26

You say 'it seems bad', but in the end, there is probably nothing you can do: you will have to match each string to a succession of regexes until one matches. You can memoize the results and be smart in other ways, like 'if this regex fails, these other 10 will also fail', but those are all performance optimizations that you may not need.

The simplest optimization may be to create groups of regexes with shared characteristics and first test which group the string is in. If string =~ /^a/ is nil, all other regexes that test for a string starting with a don't need to be tested anymore.

Given a string, how to iterate through many regexes to find a match?

4 Answers4