What is the most efficient way to search a blob of text for an array of regular expressions?

Question

I'm looking for the most efficient way to search a blob of text (± 1/2KB) for many regular expressions stored in an array.

Example code:

patterns = [/patternA/i,/patternB/i,/patternC/m,...,/patternN/i]

content  = "Lorem ipsum dolor sit amet, consectetur... officiam id est laborum."

r = patterns.collect{ |pattern|

  pattern unless ( content =~ pattern ).blank?

}.compact

Where r now contains patterns that matched the content string.

I recently asked a similar question - http://stackoverflow.com/questions/5395236/is-there-an-efficient-way-to-perform-hundreds-of-text-substitutions-in-ruby - and depending on your patterns, there may be some helpful suggestions there. — Simon, Apr 14 '11 at 15:57

sawa · Answer 1 · 2011-04-14T22:41:47.360

2

Solution 1

Do this:

r = patterns.select{|pattern| content =~ pattern}

Since the string is huge, it is better to implement this method on String rather then on something else because passing a large argument seems to be slow.

class String
  def filter_patterns patterns
    patterns.select{|r| self =~ pattern}
  end
end

and use it like:

content.filter_patterns(patterns)

Solution 2

it has restrictions that each regex does not include a named/numbered capture.

combined_regex = Regexp.new(patterns.map{|r| "(?=[.\n]*(#{r.source}))?"}.join)
content =~ combined_regex

The following part will have problem if the regex inside patterns include a named/numbered capture. If there is a way to know for each regex how many potential captures there are, then it will solve the problem.

r = patterns.select.with_index{|pattern, i| Regexp.last_match[i]}

Addition

Given:

dogs = {
  'saluki' => 'Hounds',
  'russian wolfhound' => 'Hounds',
  'italian greyhound' => 'Hounds',
   ..
}
content = "Running in the fields at great speeds, the sleek saluki dog comes from..."

you can do this:

combined_regex =
    Regexp.new(dogs.keys.map{|w| "(?=[.\n]*(#{w}))?"}.join, Regexp::IGNORECASE)
content =~ combined_regex
r = patterns.select.with_index{|pattern, i| Regexp.last_match[i]}
"This article talks about #{r.collect{|x| dogs[x]}.to_sentence}."
=> "This article talks about Hounds."

To avoid outputs like This article talks about Hounds, Hounds and Hounds., you might want to put uniq in it.

"This article talks about #{r.uniq.collect{|x| dogs[x]}.to_sentence}."

edited Apr 14 '11 at 22:41

answered Apr 14 '11 at 15:45

sawa

165,429
45
277
381

Hey Sawa, thanks for your quick reply! Could you explain what is making this more efficient? Not checking on nil (or rather blank? which is a collection of tests)? And using select instead of collect to avoid nil values in the resulting array requiring the "compact" method to be applied? – Marca Apr 14 '11 at 15:49
`select` is efficient in that it lets through only patterns that are non-nil, so you do not have to filter it with `compact`. I did not use `empty?` because I assumed that your patterns do not include one that matches an empty string. Does it? – sawa Apr 14 '11 at 15:54
No, it doesn't contain a pattern that matches an empty string. – Marca Apr 14 '11 at 16:15
Solution #2 doesn't work for me for the reasons evoked in my comment of j.w.'s answer. – Marca Apr 14 '11 at 16:37
"passing a large argument seems to be slow"? Do you have any basis for this statement? – Rein Henrichs Apr 14 '11 at 19:13
I don't know how ruby deals with arguments but chances are that it refers to a pointer unless the argument is an Object.dup so I too don't think there is a performance issue when passing large arguments. – Marca Apr 14 '11 at 21:22
@sawa, with your solution, there's no way I can have such patterns :
`dogs = [ { :race => 'Hounds', :pattern => /(saluki|russian wolfhound|italian greyhound)/i }, .. ] content = "Running in the fields at great speeds, the sleek saluki dog comes from..." r = dogs.select{ |dog| content =~ dog.pattern } "This article talks about #{r.collect{ |x| x.race }.to_sentence}." => "This article talks about Hounds."` – Marca Apr 14 '11 at 21:35
Sorry but I can't find how to add new lines to comments. Mardown syntax is usually "two space and a carriage return" but it won't work here. Here's a cleaner pastie : http://pastie.org/private/ebeulum9bzp8q5fibkt6w – Marca Apr 14 '11 at 21:37
@Marca I see you used some version of my first answer. What is wrong with it? – sawa Apr 14 '11 at 22:06
@Marca are the regexes that you are going to use simple phrases as you gave in the pastie? Then, if you separate the disjunctions, the problem is simpler. You can go with my second version. – sawa Apr 14 '11 at 22:14
@sawa, indeed I used the .select method in this example instead as you recommended. – Marca Apr 14 '11 at 22:21
@sawa, I'm afraid I don't really understand the "if you separate the disjunctions, the problem is simpler" what do you mean ? – Marca Apr 14 '11 at 22:22
You have disjunctions in your regex. If you separate them into individual regexes, that corresponds to your `pattern`, then you can use my second solution. – sawa Apr 14 '11 at 22:24
@Marca You said there is no way you can have such patterns. I didn't get that. – sawa Apr 14 '11 at 22:26
@sawa, I think I suppose I'm basically stuck with recursion. – Marca Apr 14 '11 at 22:29
I added to my answer. I think having the hash as in my answer is more standard than using an array like in yours. – sawa Apr 14 '11 at 22:42
@sawa, thanks for your answers, I'll benchmark both iteration and building of a hash + building a big regex query and inspecting the matches. Considering there will be a huge number of patterns to match the string against, I don't know which way will be the fastest. Will post the resulting benchmarks here. – Marca Apr 15 '11 at 00:10

score 2 · Answer 2 · answered Apr 14 '11 at 15:54

2

If you are only interested in whether any of the patterns match the text, then consider combining all patterns into a single big regex, using the regex 'or' operator, and compiling that giant regex once.

For instance, if your patterns are: A, B, C, create a single regex of the form A|B|C

Sorry, I don't know Ruby, but hopefully you can turn that into code (:

Side Note: This is how Mercurial's .hgignore files are handled last I looked. In that case there are 1000s of filenames that get thrown at the one big regex, which is more efficient than those filenames getting thrown at each of hundreds of smaller regexes.

answered Apr 14 '11 at 15:54

jwd

10,837
3
43
67

I also though about that. For that purpose you can do `Regexp.union(A, B, C)`. But the question seems to be asking to get which one matches. – sawa Apr 14 '11 at 15:58
Yes I need to find what pattern matches. Consider an Expression model which contains a _pattern_ attributes being a regular expression. `Expression.all.select{ |expression| content =~ expression.pattern }` => `[<#Expression>, <#Expression>, ..., <#Expression>]` i.e. a list of expressions for which their _pattern_ attribute matches _content_. So I can't really use j.w.'s solution here (i.e. merging patterns and build a huge regexp.) – Marca Apr 14 '11 at 16:25

the Tin Man · Answer 3 · 2011-04-15T15:12:33.200

1

How about:

text = 'Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor magna'
targets = [ /(am?et)/, /(ips.m)/, /(elit)/, /(magna)/, /([Ll]or[eu]m)/ ]

regex = Regexp.union(targets)

hits = []
text.scan(regex) { |a| hits += a.each_with_index.to_a }
r = hits.select{ |w,i| w }.map{ |w,i| targets[i]} # => [/([lL]or[eu]m)/, /(ips.m)/, /(am?et)/, /(elit)/, /(magna)/]

This works to return the matched patterns in the order that the words were found in the text.

There's probably a way to do it using named-captures too.

edited Apr 15 '11 at 15:12

answered Apr 15 '11 at 04:18

the Tin Man

158,662
42
215
303

I tried `union` (as in a previous edit of my answer), but realized that, once a matching succeeds, it stops there, and does not see if alternative regexes match. – sawa Apr 15 '11 at 05:32
Normally a regex is happy when all its conditions are met. With a union, that will be if any of the `|` joined patterns matches because they're `or'd` together. `Scan` finds all matches by walking through the string, retrying the pattern each time. If a different part of the regex union hits you'll still see it. – the Tin Man Apr 15 '11 at 05:40
I see. Your `union` works because the regexes are (probably mutually exclusive) words/phrases, as given in the linked page provided in a comment by the questioner. (When this question was first asked, that was not clear; I had to consider the possibility that more than one of the regexes might match the same portion of the string. In that case, `union` could not be used.) – sawa Apr 15 '11 at 06:25

score 0 · Answer 4 · answered Apr 15 '11 at 04:56

What you want is exactly what a lexer has been designed to do. Pick out a set of regular expressions from an input stream with only a single pass over the input required.

Unfortunately I haven't been able to find a good lexer gem for Ruby which lets you define your own lexer. I'll update the answer if I find anything.

What is the most efficient way to search a blob of text for an array of regular expressions?

4 Answers4