Hundreds of RegEx on one string

Question

I am using Java/Groovy to find matches(and extract them) on a string through RegEx. What is the best way of finding matches of 200 or more regex on a string of, lets say, 5000 characters in terms of performance. In a nutshell, is it possible to avoid scanning the string for each RegEx?

I can use the Pattern and Matcher classes provided by java but then I will have to compile 200 patterns and then pass the string to matcher 200 times. Is that the only way of doing it?

Do your patterns have any similarites? Could you search for a few generic patterns (e.g. A.A), extract the substring and then pattern match it on the more specific ones (ABA, ACA)? — Paolo, Aug 11 '11 at 07:37
What are those regexes? Are they in order? If not, you'd have to scan the whole string for each regex anyways. If yes, you might start after the last match. — Thomas, Aug 11 '11 at 07:37
Are you using "real" regexes or are you simply looking for words (a full text search). In the latter case, some index might reduce the runtime. — Jens, Aug 11 '11 at 07:45
@Paolo you are right I can make substrings but still each substring will have to go through 10-15 regexes. — Sap, Aug 11 '11 at 09:01
@Thomas No they are not in order, the matches in the string can be at any position. — Sap, Aug 11 '11 at 09:03

score 3 · Accepted Answer · answered Aug 11 '11 at 07:53

3

If your regexes do not have common matches you can always combine them in a gigantic one by using alternatives, e.g.

( regex1 ) | ( regex2 ) | .... | ( regexN )

However given the complexity of your problem I think you should consider switching from regexes to a proper scanner/parser combination. It will take time upfront, but the resulting solution will be much more manageable. Why don't you check out Antlr?

answered Aug 11 '11 at 07:53

Nicola Musatti

17,834
2
46
55

+1 for using a proper parser. It *will* make your life much easier. – exhuma Aug 11 '11 at 08:43
Will Antlr work for loosely defined grammer? For example if I want to capture height of a person from a statement, there are so many ways to express the fact. For Ex: his height is 6 feet, he is 6 feet tall etc. I was thinking of compiling list of regex which can extract height from most common types of sentences. Also, I am not interested in capturing just one such fact there are 20 30 more – Sap Aug 11 '11 at 09:19
Antlr is oriented towards recognizing formal languages, so it is probably the most specific tool you can find. On the other hand what you might be able to do is to isolate specific keywords and constructs while ignoring the rest. You would probably end up with something that is much more structured than a bunch of regexps, but it will take some work to explicitly handle stuff you don't care about. – Nicola Musatti Aug 11 '11 at 09:42
@Nicola Mussatti Just to confirm, do you still think that Antlr can be of rescue for such extraction from natural language? – Sap Aug 11 '11 at 10:32
As a substitute for a bunch of regexps I believe it may be useful, but it certainly isn't a perfect match for your problem, especially if you don't have previous experience with parser generators. In your place I wouldn't give up on Antlr without reading a bit about it. – Nicola Musatti Aug 11 '11 at 10:41

Hundreds of RegEx on one string

1 Answers1

Linked