Regex extremely slow on large documents

Question

When running the following code, the CPU load goes way up and it takes a long time on larger documents:

string pattern = @"\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*";
Regex regex = new Regex(
      pattern,
      RegexOptions.None | RegexOptions.Multiline | RegexOptions.IgnoreCase);

MatchCollection matches = regex.Matches(input); // Here is where it takes time
MessageBox.Show(matches.Count.ToString());

foreach (Match match in matches)
{
    ...
}

Any idea how to speed it up?

Sorry, this is large: http://www.amelia.se/Pages/Amelia-search-result-page/?q= — Jacqueline, Oct 10 '12 at 05:39
When I load the HTML from the page above, it takes about 60 seconds to run through the regex — Jacqueline, Oct 10 '12 at 05:39
Do you have to use regex? Ordinarily, when parsing HTML, an HTML parser is used instead. See this [famous answer](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) for the horror that awaits the unwary ;) — RB., Oct 10 '12 at 06:19
RegexHero crashes when processing your document with the regex you supplied http://i45.tinypic.com/2i72ryb.jpg. What exactly are you trying to capture? Maybe we can help you build a better regex. — RePierre, Oct 10 '12 at 06:20

verdesmarald · Accepted Answer · 2012-10-11T23:40:07.037

Changing RegexOptions.None | RegexOptions.Multiline | RegexOptions.IgnoreCase to RegexOptions.Compiled yields the same results (since your pattern does not include any literal letters or ^/$).

On my machine this reduces the time taken on the sample document you linked from 46 seconds to 21 seconds (which still seems slow to me, but might be good enough for you).
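A minimal sketch for checking the option change on your own input. The class name `CompiledOptionSketch` and the helper `CountMatches` are made up for illustration; the pattern is copied from the question, and the timings will differ from machine to machine.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public static class CompiledOptionSketch
{
    // Pattern copied from the question.
    public const string Pattern = @"\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*";

    // Counts matches with the given options and reports the elapsed milliseconds.
    public static int CountMatches(string input, RegexOptions options, out long elapsedMs)
    {
        var regex = new Regex(Pattern, options);
        var timer = Stopwatch.StartNew();
        // Reading Count forces the lazily evaluated MatchCollection to scan the whole input.
        int count = regex.Matches(input).Count;
        timer.Stop();
        elapsedMs = timer.ElapsedMilliseconds;
        return count;
    }
}
```

Calling `CountMatches` once with `RegexOptions.Multiline | RegexOptions.IgnoreCase` and once with `RegexOptions.Compiled` should give the same count, so only the elapsed time needs comparing.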

EDIT: So I looked into this some more and have discovered the real issue.

The problem is with the first half of your regex: \w+([-+.]\w+)*@. This works fine when matching sections of the input that actually contain the @ symbol, but for sections that match just \w+([-+.]\w+)* and are not followed by @, the regex engine wastes a lot of time backtracking and retrying from each position in the sequence (and failing again because there is still no @!).
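The retry behaviour can be observed with a small timing sketch. It assumes the part of the question's pattern up to and including the @, and a synthetic input made of one long run of word characters; `BacktrackingSketch`, `WordRun` and `TimeFailedSearch` are illustrative names, not anything from the question.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public static class BacktrackingSketch
{
    // The question's pattern up to and including '@'.
    public const string LocalPart = @"\w+([-+.]\w+)*@";

    // Builds a run of word characters of the given length that contains no '@'.
    public static string WordRun(int length)
    {
        return new string('a', length);
    }

    // Returns the milliseconds spent discovering that nothing in the run matches.
    public static long TimeFailedSearch(int length)
    {
        string input = WordRun(length);
        var timer = Stopwatch.StartNew();
        bool found = Regex.IsMatch(input, LocalPart);
        timer.Stop();
        if (found) throw new InvalidOperationException("the input should not match");
        return timer.ElapsedMilliseconds;
    }
}
```

If the engine really retries from every position, doubling the length (say `TimeFailedSearch(10000)` against `TimeFailedSearch(20000)`) should roughly quadruple the time. Newer .NET runtimes may optimise this case, so treat the numbers as indicative only.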

You can fix this by forcing the match to start at a word boundary using \b:

 string pattern = @"\b\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*";

On your sample document, this produces the same 10 results in under 1 second.
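To confirm on your own documents that the \b version finds the same addresses, something like the following sketch could be used. `WordBoundarySketch`, `FindAll` and `SameMatches` are hypothetical helpers; the two patterns are the ones shown in the question and in this answer.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordBoundarySketch
{
    // Pattern from the question, and the same pattern prefixed with \b.
    public const string Original = @"\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*";
    public const string Anchored = @"\b" + Original;

    // Returns the matched substrings in document order.
    public static string[] FindAll(string pattern, string input)
    {
        return Regex.Matches(input, pattern)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    // True when both patterns return exactly the same matches for this input.
    public static bool SameMatches(string input)
    {
        return FindAll(Original, input).SequenceEqual(FindAll(Anchored, input));
    }
}
```

Running `SameMatches` over a few representative pages before switching patterns is a cheap way to check that nothing is lost.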

You might as well get rid of the `None` option while you're at it. — Alan Moore, Oct 10 '12 at 06:49

score 0 · Answer 2 · answered Oct 10 '12 at 06:12

Try running the regex over a stream instead of one large string, for example with the Mono project's regex implementation. For .NET, this article may be useful:

Building a Regular Expression Stream with the .NET Framework

Also try to improve the performance of the regex itself.

Ria

score 0 · Answer 3 · answered Oct 10 '12 at 06:21

To say how to change it, we need to know what it should match.

The problem is probably in the last part @\w+([-.]\w+)*\.\w+([-.]\w+)*. On a string "bla@a.b.c.d.e-f.g.h" it will have to try many possibilities, till it finds a match.
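A quick way to see what the pattern does with that string is to run it directly. This is only a sketch: `DomainPartSketch` and `FirstMatch` are illustrative names, and the pattern is the one from the question.

```csharp
using System.Text.RegularExpressions;

public static class DomainPartSketch
{
    // Pattern copied from the question.
    public const string Pattern = @"\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*";

    // Returns the first match in the input, or null when there is none.
    public static string FirstMatch(string input)
    {
        Match match = Regex.Match(input, Pattern);
        return match.Success ? match.Value : null;
    }
}
```

`DomainPartSketch.FirstMatch("bla@a.b.c.d.e-f.g.h")` returns the whole string: the greedy `([-.]\w+)*` after the @ first consumes everything up to the end, then gives back the final `.h` so that the required `\.\w+` can match.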

Could be a little bit of Catastrophic Backtracking.

So you need to define your pattern in a better, more "unique" way. Do you really need "Dash/dot - dot - dash/dot"?
