When I run the following code, the CPU load goes way up, and it takes a long time on larger documents:

string pattern = @"\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*";
Regex regex = new Regex(
      pattern,
      RegexOptions.None | RegexOptions.Multiline | RegexOptions.IgnoreCase);

MatchCollection matches = regex.Matches(input); // Here is where it takes time
MessageBox.Show(matches.Count.ToString());

foreach (Match match in matches)
{
    ...
}

Any idea how to speed it up?

Michael Petrotta
Jacqueline
  • Can you quantify "long time"? – Michael Petrotta Oct 10 '12 at 05:34
  • On large documents it can take up to 1 minute per document – Jacqueline Oct 10 '12 at 05:37
  • "Large" might also be worth defining, I think... – ChimeraObscura Oct 10 '12 at 05:38
  • Sorry, this is large: http://www.amelia.se/Pages/Amelia-search-result-page/?q= – Jacqueline Oct 10 '12 at 05:39
  • When I load the HTML from the page above, it takes about 60 seconds to run through the regex – Jacqueline Oct 10 '12 at 05:39
  • Do you have to use regex? Ordinarily, when parsing HTML, an HTML parser is used instead. See this [famous answer](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) for the horror that awaits the unwary ;) – RB. Oct 10 '12 at 06:19
  • RegexHero crashes when processing your document with the regex you supplied http://i45.tinypic.com/2i72ryb.jpg. What exactly are you trying to capture? Maybe we can help you build a better regex. – RePierre Oct 10 '12 at 06:20

3 Answers

Changing RegexOptions.None | RegexOptions.Multiline | RegexOptions.IgnoreCase to RegexOptions.Compiled yields the same results: IgnoreCase only affects literal letters and Multiline only affects ^ and $, and your pattern contains neither.

On my machine this reduces the time taken on the sample document you linked from 46 seconds to 21 seconds (which still seems slow to me, but might be good enough for you).
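The claim that both option sets return the same matches can be checked on any input with a small sketch like the one below. The class and method names (OptionsCheck, Find, SameResults) are hypothetical helpers introduced only for illustration; the pattern is copied unchanged from the question.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of the original code.
public static class OptionsCheck
{
    // Pattern copied unchanged from the question.
    const string Pattern = @"\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*";

    // Returns every matched substring for the given option set.
    public static string[] Find(string input, RegexOptions options)
    {
        return new Regex(Pattern, options)
            .Matches(input)
            .Cast<Match>()
            .Select(m => m.Value)
            .ToArray();
    }

    // True when the original options and RegexOptions.Compiled
    // produce the same matches, in the same order, on this input.
    public static bool SameResults(string input)
    {
        string[] original = Find(input, RegexOptions.Multiline | RegexOptions.IgnoreCase);
        string[] compiled = Find(input, RegexOptions.Compiled);
        return original.SequenceEqual(compiled);
    }
}
```

Calling OptionsCheck.SameResults(input) on your own documents should return true. Note that RegexOptions.Compiled trades a slower construction for faster matching, so it tends to pay off only when the same Regex instance is reused.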

EDIT: So I looked into this some more and have discovered the real issue.

The problem is with the first half of your regex: \w+([-+.]\w+)*@. This works fine when matching sections of the input that actually contain the @ symbol, but for sections that match just \w+([-+.]\w+)* and are not followed by @, the regex engine wastes a lot of time backtracking and retrying from each position in the sequence (and failing again, because there is still no @).

You can fix this by forcing the match to start at a word boundary using \b:

 string pattern = @"\b\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*";

On your sample document, this produces the same 10 results in under 1 second.
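A minimal sketch of how the anchored pattern might be wrapped and timed is shown below. The names EmailScan, FindEmails and CountWithTiming are hypothetical, and combining \b with RegexOptions.Compiled is an assumption based on the earlier part of this answer, not something measured here. The leftmost match of the original pattern always begins at the start of a run of word characters, so adding \b should not change which matches are found.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of the original code.
public static class EmailScan
{
    // The \b stops the engine from retrying the match from the middle
    // of a word, which is where the time was being wasted.
    static readonly Regex Anchored = new Regex(
        @"\b\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*",
        RegexOptions.Compiled);

    // Returns all matches of the anchored pattern in the input.
    public static MatchCollection FindEmails(string input)
    {
        return Anchored.Matches(input);
    }

    // Counts the matches and reports how long the scan took.
    public static int CountWithTiming(string input, out TimeSpan elapsed)
    {
        Stopwatch watch = Stopwatch.StartNew();
        // MatchCollection is evaluated lazily; reading Count forces the full scan.
        int count = FindEmails(input).Count;
        watch.Stop();
        elapsed = watch.Elapsed;
        return count;
    }
}
```

Running CountWithTiming once with this pattern and once with the original pattern on the same document gives a direct before/after comparison on your own machine.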

verdesmarald

Try using a regex implementation that works on streams, such as the Mono project's regex; this article may also be useful for .NET.

Also try to improve the performance of the regex itself.

Ria

To say how to change it, you need to tell us what it should match.

The problem is probably in the last part: @\w+([-.]\w+)*\.\w+([-.]\w+)*. On a string like "bla@a.b.c.d.e-f.g.h", it has to try many possibilities until it finds a match.

This could be a mild case of catastrophic backtracking.

So you need to define your pattern in a better, more "unique" way. Do you really need "dash/dot - dot - dash/dot"?

stema