Possible Duplicate:
c# regex email validation

I am currently using the following regex and code to extract email addresses from HTML documents:

string pattern = @"\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*";
Regex regex = new Regex(
      pattern,
      RegexOptions.None | RegexOptions.Compiled);

MatchCollection matches = regex.Matches(input); // Here is where it takes time
MessageBox.Show(matches.Count.ToString());

foreach (Match match in matches)
{
    ...
}

For example, try parsing http://www.amelia.se/Pages/Amelia-search-result-page/?q= over at RegexHero and it crashes.

Is there any way to optimise this?

Elvin
  • 1. Don't parse HTML using regular expressions; use a proper parser. 2. Don't use a regular expression to decide whether a string is an e-mail address; use a library (for an example of how complex such a regex gets, see http://www.ex-parrot.com/pdw/Mail-RFC822-Address.html) – Anders Oct 10 '12 at 08:36
  • I could only think of one reason to extract email addresses from arbitrary HTML documents, and that's one I certainly won't support. – Philipp Oct 10 '12 at 08:43
  • Please read this: http://www.regular-expressions.info/catastrophic.html. That's the reason your regular expression is slow and has a high CPU load. – Daniel Hilgarth Oct 10 '12 at 08:47
  • @Elvin: Read the complete article ;-) – Daniel Hilgarth Oct 10 '12 at 08:50
  • There is a very similar problem being discussed [here](http://stackoverflow.com/q/12803859/20670). You're running into the same problem. – Tim Pietzcker Oct 10 '12 at 08:58
  • @TimPietzcker the topic you link to speaks about V8, the Chrome JavaScript engine. How is that relevant to the RegEx class from .NET? – CodeCaster Oct 10 '12 at 09:21
  • @CodeCaster: It's the same problem of catastrophic backtracking which is practically universal across regex engines. But .NET has a solution that JavaScript doesn't have (and it's even outlined in my answer to the JavaScript question). – Tim Pietzcker Oct 10 '12 at 11:13
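
As a rough illustration of the backtracking issue raised in the comments, here is a minimal sketch of how the question's pattern could be tightened. It assumes the slowdown comes from long runs of word characters that contain no @, which the thread does not confirm. The class `EmailPatternSketch`, the `Tightened` pattern and the `Find` helper are hypothetical names made up for this example. `(?<!\w)` stops the engine from retrying a match from the middle of a word, and the atomic group `(?>...)` stops it from giving characters back once the local part has been consumed.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EmailPatternSketch
{
    // Pattern from the question, unchanged.
    public const string Original =
        @"\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*";

    // Hypothetical variant (not from the thread):
    //   (?<!\w)  a match may only start at the beginning of a word
    //   (?>...)  atomic group: no backtracking into the local part
    public const string Tightened =
        @"(?<!\w)(?>\w+(?:[-+.]\w+)*)@\w+([-.]\w+)*\.\w+([-.]\w+)*";

    // Returns the text of every match of the given pattern in the input.
    public static string[] Find(string pattern, string input)
    {
        MatchCollection matches = Regex.Matches(input, pattern);
        string[] found = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            found[i] = matches[i].Value;
        }
        return found;
    }

    public static void Main()
    {
        string sample =
            "<a href=\"mailto:john.doe+news@example.co.uk\">mail</a> jane@example.org";

        // Print the matches of both patterns so they can be compared by eye.
        Console.WriteLine(string.Join(", ", Find(Original, sample)));
        Console.WriteLine(string.Join(", ", Find(Tightened, sample)));
    }
}
```

The local part can never contain an @, so giving characters back cannot turn a failed attempt into a successful one; that is why the atomic group should not change which addresses are found. Whether this is enough for the page in the question would have to be measured.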

1 Answer

To elaborate on @Joey's suggestion, I would advocate going through your input line by line, dropping any line which does not contain @, and applying your regex only to the ones that do. This should reduce the load considerably.

// Needs: using System.Collections.Generic; using System.IO; using System.Text.RegularExpressions;
private List<Match> find_emails_matches()
{
    List<Match> result = new List<Match>();

    using (FileStream stream = new FileStream(@"C:\tmp\test.txt", FileMode.Open, FileAccess.Read))
    {
        using (StreamReader reader = new StreamReader(stream))
        {
            // Build the regex once, before the loop.
            string pattern = @"\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*";
            Regex regex = new Regex(pattern, RegexOptions.Compiled);

            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Cheap pre-filter: only a line containing '@' can hold an address.
                if (line.IndexOf('@') >= 0)
                {
                    // The expensive step, now limited to candidate lines.
                    MatchCollection matches = regex.Matches(line);
                    foreach (Match m in matches) result.Add(m);
                }
            }
        }
    }

    return result;
}
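
If a single line can still be pathological, the same line filter can be combined with a match timeout. The following is only a sketch, assuming .NET 4.5 or later (where the `Regex` constructor accepts a `TimeSpan`); the class name `EmailScanSketch`, the `FindEmails` helper and the one-second budget are hypothetical choices for illustration, not part of the code above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

public static class EmailScanSketch
{
    // Same pattern as above; the one-second budget is an assumed value.
    private static readonly Regex EmailRegex = new Regex(
        @"\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*",
        RegexOptions.Compiled,
        TimeSpan.FromSeconds(1));

    // Reads the input line by line and collects the matched addresses.
    public static List<string> FindEmails(TextReader reader)
    {
        List<string> result = new List<string>();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            // Skip lines that cannot contain an address.
            if (line.IndexOf('@') < 0) continue;

            try
            {
                // Matches are evaluated lazily, so a timeout surfaces
                // while enumerating the collection.
                foreach (Match m in EmailRegex.Matches(line))
                {
                    result.Add(m.Value);
                }
            }
            catch (RegexMatchTimeoutException)
            {
                // Give up on the rest of this line; matches already
                // collected from it are kept.
            }
        }
        return result;
    }

    public static void Main()
    {
        string text = "no address here\ncontact: jane@example.org\n";
        using (StringReader reader = new StringReader(text))
        {
            foreach (string email in FindEmails(reader))
            {
                Console.WriteLine(email);
            }
        }
    }
}
```

The timeout does not make the regex any faster; it only bounds how long one bad line can stall the scan, at the cost of possibly missing addresses on that line.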
zeFrenchy