0

After having recently read about a phenomenon known as "catastrophic backtracking", it seems that my very own regex pattern is causing some sort of CPU issues. I use this expression to scan large HTML strings from 100k-200k characters. The pattern matches IP addresses in the format IP:port (e.g. 1.1.1.1:90). The pattern is as follows:

private static Regex regIp = new Regex(@"(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9])\." +
        @"(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)\.(25[0-5]|2[0-4]" +
        @"[0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}" +
        @"[0-9]{2}|[1-9]{1}[0-9]{1}|[0-9])\:[0-9]{1,5}", 
        RegexOptions.IgnoreCase | RegexOptions.Compiled | RegexOptions.CultureInvariant);

The expression is used as follows:

MatchCollection matchCol = regIp.Matches(response);

foreach (Match m in matchCol)
{ 
    doWorkWithMatch(m);
}

After running about 100 strings through this regex pattern, it starts to choke the computer and use 99% of the CPU. Is there a more logical way to structure this expression to reduce CPU usage and avoid backtracking? I'm not sure if backtracking is even occurring or if it is just an issue of too many threads executing regex evaluations simultaneously - all input is welcome.

user1111380
  • 551
  • 2
  • 6
  • 17
  • 8
    Do not attempt to parse an irregular language like HTML with Regex - there lies the way of madness. Don't do it. Not ever. Find another way. Join a monastery. Take up crochet. Anything but this! –  Jun 30 '13 at 03:20
  • On a more serious note: this isn't the way to do it, as you've found. I can't help with c#, but look for an alternative. PHP has DOMDocument. Maybe c# has something similar. –  Jun 30 '13 at 03:21
  • Hey @MikeW, thanks for the feedback but the content I need could be present at any point in the HTML document; thus I need to scan the entire page and all HTML - why bother using CPU cycles to render/parse the HTML into a document first? – user1111380 Jun 30 '13 at 03:24
  • 1
    Parsing HTML with a Regex is a non-starter except in very specific circumstances. I'm only suggesting you fnd a different way, so if parsing HTML into a document doesn't suit you, use a different method. –  Jun 30 '13 at 03:30
  • 1
    Take a look at [Html Agility Pack](http://htmlagilitypack.codeplex.com/) - I've never used it, but I've seen it recommended many times when trying to parse HTML with .NET. – Tim Jun 30 '13 at 03:35
  • 2
    @Tim given the pattern op wants to match,regex is the right choice – Anirudha Jun 30 '13 at 03:40
  • 2
    Regexes are fine for this application because you don't care at all about how the language is structured. Mike's point is valid if your search requirements are dependent on html's structure. Language is meaningless in your application. – Daniel Gimenez Jun 30 '13 at 03:41
  • OH, you have _no_ idea [what CPU intensive Regex looks like](http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html), sister. That, my friend, is `Mail::RFC822::Address` for Perl. (I _knew_ bookmarking that would have a use someday) – Cole Tobin Jun 30 '13 at 03:43
  • @MikeW http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Cole Tobin Jun 30 '13 at 03:45
  • I think regex has nothing to do with this issue. i think the problem is coming from "doWorkWithMatch(m)" – Desolator Jun 30 '13 at 04:05
  • What I find the most interesting is the the regex checks for a valid IP address, but the port number portion doesn't care if it's between 1 and 65535. So apparently 127.0.0.1:99999 would be valid in this case. – Agent_9191 Jun 30 '13 at 04:44
  • @user1111380, please tell us how this works out. – Daniel Gimenez Jun 30 '13 at 14:25
  • So, I tried using HTML agility pack to parse the contents into a document and then reference the 'InnerText' of the document for regex matching. No gains; ended up reaching the same CPU deadlock/99% use case eventually. As suggested by @RobinVanPersi, the issue was in doWorkWithMatch(m) - trying to evaluate a very large XMl document with another regex expression was causing the massive CPU use. I still managed to improve my IP scraping regex using this post, though! – user1111380 Jul 01 '13 at 01:57

4 Answers4

5

why are you parsing and validating using regex

you should use this regex to parse the string

\d+[.]\d+[.]\d+[.]\d+(:\d+)?

and then you can check if that ip address has valid range by parsing them to int and then checking the range

Anirudha
  • 32,393
  • 7
  • 68
  • 89
2

This regex looks well designed, and I can't see anywhere you can improve it if your're going for 100% accuracy. However you can test if something simplier that will probably always work improves results.

\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}:\d{1,5}

Obviously this could catch something that isn't right like 999.999.999.999:999. But you have to ask yourself if unrealistic input like that might occur. If this does improve performance, and you're reasonably sure you won't have crazy input like my example, then use it and use your more accurate regex to cull the list.

Daniel Gimenez
  • 18,530
  • 3
  • 50
  • 70
  • 1
    This expression is actually slower in practice than Denomale's listed answer below, plus it requires further validation of IP addresses. Thank you for your effort but no cigar. – user1111380 Jul 01 '13 at 00:34
  • @user1111380 Thanks for the feedback. I'm actually very surprised that this is slower. I have to investigate this further because I would love to know why. Good luck! – Daniel Gimenez Jul 01 '13 at 01:00
2

This is the regex I use for testing and validating IP addresses

I've added your port test at the end:

(?:(?:1[0-9]{2}|2[0-4][0-9]|25[0-5]|[1-9][0-9]|[0-9])\.){3}(?:1[0-9]{2}|2[0-4][0-9]|25[0-5]|[1-9][0-9]|[0-9]):[0-9]{1,5}

enter image description here

I see you're also capturing all the individual octets, you'll get a performance boost by using the non capturing (?:...) syntax and later just split the validated string on the non digits.

Ro Yo Mi
  • 14,790
  • 5
  • 35
  • 43
  • While this wasn't the root cause of the issue, this particular Regex pattern did offer significant performance improvements to the originally posted expression. Thanks! How did you generate that nice graph/visual for the regex?! The real issue at hand was trying to load a 300k character XML document into regex for matching HTML anchor tags - I don't know if it was the fact that it was XML or the fact that it was 300k characters making the CPU stumble but it choked up the program. – user1111380 Jul 01 '13 at 02:00
  • 1
    Cool I'm glad this helped. To make that chart I'm using debuggex.com. Although it doesn't support lookbehinds, named capture groups, or atomic groups it's still handy for understanding the expression flow. There is also regexper.com. They do a pretty good job too, but it's not real time as you're typing. – Ro Yo Mi Jul 01 '13 at 02:37
0

maybe you can group the results and test them later not to be more than 255 and 65535 for the port, i.e. like in Daniel Gimenez's answer but with groups (\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\:(\d{1,5}) and then do your test on the matched groups.

It's generally a bad idea to have so many | in a regular expression.

Mystic Odin
  • 269
  • 1
  • 2
  • 13