This code hangs in what looks like an infinite loop.

Any idea why that is? Is it a bug in .NET? Can I do something about it?

Dim urlRegex As New Regex("((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|ftp[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'"".,<>?«»“”‘’]))",
    RegexOptions.IgnoreCase)

Dim match As System.Text.RegularExpressions.Match = urlRegex.Match("<a ""javascript:window.Add(location.href,document.title)"">")
Oded
Jiri
  • Can you throw a try/catch around it and see if you get any exceptions? – user1231231412 Jan 18 '12 at 13:46
  • No exception when executed in Try/Catch block. – Jiri Jan 18 '12 at 13:51
  • Jiri, it doesn't hang, it just takes a veryyyyy loooong time. So it's not a bug, you should edit your question to ask for ways to optimize your Regex instead. – Meta-Knight Jan 18 '12 at 13:56
  • Meta-Knight: Are you sure? It ran overnight and didn't finish. – Jiri Jan 18 '12 at 13:59
  • Your problem is almost definitely backtracking related. Read the MSDN article on it here: http://msdn.microsoft.com/en-us/library/gg578045.aspx#Backtracking. If that makes sense, try to implement it. If it doesn't (and even if it does), for the sake of future coders, including your future self, rewrite your regex to be something simpler, possibly two or three regexes. Find a larger pattern that you can target, pull out the substring, and then parse the smaller parts. And @Meta-Knight is right, it took about 5 minutes to complete on my machine. – Chris Haas Jan 18 '12 at 14:06
  • @Jiri: It also took 5 minutes to complete on my machine. – Meta-Knight Jan 18 '12 at 14:09
  • Thanks for the info guys. I will try to find more optimized regex for finding URLs. – Jiri Jan 18 '12 at 14:44
  • I'd rather use a regex to find "href" and "src" attributes and then extract their content from the HTML, instead of matching the URL itself; that way you can get absolute and relative URLs too. But that is probably too simplistic an approach... ;-) – H27studio Jan 18 '12 at 15:42
  • H27: We use this to convert mainly plain text, which can contain HTML snippets. The input can be nearly anything. – Jiri Jan 18 '12 at 15:48
  • You can time out the regex operation; this answer might help you: https://stackoverflow.com/a/7616440/1434834 – Nikki Dec 07 '17 at 10:23

1 Answer

As others have mentioned, this is due to excessive backtracking. A good article on this topic can be read here: Catastrophic Backtracking.

Your options are:

  1. Define your pattern better, without nested quantifiers that can cause catastrophic backtracking. This requires you to define your problem better. Build a list of possible inputs and perhaps a better pattern will emerge. Your pattern looks like it's trying to do too much, by specifying what is allowed and what isn't allowed. Sometimes it's possible to simplify the pattern by doing one or the other. What do you want to match?

  2. Use .NET 4.5's new Regex timeout feature (once it's officially released). Although this isn't a direct solution to your problem it does aid against hanging matches caused by poor patterns. I've covered this here: How do I timeout Regex operations to prevent hanging in .NET 4.5?
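Both options can be illustrated with a small sketch. It assumes .NET 4.5 or later, where the Regex constructor accepts a TimeSpan timeout. The toy pattern ^(a+)+$ and the test input are stand-ins chosen for illustration, not the URL pattern from the question. The first regex uses a nested quantifier and is bounded by a one-second timeout; the second replaces the inner group with an atomic group, (?>a+), which prevents the engine from re-splitting the run of "a" characters. How long the first match actually runs may vary by runtime version, so no output is claimed here.

```vb
Imports System
Imports System.Text.RegularExpressions

Module RegexTimeoutSketch
    Sub Main()
        ' Hypothetical input: a run of "a" characters ending in a character
        ' that forces the overall match to fail.
        Dim input As String = New String("a"c, 40) & "!"

        ' Nested quantifier: on failure, a backtracking engine may try a very
        ' large number of ways to split the run of "a"s between iterations.
        ' The third argument caps how long a single match may run.
        Dim nested As New Regex("^(a+)+$", RegexOptions.None, TimeSpan.FromSeconds(1))
        Try
            Console.WriteLine("Nested quantifier matched: " & nested.IsMatch(input))
        Catch ex As RegexMatchTimeoutException
            ' Thrown when the match exceeds the configured timeout.
            Console.WriteLine("Gave up after " & ex.MatchTimeout.ToString())
        End Try

        ' Atomic group: once (?>a+) has consumed the "a"s it does not give
        ' any of them back, so the failure is detected without re-splitting.
        Dim atomic As New Regex("^(?>a+)+$", RegexOptions.None, TimeSpan.FromSeconds(1))
        Console.WriteLine("Atomic group matched: " & atomic.IsMatch(input))
    End Sub
End Module
```

The timeout only limits the damage from a poor pattern; the atomic group is one way of removing the cause. Whether atomic groups fit the URL pattern in the question depends on which inputs it needs to accept.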

Ahmad Mageed