.Net Regex to extract Urls from text never returns

Question

I have a .Net Regex expression that extracts all valid looking Urls from a string. It passes all 20 of my unit tests except one: when the Url contains a questionmark. This causes the Regex engine to hang the process and wreak havoc.

I've spent quite a bit of time on this expression, but I'm unsure how to fix the issue. Hopefully a Regex pro out there can lend a hand! Two part question:

1) How can this regex pattern be improved so it does not hang when evaluating string "http://www.example.com/?a=1":

Regex pattern:

(?<URL>(\s|^)((?<Scheme>[A-Za-z]{3,9})[:][/][/])?([A-Za-z0-9-]+[.])+([A-Za-z]+){2,4}(?(Scheme)|[/])(((?:[\+~%\/.\w-_]*[/]?)?\??#?(?:[\w]*))?)*(\s|$))

I would suggest using this awesome online .Net Regex testing engine: http://regexhero.net/tester/

2) What can I do in my calling code to prevent/recover from Regex engine hangups? Here's the calling code:

        Regex linkParser = new Regex(UrlMatchRegex, RegexOptions.Compiled | RegexOptions.IgnoreCase);

        // Find matches and add them to the result
        foreach (Match m in linkParser.Matches(message))
        {
            result.Add(m.Value.Trim());
        }

I've spent quite a bit of time on this particular Regex pattern, and have tried at least 7 alternates I've found (including most from this page: http://mathiasbynens.be/demo/url-regex). I would prefer to improve on my existing pattern, rather than use an entirely different one, but I'm open to options.

Final Regex v1:

After a few revisions, with some help from below, my final regex match looks like this:

(?<URL>((?<=\s)|^)((?<Scheme>[A-Za-z]{3,9})[:][/][/])?([A-Za-z0-9-]+[.])+[A-Za-z]{2,4}(?(Scheme)|[/]?)(((?:[\+~%\/.\w-_]*[/]?)?\??#?&?(?:[\w=]*))?)*((?=\s)|$))

**Regex v2: Updated to include a set list of domains, port numbers, hashtags, but does not pull in trailing slashes

(?<URL>((?<=\b)|^)((?<Scheme>[A-Za-z]{3,9})[:][/][/])?([A-Za-z0-9-]+[.])+((?<TLD>\b(aero|asia|biz|cat|com|coop|edu|gov|info|int|jobs|mil|mobi|museum|name|net|org|pro|tel|travel|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cu|cv|cx|cy|cz|cz|de|dj|dk|dm|do|dz|ec|ee|eg|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mn|mn|mo|mp|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|nom|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ra|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|sj|sk|sl|sm|sn|so|sr|st|su|sv|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw|arpa)\b))+(\/?)+((((?:[:\+~%/.\w-_\/])?[\?]?[#]?[&]?(?:[\w=]*))?)+(?=\?|\b)))

Test cases: this is a test of #

Should not match:

like when you a/b test something
this sentence ends with a/
just a #hashtag

Should match:

sos.me extra
sos.me. extra <-- not period
sos.me, extra <-- not comma
sos.me/ extra <-- should match traililng /
sos.me/#hashtag? extra <-- not questionmark
sos.me? extra <-- not questionmark
www.sos.me? extra <-- not questionmark
sos.me/abcde extra
sos.me/abcde#hashtag extra
sos.me/abcdf/0 extra
sos.me/abcdf/0#hashtag extra
sos.me/something.aspx extra
sos.me/something.aspx#hashtag extra
http://something.com: extra <-- not colon
sos.me/something.aspx?name=value extra
sos.me/something.aspx?name=value#hashtag extra
http://something.com extra
https://something.com extra
http://something.com:80 extra
http://something.com:80/ extra <-- should match trailing /
http://something.com:80/?a=v&1=2 extra
http://something.com:80/?a=v&1=2#hashtag extra
http://something.com:80/?a=v&1=2# extra <-- should match trailing #

Where are you testing "="? Your pattern seems not to accept any "=" character. — Cédric Bignon, Jan 28 '13 at 20:34

score 5 · Accepted Answer · edited May 23 '17 at 10:24

5

QUESTION 1

This is causing catastrophic backtracking:

([A-Za-z]+){2,4}

And:

[\+~%\/.\w-_]*[/]?

I'm guessing you meant:

([A-Za-z]{2,4})

And:

[\+~%\/.\w-_]*

Your regex does not currently match URLs with equal signs, which is why the catastrophic backtracking happens in the first place:

\??#?(?:[\w]*)

does not match '?a = 1' because \w does not include '='. You can fix this pretty easily:

\??#?(?:[\w=%]*)

(I threw '%' in there too)

Also, your regex is matching the whitespace before and after your URL. You may prefer lookarounds to \s as this will match a position before or after whitespace rather than the whitespace itself:

(?<URL>((?<=\s)|^)((?<Scheme>[A-Za-z]{3,9})[:][/][/])?([A-Za-z0-9-]+[.])+[A-Za-z]{2,4}(?(Scheme)|[/])(((?:[\+~%\/.\w-_]*[/]?)?\??#?(?:[\w=]*))?)*((?=\s)|$))

QUESTION 2

There's nothing you can do to detect or recover from a catastrophically backtracking regular expression. The best you could try is spinning the regex off into an independent thread and then terminating the thread after a certain timeout. Exactly how to do that would be an entirely different question, but it's not difficult to find tutorials online for .NET.

See:

edited May 23 '17 at 10:24

Community

1
1

answered Jan 28 '13 at 20:34

JDB

25,172
5
72
123

2

The character "=" is not tested anywhere in the regex. – Cédric Bignon Jan 28 '13 at 20:44
You are right it is a backtracking problem. I was under the impression that this portion of the match was for questionmarks or hashes: \??#? (zero or one escaped questionmark) (zero or one hashmark) – mellodev Jan 28 '13 at 20:45
Dev box crashed, gimme a few and I'll update/mark answer. tyvm for quick replies! – mellodev Jan 28 '13 at 20:49
1

+1 I somehow found the fix too, but wouldn't have been able to explain it this well :) – Cristian Lupascu Jan 28 '13 at 20:52
Brilliant, \b does fix the issue of extraneous spaces (calling code was doing a .trim() afterwards to compensate. However, if I repalce the \s with \b on both ends, it will not return the trailing slash on (www.example.com/) – mellodev Jan 28 '13 at 20:54
1

@mellodev - That's true. Updated. – JDB Jan 28 '13 at 20:58
Excellent! I added to the expression to find cases with no scheme or trailing slash, and cases including an amphersand. I'll update my original post with the final query. Oh, also any advice for #2 question in my post? – mellodev Jan 28 '13 at 21:07
1

@mellodev - Updated with answer to #2 – JDB Jan 28 '13 at 21:09
Marked as answer - wish I could give you more points. Thanks for all your help! – mellodev Jan 28 '13 at 21:10
@Cyborgx37 I had to make a few changes to the expression to limit to valid TLDs, include amphersands, hashtags, disregard trailing punctuation, etc. All I need to have it pull in trailing slashes, but I can't seem to get it to go. Would you mind taking one more look please? Check the last expr in my question – mellodev Jan 29 '13 at 01:46

.Net Regex to extract Urls from text never returns

1 Answers1