I thought this would be a simple Google search, but apparently not. What is a regex I can use in C# to parse out a URL, including any query string, from a larger text? I have spent lots of time and found lots of examples that don't include the query string. And I can't use System.Uri, because that assumes you already have the URL... I need to find it in surrounding text.
What are the rules? Are they going to be properly encoded, or could there be spaces in the string? If they're going to be properly encoded, just about any of the patterns you've previously found should work if you simply append a search for non-whitespace characters to the end of it. – Guildencrantz Feb 26 '10 at 17:02
6 Answers
This should get just about anything (feel free to add additional protocols):
@"(https?|ftp|file)\://[A-Za-z0-9\.\-]+(/[A-Za-z0-9\?\&\=;\+!'\(\)\*\-\._~%]*)*"
The real difficulty is finding the end. As is, this pattern relies on finding an invalid character. That would be anything other than letters, numbers, hyphen or period before the end of the domain name, or anything other than those plus forward slash (/), question mark (?), ampersand (&), equals sign (=), semicolon (;), plus sign (+), exclamation point (!), apostrophe/single quote ('), open/close parentheses, asterisk (*), underscore (_), tilde (~), or percent sign (%) after the domain name.
Note that this would allow invalid URLs like
http://../
And it would pick up stuff after a URL, such as in this string:
Maybe you should try http://www.google.com.
where "http://www.google.com." (with the trailing period) would be matched.
It would also miss URLs that don't begin with a protocol specification (specifically, one of the protocols within the first set of parentheses). For instance, it would miss the URL in this string:
Maybe you should try www.google.com.
It's very difficult to get every case without some better-defined boundaries.
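As a rough illustration of how the pattern above might be used from C#, here is a minimal sketch. The class name `UrlFinder`, the method `FindUrls`, and the set of characters trimmed from the end of each match are assumptions made for this example, not part of the answer. Trimming trailing punctuation is only a heuristic for the "trailing period" problem described above, and it will also strip a period or question mark that really belongs to the URL.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class UrlFinder
{
    // Pattern copied unchanged from the answer above.
    private static readonly Regex UrlPattern = new Regex(
        @"(https?|ftp|file)\://[A-Za-z0-9\.\-]+(/[A-Za-z0-9\?\&\=;\+!'\(\)\*\-\._~%]*)*");

    // Assumption: these characters are more likely sentence punctuation
    // than part of the URL when they appear at the very end of a match.
    private static readonly char[] TrailingPunctuation = { '.', ';', '!', '?', '\'' };

    // Returns every match in the text, with trailing punctuation trimmed.
    public static List<string> FindUrls(string text)
    {
        var urls = new List<string>();
        foreach (Match m in UrlPattern.Matches(text))
        {
            urls.Add(m.Value.TrimEnd(TrailingPunctuation));
        }
        return urls;
    }
}
```

Calling `UrlFinder.FindUrls("Maybe you should try http://www.google.com/search?q=regex&hl=en.")` should return a single entry, `http://www.google.com/search?q=regex&hl=en`, with the query string kept and the sentence-ending period dropped.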

Not working... see response over here: http://stackoverflow.com/questions/9125016/get-url-from-a-text – nikib3ro May 10 '12 at 07:54
@kape123: "Not working" is not very helpful. I pointed out exactly what its shortcomings were. It works as described. Is there some other case that you'd expect to work that doesn't? – P Daddy May 11 '12 at 02:59
Use the ABNF at the end of RFC3986 as a starting point to get it right.
This uses them for URI validation in Python; not what you're looking for, but it should give an idea of the direction you should go in: http://gist.github.com/138549
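Short of translating the ABNF by hand, one possible direction in C# is to pick out loose candidates with a permissive pattern and let System.Uri, which broadly follows RFC 3986, decide whether each one parses. This is only a sketch under that assumption: the names `UriCandidates` and `FindValidUris` and the candidate pattern (a known scheme followed by any run of non-whitespace) are invented for this example, and System.Uri accepts some strings a strict reading of the RFC might not.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class UriCandidates
{
    // Assumption: a candidate is a known scheme plus everything up to the next whitespace.
    private static readonly Regex Candidate = new Regex(@"(?:https?|ftp|file)://\S+");

    // Keeps only the candidates that System.Uri can parse as absolute URIs.
    public static List<Uri> FindValidUris(string text)
    {
        var result = new List<Uri>();
        foreach (Match m in Candidate.Matches(text))
        {
            Uri uri;
            if (Uri.TryCreate(m.Value, UriKind.Absolute, out uri))
            {
                result.Add(uri);
            }
        }
        return result;
    }
}
```

Once a candidate parses, the query string is available as `uri.Query` (including the leading `?`), so it does not need its own capture group in the regex.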

Sorry I'm not yet able to add comments, but would like to point out that P Daddy's answer requires a little tweaking:
@"(https?|ftp|file)\://[a-zA-Z0-9\.\-]+(/[a-zA-Z0-9\?\&\=;\+!'\(\)\*\-\._~%]*)*"

I can't find anything different except that you reversed the order of upper- and lower-case characters (a no-op), and in so doing, fixed a typo I had where I had `a-Z` (lower-case `a` to upper-case `Z`). Next time, it would be simpler to just point out the typo. I'll fix it. – P Daddy Aug 13 '10 at 16:23
I came up with the following:
URL with protocol
^(https?|ftp|file)\:\/\/([a-zA-Z0-9]+[a-zA-Z0-9\-_])+(\.([a-zA-Z0-9]+[a-zA-Z0-9\-_])+)+(\/(?!\/)[a-zA-Z0-9\-_\.]*)*(\??)[a-zA-Z0-9_\-\.~=%]*$
URL without protocol
^([a-zA-Z0-9]+[a-zA-Z0-9\-_])+(\.([a-zA-Z0-9]+[a-zA-Z0-9\-_])+)+(\/(?!\/)[a-zA-Z0-9\-_\.]*)*(\??)[a-zA-Z0-9_\-\.~=%]*$
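Both patterns are anchored with `^` and `$`, so as written they only match when the whole input is a URL. For the original question (finding a URL inside surrounding text) the anchors would have to be dropped. A minimal sketch of the difference, using the "URL with protocol" pattern; the class and field names are made up for this example:

```csharp
using System.Text.RegularExpressions;

static class AnchoredVsUnanchored
{
    // Body of the "URL with protocol" pattern above, without the leading ^ and trailing $.
    private const string Core =
        @"(https?|ftp|file)\:\/\/([a-zA-Z0-9]+[a-zA-Z0-9\-_])+(\.([a-zA-Z0-9]+[a-zA-Z0-9\-_])+)+" +
        @"(\/(?!\/)[a-zA-Z0-9\-_\.]*)*(\??)[a-zA-Z0-9_\-\.~=%]*";

    // Matches only when the entire input is a URL (the pattern as posted).
    public static readonly Regex WholeString = new Regex("^" + Core + "$");

    // Matches a URL anywhere inside a larger text.
    public static readonly Regex InText = new Regex(Core);
}
```

With the input `Try http://www.google.com/search?q=regex today`, `WholeString.IsMatch` returns false, while `InText.Match(...).Value` is `http://www.google.com/search?q=regex`. Note that the final character class does not include `&`, so a match stops at the first ampersand in a query string.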

Check out this guy's QueryString builder class -
Microsoft also has a UriBuilder that might help you -
http://msdn.microsoft.com/en-us/library/system.uribuilder.query.aspx

These look fine for building query strings, but JoelFan wants to identify URLs, not build them. – thetaiko Feb 26 '10 at 16:54