
I am trying to parse URLs in a string of text. Currently my RegEx pattern looks like this:

(http(s)?://)?\S+\.(com|net|org|edu)\S*(?<!\W)

Sample text:

On that sample page (http://example.com/test/new.php), when you use the button, they are there, but when you use the inline, they are not.

Right now it keeps capturing the opening (. I can't seem to get this right. Any tips? I am using .NET 4.0 and C# to parse this.

UPDATE: a sample text more reflective of the links it needs to capture

On that sample page (http://example.com/test/new.php), when you use the button, it redirects to sample.com/help instead of https://www.example.com or just example.com
Mike_G
  • See if this link helps: http://code.tutsplus.com/tutorials/8-regular-expressions-you-should-know--net-6149 or this one: http://stackoverflow.com/questions/6038061/regular-expression-to-find-urls-within-a-string – PaulF Jul 30 '15 at 15:46
  • Not to address the question as asked, but as a forewarning going forward: there are a whole ton of top-level domains that you're not accounting for: https://en.wikipedia.org/wiki/List_of_Internet_top-level_domains. Any of these could produce perfectly valid URLs, and most, if not all, of them are already in active use. The list can also have items added to it at any time... – user2366842 Jul 30 '15 at 16:03
  • @user2366842 yeah, I'm only concerned about the ones I listed. Thanks though. – Mike_G Jul 30 '15 at 16:12

5 Answers


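Because you have a ? after your first group, (http(s)?://)?, the regex engine is free to match that group zero times. It tries every starting position from left to right, so it reaches the ( before it reaches http. At that position the optional group matches nothing, and because the next part of the expression is \S+, it is free to match the parenthesis and the rest of the URL as well.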

Removing the ? should do the trick in this case, but doesn't solve the problem of making it optional. Let me know if that part actually needs to be optional and maybe give some additional sample data.
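
A minimal C# sketch of that behaviour, run against the first sample sentence from the question. The class and method names (OptionalSchemeDemo, FirstMatch) are hypothetical and exist only for illustration; the pattern is the one posted in the question.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names here are assumptions for illustration.
public static class OptionalSchemeDemo
{
    // The pattern exactly as posted in the question.
    public const string Pattern = @"(http(s)?://)?\S+\.(com|net|org|edu)\S*(?<!\W)";

    // Returns the first match of the question's pattern in the given text.
    public static Match FirstMatch(string text)
    {
        return Regex.Match(text, Pattern);
    }

    public static void Main()
    {
        string text = "On that sample page (http://example.com/test/new.php), "
                    + "when you use the button, they are there, "
                    + "but when you use the inline, they are not.";

        Match m = FirstMatch(text);

        // The match starts at the "(" because the scheme group matched nothing there.
        Console.WriteLine(m.Value);             // (http://example.com/test/new.php
        Console.WriteLine(m.Groups[1].Success); // False: group 1 did not participate
    }
}
```

Group 1 reporting no success is the tell: the scheme was swallowed by \S+ rather than by the group meant to capture it.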

Kit
gymbrall
  • Yes, the https:// does need to be optional. I have updated my post to reflect the URLs it needs to capture – Mike_G Jul 30 '15 at 16:08

If you add a \b (word boundary) anchor in front of your regex, it will work as intended:

\b(http(s)?://)?\S+\.(com|net|org|edu)\S*(?<!\W)
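
A quick way to check this against the updated sample text from the question, as a sketch only. WordBoundaryDemo and FindUrls are hypothetical names; the pattern is the one shown above. It says nothing about how the pattern behaves on input other than this sample.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names here are assumptions for illustration.
public static class WordBoundaryDemo
{
    // The question's pattern with a leading \b, as suggested in this answer.
    public const string Pattern = @"\b(http(s)?://)?\S+\.(com|net|org|edu)\S*(?<!\W)";

    // Collects every match value; Cast<Match>() keeps this usable on .NET 4.0,
    // where MatchCollection only implements the non-generic IEnumerable.
    public static List<string> FindUrls(string text)
    {
        return Regex.Matches(text, Pattern)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToList();
    }

    public static void Main()
    {
        string text = "On that sample page (http://example.com/test/new.php), "
                    + "when you use the button, it redirects to sample.com/help "
                    + "instead of https://www.example.com or just example.com";

        foreach (string url in FindUrls(text))
        {
            Console.WriteLine(url);
        }
        // http://example.com/test/new.php
        // sample.com/help
        // https://www.example.com
        // example.com
    }
}
```

There is no word boundary between the space and the (, since both are non-word characters, so the match can no longer begin at the parenthesis.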

Shlomo

The problem is that the optional (http(s)?://)? is allowed to match nothing, which leaves the greedy \S+ free to consume the opening parenthesis along with the rest of the URL.

Your expression effectively becomes:

\S+\.(com|net|org|edu)\S*(?<!\W)

You can see this by removing the "?" from the http expression:

(http(s)?://)\S+\.(com|net|org|edu)\S*(?<!\W)
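
As a rough illustration, here is what the required-scheme version returns for the updated sample text in the question. RequiredSchemeDemo and FindUrls are hypothetical names used only for this sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names here are assumptions for illustration.
public static class RequiredSchemeDemo
{
    // Same as the question's pattern, but the scheme group is no longer optional.
    public const string Pattern = @"(http(s)?://)\S+\.(com|net|org|edu)\S*(?<!\W)";

    // Collects every match value from the text.
    public static List<string> FindUrls(string text)
    {
        return Regex.Matches(text, Pattern)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToList();
    }

    public static void Main()
    {
        string text = "On that sample page (http://example.com/test/new.php), "
                    + "when you use the button, it redirects to sample.com/help "
                    + "instead of https://www.example.com or just example.com";

        foreach (string url in FindUrls(text))
        {
            Console.WriteLine(url);
        }
        // http://example.com/test/new.php
        // https://www.example.com
    }
}
```

The parenthesis is gone, but so are sample.com/help and example.com, which is the trade-off of making the scheme mandatory.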

You also might want to read this for more thoughts on the problem's real difficulty.

https://mathiasbynens.be/demo/url-regex

Robert Horvick

Thanks to gymbrall for showing me why it's wrong, and to PaulF for pointing me to a Stack Overflow question with a partial answer. I was able to modify the regex from that question to fit my needs:

((http|ftp|https):\/\/)*([\w\-_]+(?:(?:\.[\w\-_]+)+))([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?(?<!\W)

With the sample text:

On that sample page (http://example.com/test/new.php), when you use the button, it redirects to sample.com/help instead of https://www.example.com or just example.com

The regex will correctly match:

http://example.com/test/new.php
sample.com/help
https://www.example.com
example.com
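
For reference, a small C# sketch that runs this regex over the sample text. The class and method names (UrlExtractor, FindUrls) are hypothetical, and the pattern is written with a plain & inside the character classes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper class; the names here are assumptions for illustration.
public static class UrlExtractor
{
    // Optional scheme(s), then a dotted host, then an optional path/query part.
    // The trailing (?<!\W) keeps the match from ending on punctuation.
    public const string Pattern =
        @"((http|ftp|https):\/\/)*([\w\-_]+(?:(?:\.[\w\-_]+)+))" +
        @"([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?(?<!\W)";

    // Collects every match value; Cast<Match>() keeps this usable on .NET 4.0.
    public static List<string> FindUrls(string text)
    {
        return Regex.Matches(text, Pattern)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToList();
    }

    public static void Main()
    {
        string text = "On that sample page (http://example.com/test/new.php), "
                    + "when you use the button, it redirects to sample.com/help "
                    + "instead of https://www.example.com or just example.com";

        foreach (string url in FindUrls(text))
        {
            Console.WriteLine(url);
        }
    }
}
```

Note that the host part accepts any dotted token, so this version is not limited to .com, .net, .org and .edu the way the original pattern was.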
Community
Mike_G

I'm not 100% sure why that isn't working, but this one should get the job done for you.

(http://?|https://?)\S+\.(com|net|org|edu)\S*(?<!\W)

Give it a shot here: http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx

Linkfan03