Regular expression to find URLs but not include punctuation AFTER the URL

Question

Example: "My site is http://www.abcd.com, and yours is http://www.def.ghi/jkl. Is Fred's https://www.xyz.com? Or is it http://www.xxx.com?abc=def? (I thought his site was http://www.mmm.com), but clearly it's not."

This should extract

http://www.abcd.com http://www.def.ghi/jkl https://www.xyz.com http://www.xxx.com?abc=def http://www.mmm.com

Notes: it should assume that any punctuation following the url is NOT part of the url, e.g. the comma after http://www.abcd.com, is not part of the url. This includes trailing question marks, which I realize in actuality COULD be part of the url. Of course, if a question mark is followed by querystring data, it SHOULD be considered part of the url. Note that urls might be followed by multiple punctuation marks, as in the case of (Is your url http://abcd.com)?

Urls (and their trailing punctuation, if any) will always be followed by a space, a newline/return character -- or they'll be the end of the string being tested.

They will be preceded by a whitespace character or, possibly, an open bracket or parenthesis, as in "Please visit my site (http://www.abcd.com)." Or they'll come at the beginning of the string.

This regexp should work for http, https and ftp.

This is for an Actionscript project. I believe that Actionscript uses the same regular-expression engine as Javascript.

Thanks!

This is what I've started with (((https)|(http)|(ftp))://(.*?))(([\.,!\?;:\)\'\"]?) |$). It fails with urls that are followed by multiple punctuation marks. I notice that StackOverflow does exactly what I want to do in its preview (what you see below where you type the text of a question). I don't know if their solution is hidden somewhere in their Javascript. I'm trying to figure that out. — Marcus Geduld, Jun 27 '11 at 14:28
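To illustrate why that first attempt trips on multiple punctuation marks, here is a minimal JavaScript sketch. It assumes the ActionScript engine behaves like JavaScript's here. The `repaired` pattern is a hypothetical variation for comparison, not something proposed in the thread: it allows any number of trailing punctuation marks before the whitespace or end of string.

```javascript
// The first attempt from the comment above, as a regex literal (slashes escaped).
// It allows at most ONE punctuation mark before the space.
const firstAttempt = /(((https)|(http)|(ftp)):\/\/(.*?))(([\.,!\?;:\)\'\"]?) |$)/;

// Hypothetical repair: zero or more punctuation marks, then whitespace or end of input.
const repaired = /((?:https?|ftp):\/\/.*?)[.,!?;:)'"]*(?:\s|$)/;

const multiPunctText = "(Is your url http://abcd.com)? Maybe.";

// Capture group 1 holds the URL in both patterns.
console.log(firstAttempt.exec(multiPunctText)[1]); // "http://abcd.com)"
console.log(repaired.exec(multiPunctText)[1]);     // "http://abcd.com"
```

With only one optional punctuation slot, the lazy `.*?` has to swallow the `)` so that the `?` can fill the slot before the space.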
Have you tried using the gskinner regex tool? http://gskinner.com/RegExr/ — shanethehat, Jun 27 '11 at 14:30
This seems to work: (\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|]) I found it here http://stackoverflow.com/questions/4563777/extract-and-add-link-to-urls-in-string (I searched earlier and couldn't find anything. Sorry if this is a double post.) — Marcus Geduld, Jun 27 '11 at 14:39
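As a rough check, the pattern from that comment can be run against the example text from the question. This is a JavaScript sketch, assuming ActionScript's engine gives the same results. `extractUrls` is a hypothetical helper name, and the `g` and `i` flags are the ones mentioned later in the thread.

```javascript
// The pattern from the comment, as a regex literal with global and case-insensitive flags.
const urlPattern = /(\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|])/gi;

// Hypothetical helper: returns every full match, or an empty array when there are none.
function extractUrls(text) {
  return text.match(urlPattern) || [];
}

const questionSample =
  "My site is http://www.abcd.com, and yours is http://www.def.ghi/jkl. " +
  "Is Fred's https://www.xyz.com? Or is it http://www.xxx.com?abc=def? " +
  "(I thought his site was http://www.mmm.com), but clearly it's not.";

console.log(extractUrls(questionSample));
// [ "http://www.abcd.com", "http://www.def.ghi/jkl", "https://www.xyz.com",
//   "http://www.xxx.com?abc=def", "http://www.mmm.com" ]
```

The trick is the second character class: it omits `?`, `!`, `:`, `,`, `.` and `;`, so the match must end on a character that is unlikely to be sentence punctuation, and the greedy first class backtracks until it does.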
Note to any Actionscript Developers reading this: don't make the mistake I did and use the RegExp constructor with this regular expression. In other words, don't do this: var re : RegExp = new RegExp("(\b... etc", "gi"); The string passed to the constructor will choke on some of the escape characters. Instead, use the literal notation: var re : RegExp = /(\b ... etc/gi; — Marcus Geduld, Jun 27 '11 at 15:16
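The constructor pitfall described above can be reproduced in JavaScript. This is a small sketch, assuming ActionScript string literals process backslash escapes the same way. The shortened patterns are illustrative only, not the full expression from the thread.

```javascript
// In a string literal, "\b" is a backspace character (U+0008), not a word boundary,
// so this regex looks for a backspace followed by "http://" or "https://".
const brokenFromString = new RegExp("\bhttps?://", "i");

// Doubling the backslash passes the two characters \b through to the regex engine.
const fixedFromString = new RegExp("\\bhttps?://", "i");

// The literal form needs no doubling, but forward slashes must be escaped.
const fromLiteral = /\bhttps?:\/\//i;

console.log(brokenFromString.test("see http://abcd.com")); // false
console.log(fixedFromString.test("see http://abcd.com"));  // true
console.log(fromLiteral.test("see http://abcd.com"));      // true
```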
Marcus, I think you'd be better off using a simple approach rather than trying to come up with a complete and robust search expression. Dommer's link is great, I agree. But just have a look at the structure of a URI: http://www.ietf.org/rfc/rfc2396.txt http://rfc-ref.org/RFC-TEXTS/2396/chapter12.html Human-written URIs are almost never well-formed. If you want to eliminate punctuation, aim for the question mark at the end, and that's all. — Michael Antipin, Jun 27 '11 at 20:50

Tom Chantler · Answer 1 · 2011-06-27T18:22:05.673

1

Have a look here: http://www.regexguru.com/2008/11/detecting-urls-in-a-block-of-text/

EDIT: shanethehat and divillysausages also mentioned this link: http://gskinner.com/RegExr/ which I hadn't seen before and which features online evaluation (in other words, you can tune your regex without firing up your coding IDE, which is awesome). Thanks!

edited Jun 27 '11 at 18:22

answered Jun 27 '11 at 14:32

Tom Chantler


Looks like a great resource. Thanks! – Marcus Geduld Jun 27 '11 at 14:41
make sure and check out http://gskinner.com/RegExr/ as well, where you can test things in real-time. it also has community submitted ones, which can set you in the right direction (i.e. first url one is `/(((f|ht){1}tp://)[-a-zA-Z0-9@:%_\+.~#?&//=]+)/g` which gets pretty close to what you want) – divillysausages Jun 27 '11 at 15:34
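For comparison, here is a JavaScript sketch of how the community pattern quoted above behaves on the example from the question. It assumes the forward slashes are escaped so that it parses as a regex literal, since the quoted form would not parse as written in JavaScript.

```javascript
// The quoted community pattern, with "/" escaped for use in a literal.
const communityPattern = /(((f|ht){1}tp:\/\/)[-a-zA-Z0-9@:%_\+.~#?&\/\/=]+)/g;

const communitySample =
  "My site is http://www.abcd.com, and yours is http://www.def.ghi/jkl. " +
  "Is Fred's https://www.xyz.com? Or is it http://www.xxx.com?abc=def? " +
  "(I thought his site was http://www.mmm.com), but clearly it's not.";

console.log(communitySample.match(communityPattern));
// [ "http://www.abcd.com", "http://www.def.ghi/jkl.",
//   "http://www.xxx.com?abc=def?", "http://www.mmm.com" ]
```

Two gaps show up on this input: a trailing `.` or `?` stays attached because both are in the character class, and the https URL is skipped because the scheme part only accepts `ftp` and `http`.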
That's a great link. Thanks! I will incorporate it into my answer for posterity. – Tom Chantler Jun 27 '11 at 18:20

score 0 · Answer 2 · edited May 23 '17 at 12:28

First off, rolling your own regexp to parse URLs is a terrible idea. This is a common enough problem that you should assume someone has already written, debugged and tested a library for it, according to the RFCs. There are a ton of edge cases when it comes to parsing URLs: international domain names, actual (.museum) vs. nonexistent (.jpg) TLDs, weird punctuation including parentheses, punctuation at the end of the URL, etc.

I've looked at a ton of libraries, and they all have their downsides. See a comparison of JavaScript URL parsing libraries here.

If you want a regular expression, the one in Component is quite comprehensive.
