3

I know there have been many questions asking for help converting URLs to clickable links in strings, but I haven't found quite what I'm looking for.

I want to be able to match any of the following examples and turn them into clickable links:

http://www.domain.com
https://www.domain.net
http://subdomain.domain.org
www.domain.com/folder
subdomain.domain.net
subdomain.domain.edu/folder/subfolder
domain.net
domain.com/folder

I do not want to match random.stuff.separated.with.periods.

EDIT: Please keep in mind that these URLs need to be found within larger strings of 'normal' text. For example, I want to match 'domain.net' in "Hello! Come check out domain.net!".

I think this could be accomplished with a regex that can determine whether the matching url contains .com, .net, .org, or .edu followed by either a forward slash or whitespace. Other than a user typo, I can't imagine any other case in which a valid URL would have one of those followed by anything else.

I realize there are many valid domain extensions out there, but I don't need to support them all. I can just choose which to support with something like (com|net|org|edu) in the regex. Unfortunately, I'm not skilled enough with regex yet to know how to properly implement this.

I'm hoping someone can help me find a regular expression (for use with PHP's preg_replace) that can match URLs based on just about any text connected by one or more dots and either ending with one of the specified extensions followed by whitespace OR containing one of the specified extensions followed by a slash and possibly folders.

I did several searches and so far have not found what I'm looking for. If there already exists a SO post that answers this, I apologize.

Thanks in advance.

--- EDIT 3 ---

After days of trial and error and some help from SO, here's what works:

preg_replace_callback('#(\s|^)((https?://)?(\w|-)+(\.(\w+|-)*)+(?<=\.net|\.org|\.edu|\.com|\.cc|\.br|\.jp|\.dk|\.gs|\.de)(\:[0-9]+)?(?:/[^\s]*)?)(?=\s|\b)#is',
                function ($m) {
                    // Prepend http:// when the matched URL has no scheme
                    $href = preg_match('#^https?://#i', $m[2]) ? $m[2] : 'http://' . $m[2];
                    return $m[1] . '<a href="' . $href . '">' . $m[2] . '</a>';
                },
                $event_desc);

This is a modified version of anubhava's code below and so far seems to do exactly what I want. Thanks!

vertigoelectric

3 Answers

4

You can use this regex:

#(\s|^)((?:https?://)?\w+(?:\.\w+)+(?<=\.(net|org|edu|com))(?:/[^\s]*|))(?=\s|\b)#is

Code:

$arr = array(
'http://www.domain.com/?foo=bar',
'http://www.that"sallfolks.com',
'This is really cool site: https://www.domain.net/ isn\'t it?',
'http://subdomain.domain.org',
'www.domain.com/folder',
'Hello! You can visit vertigofx.com/mysite/rocks for some awesome pictures, or just go to vertigofx.com by itself',
'subdomain.domain.net',
'subdomain.domain.edu/folder/subfolder',
'Hello! Check out my site at domain.net!',
'welcome.to.computers',
'Hello.Come visit oursite.com!',
'foo.bar',
'domain.com/folder',

);
foreach ($arr as $url) {
   $link = preg_replace_callback('#(\s|^)((?:https?://)?\w+(?:\.\w+)+(?<=\.(net|org|edu|com))(?:/[^\s]*|))(?=\s|\b)#is',
           function ($m) {
               // Prepend http:// when the matched URL has no scheme
               $href = preg_match('#^https?://#i', $m[2]) ? $m[2] : 'http://' . $m[2];
               return $m[1] . '<a href="' . $href . '">' . $m[2] . '</a>';
           },
           $url);
   echo $link . "\n";
}

OUTPUT:

<a href="http://www.domain.com/?foo=bar">http://www.domain.com/?foo=bar</a>
http://www.that"sallfolks.com
This is really cool site: <a href="https://www.domain.net/">https://www.domain.net/</a> isn't it?
<a href="http://subdomain.domain.org">http://subdomain.domain.org</a>
<a href="http://www.domain.com/folder">www.domain.com/folder</a>
Hello! You can visit <a href="http://vertigofx.com/mysite/rocks">vertigofx.com/mysite/rocks</a> for some awesome pictures, or just go to <a href="http://vertigofx.com">vertigofx.com</a> by itself
<a href="http://subdomain.domain.net">subdomain.domain.net</a>
<a href="http://subdomain.domain.edu/folder/subfolder">subdomain.domain.edu/folder/subfolder</a>
Hello! Check out my site at <a href="http://domain.net">domain.net</a>!
welcome.to.computers
Hello.Come visit <a href="http://oursite.com">oursite.com</a>!
foo.bar
<a href="http://domain.com/folder">domain.com/folder</a>

PS: This regex supports only the http and https schemes. To support others, such as ftp, you need to modify the regex a little.
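
As a rough illustration of that modification, the sketch below widens the optional scheme group to also accept ftp. The function name linkify() and the sample sentence are made up for this example, and it uses an anonymous function in place of create_function; treat it as one possible adaptation rather than a tested drop-in.

```php
<?php
// Hypothetical helper: linkify() is an illustrative name, not part of the answer above.
function linkify(string $text): string
{
    // Same pattern as above, with the scheme group widened from https? to (https?|ftp).
    $pattern = '#(\s|^)((?:(?:https?|ftp)://)?\w+(?:\.\w+)+(?<=\.(net|org|edu|com))(?:/[^\s]*|))(?=\s|\b)#is';

    $result = preg_replace_callback($pattern, function (array $m): string {
        // Keep an explicit scheme if present; otherwise default to http://
        $href = preg_match('#^(?:https?|ftp)://#i', $m[2]) ? $m[2] : 'http://' . $m[2];
        return $m[1] . '<a href="' . $href . '">' . $m[2] . '</a>';
    }, $text);

    // preg_replace_callback() returns null on a regex error; fall back to the input.
    return $result ?? $text;
}

echo linkify('Mirror: ftp://files.domain.org/pub and domain.net'), "\n";
```

The callback also has to recognise the new scheme, otherwise ftp links would get http:// prepended.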

anubhava
  • Thank you. Unfortunately, this does not detect the URLs at all. Please keep in mind that the URLs need to be found within blocks of normal text. For example, I need to match 'domain.net' in something like "Hello! Check out my site at domain.net!" – vertigoelectric Apr 11 '12 at 18:48
  • 1
    From your example in question it appeared that you just had a list of URLs. Anyway now it is clear, I just edited pls check my answer now. – anubhava Apr 11 '12 at 19:06
  • Thanks. Your new regex is much closer, except there are two problems. URLs without the http:// are converted to relative links, so for example a URL for something.domain.net would turn into a link like http://www.vertigofx.com/something.domain.net (assuming I had the page hosted at 'www.vertigofx.com') instead of its own absolute link. Also when it matched a URL with a folder path it included some other stuff after it. – vertigoelectric Apr 13 '12 at 23:43
  • Alright I edited my answer again as per your comments, pls check it now. – anubhava Apr 14 '12 at 12:18
  • That works much better, and the http detection is great, but there is still a flaw. It seems that when it encounters a URL with a path (such as domain.com/a/path/like/this) it matches it PLUS everything after it until it reaches another URL. For example, if I had "Hello! You can visit vertigofx.com/mysite/rocks for some awesome pictures, or just go to vertigofx.com by itself" then it would match "vertigofx.com/mysite/rocks for some awesome pictures, or just go to vertigofx.com" as one big link. Still giving you an upvote for including http detection. – vertigoelectric Apr 16 '12 at 17:12
  • Are you sure it is behaving that way? I just gave input string `Hello! You can visit vertigofx.com/mysite/rocks for some awesome pictures, or just go to vertigofx.com by itself` to my code and it generated this output: `Hello! You can visit vertigofx.com/mysite/rocks for some awesome pictures, or just go to vertigofx.com by itself` – anubhava Apr 16 '12 at 17:34
  • Here is the online verification of my code with your input: http://ideone.com/MWlAk – anubhava Apr 16 '12 at 17:36
  • 2
    I figured out a conflicting issue. I was using nl2br() before doing preg_replace, so there were <br /> tags at the end of lines, which generally came directly after the URL. I fixed this and your regex is working better, but it's still not ideal. If I type "welcome.to.computers" it matches "welcome.to.com" and it should not. I realize it's unlikely that someone would type the right combination of dots and letters to create a false URL, but there must be a way to fix it. Can you make it so it MUST end with whitespace or forward slash in order to match? – vertigoelectric Apr 16 '12 at 21:32
  • That's a good point. I edited my answer and the example to avoid this particular case. – anubhava Apr 16 '12 at 21:37
  • Actually, I was just doing some experimenting on my own and a little extra research. I found another regex for matching URLs and took the part that I figured matched the path and added it like this: `'#(https?://)?\w+(\.\w+)+(?<=\.(net|org|edu|com))(\s|/[a-zA-Z0-9\&%_\./-~-]*)#'`. So far, it seems to match everything I want and nothing I don't want. I'm going to try to throw some tricky URLs at it to see if I can break it. If not, I'll select your answer as accepted since I'm using a modified version of your example. – vertigoelectric Apr 16 '12 at 22:02
  • Actually problem with your suggested regex is that it won't match any text ending with a URL, e.g. it won't convert `http://subdomain.domain.org` since it expects at least a space after URL. – anubhava Apr 16 '12 at 22:10
  • Check the edited code and example with URL `http://www.that"sallfolks.com` which is not being converted to links. – anubhava Apr 17 '12 at 04:53
  • I've used your code in combination with some of my own modifications and so far it's working well. I've updated my post to reflect the new code. I added support for dashes in the domain/subdomain as well as optional port numbers and some more domain extensions. I had to remove () around the extensions in order to allow extensions of different lengths (com vs cc). Anyway, thank you for the help! – vertigoelectric Apr 17 '12 at 15:44
  • Another update for your minor problem you reported, pls check. – anubhava Apr 17 '12 at 16:45
  • Adding the `(?=\s|\b)` at the end seemed to do the trick. All intended URLs now seem to be converted properly. Thank you so much! – vertigoelectric Apr 18 '12 at 02:41
1
'/(http(s)?:\/\/)?[\w\/\.]+(\.((com)|(edu)|(net)|(org)))[\w\/]*/'

That works for your examples. You might want to add extra characters support for "-", "&", "?", ":", etc in the last bracket.

'/(http(s)?:\/\/)?[\w\/\.]+(\.((com)|(edu)|(net)|(org)))[\w\/\?=&-;]*/'

This will support parameters and port numbers.

e.g. www.foo.com:8888/test?param1=val1&param2=val2
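
One detail worth spelling out: inside the character class, `&-;` is read as a range from `&` to `;`, not as three literal characters. That range happens to include `:`, `.`, and the digits, which is why port numbers get through. The small check below is only a sketch to show the difference; the variable names are made up for the example.

```php
<?php
// '&-;' is a range covering ASCII 0x26 ('&') through 0x3B (';'), so ':' (0x3A) is inside it.
$rangeMatchesColon = preg_match('/^[&-;]$/', ':');

// With the hyphen placed last it is a literal '-', and ':' is no longer matched.
$literalMatchesColon = preg_match('/^[&;-]$/', ':');

var_dump($rangeMatchesColon);   // int(1)
var_dump($literalMatchesColon); // int(0)
```

If only the literal characters are wanted, putting the hyphen last (or escaping it) and listing `:` explicitly makes the intent clearer.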

Finch_Powers
  • Thanks for the help. I tried your second example and got a PHP warning about an unknown modifier "?" in preg_replace – vertigoelectric Apr 11 '12 at 18:50
  • Then try '/(http(s)?:\/\/)?[\w\/\.]+(\.((com)|(edu)|(net)|(org)))[\w\/\?=&-;]*/' – Finch_Powers Apr 11 '12 at 20:46
  • This works almost perfectly! Thank you. The only problem now is one of the problems I had with anubhava's solution. URLs without the 'http://' at the beginning are coming up as relative links. Of course, I could probably do a test for whether that's there and add it if it's not. I think this will do. I'll let you know if any problems come up. EDIT: Ah, just found a problem. The regex matched a string ending in .edue when it should not. How can I modify the regex so it requires the extension to be followed immediately by a space or a slash? – vertigoelectric Apr 13 '12 at 23:50
  • I went with preg_replace_callback() to include a function that added "http://" to links that were missing it. Now I just need some help with your regex in regards to matching the URLs with those extensions ONLY if the extensions are followed by space or slash (to prevent matching stuff like the ".Com" in "Hello.Come visit oursite.com!" <-- Oops. I just realized that if the URL is at the end of a sentence, it may be followed by punctuation. Could be trickier than I thought. Any ideas? Perhaps it can match an extension followed by space or any non-letter. – vertigoelectric Apr 14 '12 at 01:03
  • '/(http(s)?:\/\/)?[\w\/\.]+(\.((com)|(edu)|(net)|(org)))[^\w.]{1}[\w\/\?=&-;]*/' should block "edue". For the last character, you can just verify it I guess. – Finch_Powers Apr 16 '12 at 15:08
  • Actually, that didn't work. It also messed up the rest of the URL detection. What can I add in there that would require a whitespace or forward slash after the extension to make a match? Thanks again for all of your help, by the way. – vertigoelectric Apr 16 '12 at 17:02
  • (http(s)?:\/\/)?[\w\/\.]+(\.((com)|(edu)|(net)|(org)))[\w\/\?=&-;]*[\/ ] – Finch_Powers Apr 16 '12 at 18:45
  • Thanks again, but unfortunately that only matches URLs with one or two levels of subfolders. It matches "www.vertigofx.com/this/" and "www.vertigofx.com/this/andthis" but NOT "www.vertigofx.com" or the "YYY" in "www.vertigofx.com/x/xx/YYY". – vertigoelectric Apr 16 '12 at 21:20
0

Thanks a ton. I modified his final solution to allow all domains (.ca, .co.uk), not just the specified ones.

$html = preg_replace_callback('#(\s|^)((https?://)?(\w|-)+(\.[a-z]{2,3})+(\:[0-9]+)?(?:/[^\s]*)?)(?=\s|\b)#is',
    function ($m) {
        // Prepend http:// when the matched URL has no scheme; open links in a new tab
        $href = preg_match('#^https?://#i', $m[2]) ? $m[2] : 'http://' . $m[2];
        return $m[1] . '<a href="' . $href . '" target="_blank">' . $m[2] . '</a>';
    },
    $url);
Mark Shust at M.academy
  • I'm not that great with regular expressions. How did you manage to determine the difference between the end of a valid domain name and the end/beginning of poorly typed sentences? For example, the difference between "subdomain.domain.ca" and "Hey.Hello.Can you read this?" – vertigoelectric Jul 18 '12 at 14:14