16

I'm trying to find a reliable way to extract a URL from a string. I have a site where users answer questions, and in the source box, where they cite their source of information, I allow them to enter a URL. I want to extract that URL and turn it into a hyperlink, similar to how Yahoo Answers does it.

Does anyone know a reliable solution that can do this?

All the solutions I have found work for some URLs but not for others.

Thanks

Jonah
Jack Harvin

5 Answers

22

John Gruber has spent a fair amount of time perfecting the "one regex to rule them all" for link detection. Used with preg_replace(), as mentioned in the other answers, the following regex should be one of the most accurate methods, if not the most accurate, for detecting a link:

(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))

If you only wanted to match HTTP/HTTPS:

(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
  • 4
    For anyone who wants all the sub-patterns converted to be non capturing, and the forward slashes escaped: \b(?:(?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\((?:[^\s()<>]+|(?:\([^\s()<>]+\)))*\))+(?:\((?:[^\s()<>]+|(?:\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])) – Highly Irregular Jan 09 '12 at 01:27
  • TLDs may have much more than 4 characters, see: http://www.iana.org/domains/root/db – Toto Sep 30 '15 at 16:29
  • 3
    And how do we use this regex within preg? I mean, because it has `"` and `'` the code doesn't work properly, like: `preg_match('(?i)\b......]))', $str)` - all code seems like it is commented. – Linesofcode Feb 14 '16 at 14:51
  • Not working. preg_match and preg_match_all fail every time, even after removing the single/double quotes – Aakash Sahai Jul 20 '19 at 08:46
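The comments above ask how to pass this pattern to the preg functions. One likely cause of the failures is that PCRE patterns in PHP need delimiters, and the pattern itself contains `/`, `'`, `"` and a backtick. A minimal sketch, assuming PHP 7.3 or later and a UTF-8 source file: the pattern is held in a nowdoc so nothing needs escaping, wrapped in `~` delimiters, and given the `u` modifier because of the typographic quotes in the last character class. The function name `extractUrls` is hypothetical, not part of the original answer.

```php
<?php
// Gruber's pattern, copied verbatim from the answer. A nowdoc keeps the
// quotes, backtick and backslashes exactly as written.
$gruber = <<<'REGEX'
(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
REGEX;

// "~" does not occur in the pattern, so it is safe as a delimiter.
// The "u" modifier makes PCRE treat pattern and subject as UTF-8.
$pattern = '~' . $gruber . '~u';

// Hypothetical helper: returns every full match found in $text.
function extractUrls(string $text, string $pattern): array
{
    preg_match_all($pattern, $text, $matches);
    return $matches[0];
}
```

Calling `extractUrls('See http://example.com/page, and www.example.org.', $pattern)` should return the two links without the trailing comma and period.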
3
$string = preg_replace('/https?:\/\/[^\s"<>]+/', '<a href="$0" target="_blank">$0</a>', $string);

It only matches http/https, but that's really the only protocol you want to turn into a link. If you want others, you can change it like this:

$string = preg_replace('/(https?|ssh|ftp):\/\/[^\s"<>]+/', '<a href="$0" target="_blank">$0</a>', $string);
Jonah
  • 1
    You might also want to exclude `<` or apply `htmlspecialchars` on the matched string to avoid code injection. – Gumbo Dec 08 '10 at 18:05
  • Good, but if you look at the expression, it allows anything but white-space and `"`. I believe that eliminates any HTML injection. – Jonah Dec 08 '10 at 18:06
  • 1
    Bron: No, you are using the matched value not just as attribute value but also as the elements text content. – Gumbo Dec 08 '10 at 18:11
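The escaping concern raised in these comments can be handled by running every piece of the text through `htmlspecialchars`, both the matched URLs and the text between them. This is only a sketch under that assumption, not the answerer's code: it reuses the http/https pattern from the answer, and the function name `linkify` is hypothetical.

```php
<?php
// Hypothetical helper: turns http/https URLs into links and escapes
// everything, so user-supplied markup cannot reach the page.
function linkify(string $text): string
{
    // With PREG_SPLIT_DELIM_CAPTURE and one capture group, the result
    // alternates: plain text at even indexes, matched URLs at odd indexes.
    $parts = preg_split('~(https?://[^\s"<>]+)~i', $text, -1, PREG_SPLIT_DELIM_CAPTURE);

    $out = '';
    foreach ($parts as $i => $part) {
        $safe = htmlspecialchars($part, ENT_QUOTES, 'UTF-8');
        $out .= ($i % 2 === 1)
            ? '<a href="' . $safe . '" target="_blank">' . $safe . '</a>'
            : $safe;
    }
    return $out;
}
```

Splitting first and escaping afterwards means an `&` inside a URL is matched as part of the URL and then encoded as `&amp;` in both the attribute and the link text.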
2

URLs have a lot of edge cases. For example, a URL could contain brackets, or it could lack a protocol. That's why a single regex is not enough.

I created a PHP library that deals with many of these edge cases: Url highlight.

You can extract URLs from a string or highlight them directly.
Example:

<?php

use VStelmakh\UrlHighlight\UrlHighlight;

$urlHighlight = new UrlHighlight();

// Extract urls
$urlHighlight->getUrls("This is example http://example.com.");
// return: ['http://example.com']

// Make urls as hyperlinks
$urlHighlight->highlightUrls('Hello, http://example.com.');
// return: 'Hello, <a href="http://example.com">http://example.com</a>.'

For more details see readme. For covered url cases see test.

vstelmakh
0

Yahoo! Answers does a fairly good job of identifying links when the link is written properly and separated from other text, but it isn't very good at stripping trailing punctuation. For example, "The links are http://example.com/somepage.php, http://example.com/somepage2.php, and http://example.com/somepage3.php." will include the commas on the first two links and the period on the third.

But if that is acceptable, then a pattern like this should do it (note that \< and \> are GNU-style word boundaries; PCRE, and therefore PHP's preg functions, treats them as literal < and > and uses \b instead):

\<http:[^ ]+\>

It looks like Stack Overflow's parser is better. Is it open source?
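If the trailing punctuation is not acceptable, one simple approach is to match greedily and then trim sentence punctuation from the end of each match. The sketch below is an assumption on my part rather than what Yahoo! Answers or Stack Overflow actually do, and the function name `extractUrlsTrimmed` is hypothetical.

```php
<?php
// Hypothetical helper: grabs every run of non-whitespace that starts with
// http:// or https://, then strips trailing sentence punctuation.
function extractUrlsTrimmed(string $text): array
{
    preg_match_all('~https?://\S+~i', $text, $matches);

    return array_map(function ($url) {
        // rtrim() removes any of these characters from the end only.
        return rtrim($url, '.,;:!?');
    }, $matches[0]);
}
```

Applied to the example sentence above, this should return the three URLs without the commas or the final period. The trade-off is that a URL that legitimately ends in one of those characters loses it.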

wallyk
  • 56,922
  • 16
  • 83
  • 148
-1

This code worked for me.

function makeLink($string) {
    /*** make sure there is an http:// on all URLs, including one at the very start of the string ***/
    $string = preg_replace("/(^|[^\w\/])(www\.[a-z0-9\-]+\.[a-z0-9\-]+)/i", "$1http://$2", $string);
    /*** make all URLs links ***/
    $string = preg_replace("/([\w]+:\/\/[\w-?&;#~=\.\/\@]+[\w\/])/i", "<a target=\"_blank\" href=\"$1\">$1</a>", $string);
    /*** make all emails hot links; TLDs may be longer than 3 letters ***/
    $string = preg_replace("/([\w-?&;#~=\.\/]+\@(\[?)[a-zA-Z0-9\-\.]+\.([a-zA-Z]{2,}|[0-9]{1,3})(\]?))/i", "<a href=\"mailto:$1\">$1</a>", $string);

    return $string;
}
Paras Dalsaniya
  • 1
    Why are you limiting tld to 3 characters? Have a look at: http://www.iana.org/domains/root/db – Toto Sep 30 '15 at 16:28