Regex to linkify URLs

Question

I currently have the following regex to capture link text and a URL in the following format:

[Link](http://link.com)

\[(.+)]\(((https?:\/\/(?:www\.|(?!www))[^\s\.]+\.[^\s]{2,}|www\.[^\s]+\.[^\s]{2,}))\)

When I add a second expression afterwards to linkify bare URLs, it breaks the links in the above format.

Is there a single regular expression that handles both cases?

http://link.com -> <a href="http://link.com" target="_blank">http://link.com</a>

[Link](http://link.com) -> <a href="http://link.com" target="_blank">Link</a>

PHP:

$string = preg_replace('/\[(.+)]\(((https?:\/\/(?:www\.|(?!www))[^\s\.]+\.[^\s]{2,}|www\.[^\s]+\.[^\s]{2,}))\)/', '<a href="$2" target="_blank">$1</a>', $string);

Obligatory ["You can't parse HTML with regex"](http://stackoverflow.com/a/1732454/1270789) link. — Ken Y-N, Jun 16 '16 at 00:58
[This thread](http://stackoverflow.com/questions/6427530/regular-expression-pattern-to-match-url-with-or-without-http-www) might help you match stand alone URLs. — HamZa, Jun 16 '16 at 01:08

Casimir et Hippolyte · Answer 1 · 2016-06-16T01:42:45.660

There is no reliable way to identify a URL in a string, because URL syntax can be very complicated (too complicated to capture in a readable pattern). In other words, you have to accept that anything that looks like [...](...) is a link, without trying to verify that the content between ( and ) really is a URL. (You can always run parse_url on it afterwards, but keep in mind that it may reject valid URLs.)

What you are looking for is:

// The URL part needs its own capture group so that $2 refers to it.
$result = preg_replace('~\[([^]]*)]\(([^)]*)\)~', '<a href="$2" target="_blank">$1</a>', $str);

// If you want to hunt lonely urls in your text, you can always search
// after extracting text nodes with XPath and a naive pattern like this:

$dom = new DOMDocument;
$dom->loadHTML($result);

$xp = new DOMXPath($dom);
$textNodes = $xp->query('//text()');

foreach ($textNodes as $textNode) {
    $textNode->nodeValue = preg_replace('~[hw](?:(?<=\bh)ttps?://|(?<=\bw)ww\.)\S+~i', '<a href="$0" target="_blank">$0</a>', $textNode->nodeValue);
}

$result = $dom->saveHTML();
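
One caveat with the loop above: a string assigned to nodeValue is stored as plain text, so the <a ...> markup comes back escaped from saveHTML(). Below is a minimal sketch of the same idea that builds real <a> elements instead. The function name linkify_text_nodes, the wrapping <div>, the libxml flags and the not(ancestor::a) filter are assumptions added for illustration and are not part of the answer. The input is assumed to be ASCII, since loadHTML needs an explicit charset hint for anything else.

```php
<?php
// Hypothetical helper (not from the answer): links bare URLs found in the
// text nodes of an HTML fragment by creating <a> elements in the DOM.
function linkify_text_nodes(string $html): string
{
    $dom = new DOMDocument;
    // Wrap the fragment so it has a single root and no <html>/<body> is implied.
    $dom->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

    $xp = new DOMXPath($dom);
    // Skip text that already sits inside a link to avoid nested <a> elements.
    $textNodes = $xp->query('//text()[not(ancestor::a)]');

    // Same naive pattern as above, wrapped in a capture group for preg_split.
    $pattern = '~([hw](?:(?<=\bh)ttps?://|(?<=\bw)ww\.)\S+)~i';

    foreach ($textNodes as $textNode) {
        // With PREG_SPLIT_DELIM_CAPTURE the matched URLs land at odd indexes.
        $parts = preg_split($pattern, $textNode->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue; // no URL in this text node
        }

        $fragment = $dom->createDocumentFragment();
        foreach ($parts as $i => $part) {
            $node = $dom->createTextNode($part);
            if ($i % 2 === 1) {
                $link = $dom->createElement('a');
                $link->setAttribute('href', $part);
                $link->setAttribute('target', '_blank');
                $link->appendChild($node);
                $node = $link;
            }
            $fragment->appendChild($node);
        }
        $textNode->parentNode->replaceChild($fragment, $textNode);
    }

    // Serialize only the children of the wrapper <div>.
    $out = '';
    foreach ($dom->documentElement->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}
```

Calling linkify_text_nodes($result) on the output of the first preg_replace should leave the href attributes alone, because attribute values are not text nodes.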

Note: for better results, if you absolutely want to check the URL, you can use the same pattern with preg_replace_callback: remove the last character of the match until parse_url accepts it, then perform the replacement. It will not perform well, though.
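
A rough sketch of what that note describes, under a few assumptions: the function name linkify_checked is made up for illustration, "parse_url works" is read as "parse_url does not return false", and the text is assumed to be HTML-safe already, so nothing is escaped. Keep in mind that parse_url is lenient and is documented to return false only for seriously malformed URLs, so in practice the loop rarely trims anything.

```php
<?php
// Hypothetical helper (not from the answer): shorten each candidate from the
// right until parse_url() accepts it, then link only the accepted part.
function linkify_checked(string $text): string
{
    // Same naive pattern as above: "http://", "https://" or "www." plus non-spaces.
    $pattern = '~[hw](?:(?<=\bh)ttps?://|(?<=\bw)ww\.)\S+~i';

    return preg_replace_callback($pattern, function (array $m): string {
        $url = $m[0];
        $rest = '';

        // Move one trailing character at a time into $rest while parse_url() fails.
        while ($url !== '' && parse_url($url) === false) {
            $rest = substr($url, -1) . $rest;
            $url = substr($url, 0, -1);
        }

        if ($url === '') {
            return $m[0]; // nothing parseable: leave the text untouched
        }

        // Assumption: $text is already HTML-safe, so no escaping is applied here.
        return '<a href="' . $url . '" target="_blank">' . $url . '</a>' . $rest;
    }, $text);
}
```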

My code was working for [...](...) though... what I meant was that doing the lonely-url preg_replace ended up breaking the ones in the [...](...) format. I can't seem to get yours working either. — frosty, Jun 16 '16 at 01:17
@frosty: I have edited my answer, you must now apply this last pattern only on text nodes after extracting them with XPath to avoid the problem. — Casimir et Hippolyte, Jun 16 '16 at 01:20

score 0 · Answer 2 · answered Jun 16 '16 at 01:36

Maybe this helps you a bit:

/**
 * Linkify Function
 * @param $tweet
 * @return mixed
 */
function linkify_tweet($tweet)
{
    // Convert URLs to <a> links (here: mailto links carrying the URL in the subject)
    $tweet = preg_replace("/([\w]+\:\/\/[\w-?&;#~=\.\/\@]+[\w\/])/", "<a href=\"mailto:w2m@bachecubano.com?subject=WEB $1\">$1</a>", $tweet);

    // Convert hashtags to <a> links (the href is a "#" placeholder)
    $tweet = preg_replace("/#([A-Za-z0-9\/\.]*)/", "<a href=\"#\">#$1</a>", $tweet);

    // Convert @mentions to <a> links (here: mailto links carrying the name in the subject)
    $tweet = preg_replace("/@([A-Za-z0-9\/\.]*)/", "<a href=\"mailto:w2m@bachecubano.com?subject=MSG @$1\" class=\"userlink\">@$1</a>", $tweet);

    return $tweet;
}

score 0 · Answer 3 · answered Jun 16 '16 at 03:01

Deal with the Markdown syntax first. Then catch the plain links that were not processed; you can use a similar regexp, but without the parentheses. If you want to replace everything that looks like a URL and is preceded by whitespace (so URLs inside the generated HTML won't match), then this will do:

\s(https?:\/\/(?:www\.|(?!www))[^\s.]+\.[^\s]{2,}|www\.[^\s]+\.[^\s]{2,})
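
A minimal sketch of that two-pass order, assuming PHP as in the question. The function name linkify_two_pass is hypothetical. The second pattern swaps the leading \s for a (?<!\S) lookbehind so that a URL at the very start of the string also matches and the whitespace itself is not consumed; that swap is an assumption of this sketch, not part of the answer. The Markdown pattern is the simplified one from the first answer.

```php
<?php
// Hypothetical helper illustrating the two-pass order described above.
function linkify_two_pass(string $text): string
{
    // Pass 1: Markdown links of the form [text](url).
    $text = preg_replace(
        '~\[([^]]*)]\(([^)]*)\)~',
        '<a href="$2" target="_blank">$1</a>',
        $text
    );

    // Pass 2: bare URLs, only when they start the string or follow whitespace.
    // URLs written by pass 1 follow a quote or ">" and are therefore skipped.
    return preg_replace(
        '~(?<!\S)(?:https?://(?:www\.|(?!www))[^\s.]+\.\S{2,}|www\.\S+\.\S{2,})~',
        '<a href="$0" target="_blank">$0</a>',
        $text
    );
}
```

For example, linkify_two_pass('http://link.com and [Link](http://link.com)') links both occurrences without touching the href produced by the first pass. Trailing punctuation such as a final "." still ends up inside the link, and a bare URL preceded by a space inside Markdown link text would be linked a second time, so this remains a rough approximation.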
