regular expression to parse URLs to links, but only if they are not links yet

Question

We use the following regular expression to convert URLs in text to links, which are shortened with an ellipsis in the middle if they are too long:

/**
 * Replace all links with <a> tags (shortening them if needed)
 */
$match_arr[] = '/((http|ftp)+(s)?:\/\/[^<>\s,!\)]+)/ie';
$replace_arr[] = "'<a href=\"\\0\" title=\"\\0\" target=\"_blank\">' . " .
    "( mb_strlen( '$0' ) > {$maxlength} ? mb_substr( '$0', 0, " . ( $maxlength / 2 ) . " ) . '…' . " .
    "mb_substr( '$0', -" . ( $maxlength / 2 ) . " ) : '$0' ) . " .
    "'</a>'";

This is working. However, I found that if there is a link in the text already, like:

$text = '... <a href="http://www.google.com">http://www.google.com</a> ...';

it will match both URLs, so it will try to create two more <a> tags, totally messing up the DOM of course.

How can I prevent the regex from matching if the link is already inside an <a> tag? It will also be in the title attribute, so basically I just want to skip every <a> tag completely.

How about stripping all anchor-tags first instead of having to overcomplicate your current regexp? — ccKep, Jul 01 '13 at 12:52
_Or_ how about parsing the DOM, processing the nodeValues, and _skipping_ all `a` tags, which is how we process markup: by parsing it — Elias Van Ootegem, Jul 01 '13 at 12:53
That's a good idea, but how would you do it? [`strip_tags`](http://php.net/manual/en/function.strip-tags.php) only allows for a whitelist of tags.. — Rijk, Jul 01 '13 at 12:53
Did you know that the `/e` modifier [has been deprecated](http://php.net/manual/de/reference.pcre.pattern.modifiers.php) for quite a while now? — Tim Pietzcker, Jul 01 '13 at 12:55
@EliasVanOotegem I feel this would be overkill.. I know how to do DOM parsing, but this feels like a perfect application of regular expressions; find text that matches a URL and convert them into something else. I don't know how to do this with DOM parsing; http://stackoverflow.com/questions/16346961/php-autolink-if-not-already-linked also uses a regex in the end. To set up a complete DOM parser, just to be able to skip over links.. I don't know. — Rijk, Jul 01 '13 at 13:03
@Rijk Yes, it's using a regex, but the context in which the regex is used is very different from yours. In that question, you're applying the regex to nodes for which A) you know they are text nodes within the DOM (eliminating the possibility of identifying links within `<a>` tag attributes) and B) you know that the text node does not sit inside an `<a>` tag in the DOM. Those are the two pieces of information required to ensure you are only identifying links that aren't already linked. A regex applied to the raw markup, which is more brittle, has to try to recover the exact same info. — nickb, Jul 01 '13 at 13:48
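
The DOM-based approach suggested in these comments could look roughly like the sketch below. It is not code from the thread: the function name `autolinkDom()` and the wrapper `<div>` used to load a fragment are assumptions, it relies on PHP's DOM extension, it skips the shortening step, and it does no special handling of non-ASCII input.

```php
<?php
// Sketch only: link URLs in text nodes that are not inside an <a> element.
// autolinkDom() is a hypothetical helper name, not from the original post.
function autolinkDom(string $html): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);            // tolerate imperfect markup
    $doc->loadHTML('<div>' . $html . '</div>');  // wrapper div holds the fragment
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Collect the nodes first, then modify: text nodes without an <a> ancestor.
    $nodes = iterator_to_array($xpath->query('//text()[not(ancestor::a)]'));

    foreach ($nodes as $node) {
        $parts = preg_split(
            '#((?:http|ftp)s?://[^<>\s,!\)]+)#i',
            $node->nodeValue,
            -1,
            PREG_SPLIT_DELIM_CAPTURE
        );
        if (count($parts) < 2) {
            continue;                            // no URL in this text node
        }
        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($i % 2 === 1) {                  // odd indexes hold the captured URLs
                $a = $doc->createElement('a');
                $a->setAttribute('href', $part);
                $a->setAttribute('target', '_blank');
                $a->appendChild($doc->createTextNode($part));
                $fragment->appendChild($a);
            } elseif ($part !== '') {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $node->parentNode->replaceChild($fragment, $node);
    }

    // Serialize only the children of the wrapper div.
    $out = '';
    foreach ($doc->getElementsByTagName('div')->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

Because the regex only ever sees the contents of text nodes, it cannot touch `href` or `title` attributes, and the XPath filter excludes text that is already inside a link.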

Tim Pietzcker · Accepted Answer · 2013-07-01T12:59:43.050

The simplest way (with a regex, which arguably is not the most reliable tool in this situation) would probably be to make sure that no </a> follows after your link:

#(http|ftp)+(s)?://[^<>\s,!\)]++(?![^<]*</a>)#ie

I'm using a possessive quantifier (`++`) to make sure that the entire URL will be matched (i.e. no backtracking into the URL in order to satisfy the lookahead).
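
Since the `/e` modifier is deprecated (and removed as of PHP 7), the same pattern can be used with `preg_replace_callback` instead. The following is a minimal sketch under a few assumptions: the function name `autolink()` and the default `$maxlength` of 30 are illustrative, the `(http|ftp)+(s)?` prefix is simplified to `(?:http|ftp)s?`, and the `htmlspecialchars` escaping is an addition that the original code did not have.

```php
<?php
// Sketch only: autolink() is a hypothetical name, not from the original post.
// Links bare URLs, skipping any URL that is followed by "</a>" before the next "<".
function autolink(string $text, int $maxlength = 30): string
{
    $pattern = '#(?:http|ftp)s?://[^<>\s,!\)]++(?![^<]*</a>)#iu';

    $result = preg_replace_callback(
        $pattern,
        function (array $m) use ($maxlength): string {
            $url  = $m[0];
            $half = intdiv($maxlength, 2);
            // Shorten long URLs with an ellipsis in the middle.
            $label = mb_strlen($url, 'UTF-8') > $maxlength
                ? mb_substr($url, 0, $half, 'UTF-8') . '…' . mb_substr($url, -$half, null, 'UTF-8')
                : $url;
            // The input is already HTML, so avoid double-encoding existing entities.
            $href = htmlspecialchars($url, ENT_QUOTES, 'UTF-8', false);
            return '<a href="' . $href . '" title="' . $href . '" target="_blank">'
                . htmlspecialchars($label, ENT_QUOTES, 'UTF-8', false) . '</a>';
        },
        $text
    );

    // preg_replace_callback returns null on error (e.g. invalid UTF-8 input).
    return $result ?? $text;
}

echo autolink('... <a href="http://www.google.com">http://www.google.com</a> ...'), "\n";
echo autolink('see http://example.com now'), "\n";
```

With this sketch, the first call should leave the existing link untouched, while the second wraps the bare URL in an `<a>` tag. Note that the lookahead only guards against `<a>` elements; URLs inside other tags' attributes (for example an `<img src>`) would still be matched.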

edited Jul 01 '13 at 12:59

answered Jul 01 '13 at 12:53


http://test.com oops – Karoly Horvath Jul 01 '13 at 12:55
There's a `)` missing somewhere.. – Rijk Jul 01 '13 at 12:58
Right. The outer parentheses can be removed entirely. – Tim Pietzcker Jul 01 '13 at 13:00
This is working perfectly! :) I'll look into the possessive quantifiers thing. – Rijk Jul 01 '13 at 13:06
