12

I am trying to convert all URLs in a textarea input ($_POST['content']) into links.

$content = preg_replace('!(\s|^)((https?://)+[a-z0-9_./?=&-]+)!i', ' <a href="$2" target="_blank">$2</a> ', nl2br($_POST['content'])." ");
$content = preg_replace('!(\s|^)((www\.)+[a-z0-9_./?=&-]+)!i', '<a target="_blank" href="http://$2"  target="_blank">$2</a> ', $content." ");

Target link formats: www.hello.com or http(s)://(www).hello.com

But this seems to break any iframe, image, or similar tag.

What is the right regex (or regexes) that will ignore URLs inside HTML tags?

Note: I know I need two expressions: one to detect links without a protocol (like www.hello.com, so I need to prepend it) and another one to detect URLs with a protocol (so no need to prepend).

Casimir et Hippolyte
Toni Michel Caubet

4 Answers

19

Your code as it is should not be much of a problem within iframes and so on, because in there you usually have a " in front of your URL and not a space, as your pattern requires.

However, here is a different solution. It might not work 100% if you have a single < or > within HTML comments or something similar, but in any other case it should serve you well (I do not know whether this is a problem for you or not). It uses a negative lookahead to make sure that the URL is not followed by a closing > without an opening < in between (because that would mean you are inside a tag).

$content = preg_replace('$(\s|^)(https?://[a-z0-9_./?=&-]+)(?![^<>]*>)$i', ' <a href="$2" target="_blank">$2</a> ', $content." ");
$content = preg_replace('$(\s|^)(www\.[a-z0-9_./?=&-]+)(?![^<>]*>)$i', '<a href="http://$2" target="_blank">$2</a> ', $content." ");

In case you are not familiar with this technique, here is a bit more elaboration.

(?!        # starts the lookahead assertion; now your pattern will only match, if this subpattern does not match
[^<>]      # any character that is neither < nor >; the > is not strictly necessary but might help for optimization
*          # arbitrary many of those characters (but in a row; so not a single < or > in between)
>          # the closing >
)          # ends the lookahead subpattern
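To see the lookahead in isolation, here is a minimal sketch. The sample string is made up for illustration, and the ~ delimiter is used only to keep the pattern easy to read; the lookahead itself is the one explained above.

```php
<?php
// Hypothetical sample input: one URL inside a tag attribute, one in plain text.
$content = '<img src="http://example.com/a.png"> see http://example.com/page ';

// Reject any match that is followed by '>' with no '<' or '>' in between,
// because that means the match sits inside a tag.
$pattern = '~(https?://[a-z0-9_./?=&-]+)(?![^<>]*>)~i';

preg_match_all($pattern, $content, $matches);

// $matches[1] holds the captured URLs; only the plain-text one is expected here.
print_r($matches[1]);
```

The URL in the src attribute is skipped because every possible end of that match is followed by `">`, which the lookahead detects.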

Note that I changed the regex delimiters, because I am now using ! within the regex.

Unless you need the first subpattern (\s|^) for the URLs outside of tags as well, you can now remove that, too (and decrease the capture variables in the replacement).

$content = preg_replace('$(https?://[a-z0-9_./?=&-]+)(?![^<>]*>)$i', ' <a href="$1" target="_blank">$1</a> ', $content." ");
$content = preg_replace('$(www\.[a-z0-9_./?=&-]+)(?![^<>]*>)$i', '<a href="http://$1" target="_blank">$1</a> ', $content." ");
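One thing to watch when (\s|^) is dropped: for input such as http://www.hello.com, the first call produces link text that contains www.hello.com followed by `</a>`, so the lookahead does not reject it and the second call would wrap it in another anchor. A possible way around this, sketched here with a hypothetical helper named linkify() (not part of the answer above), is to handle both URL forms in a single pass with preg_replace_callback():

```php
<?php
// Hypothetical helper: linkify both URL forms in one pass, so text produced
// by one replacement is never scanned again by the other.
function linkify(string $content): string
{
    // Group 1 = URL with protocol, group 2 = bare www. URL.
    // The trailing lookahead is the same "not inside a tag" check as above.
    $pattern = '~(?:(https?://[a-z0-9_./?=&-]+)|(www\.[a-z0-9_./?=&-]+))(?![^<>]*>)~i';

    return preg_replace_callback($pattern, function (array $m): string {
        // Group 1 is an empty string when the www. branch matched.
        $href = $m[1] !== '' ? $m[1] : 'http://' . $m[2];
        return '<a href="' . $href . '" target="_blank">' . $m[0] . '</a>';
    }, $content);
}

echo linkify('Visit http://www.hello.com or www.hello.com/index.html');
```

Because the scan continues after each complete match, the www. part of an http:// URL is consumed together with its protocol and is never matched on its own.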

And lastly... do you intend not to replace URLs that contain anchors at the end? E.g. www.hello.com/index.html#section1? If you missed this by accident, add the # to your allowed URL characters:

$content = preg_replace('$(https?://[a-z0-9_./?=&#-]+)(?![^<>]*>)$i', ' <a href="$1" target="_blank">$1</a> ', $content." ");
$content = preg_replace('$(www\.[a-z0-9_./?=&#-]+)(?![^<>]*>)$i', '<a href="http://$1" target="_blank">$1</a> ', $content." ");

EDIT: Also, what about + and %? There are also a few other characters that are allowed to appear in a URL without being encoded. See this. END OF EDIT
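As a rough illustration of that edit, + and % can simply be added to the character class; inside a class they are literal and need no backslash. The sample URL below is made up, and the ~ delimiter is only a readability choice.

```php
<?php
// + and % are ordinary characters inside [...]; the trailing - is literal too.
$pattern = '~(https?://[a-z0-9_./?=&#+%-]+)(?![^<>]*>)~i';

// Hypothetical input containing an encoded space, a plus sign, and a fragment.
$content = 'Search: http://example.com/find?q=a+b%20c#top ';

$linked = preg_replace($pattern, '<a href="$1" target="_blank">$1</a>', $content);
echo $linked;
```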

I think this should do the trick for you. However, if you could provide an example that shows working and broken URLs (with the code you have), we could actually provide solutions that are tested to work for all of your cases.

One final thought. The proper solution would be to use a DOM parser. Then you could simply apply the regex you already have only to text nodes. However, your concern for the HTML structure is very restricted, and that makes your problem regular again (as long as you do not have unmatched '<' or '>' in HTML comments or JavaScript or CSS on the page). If you do have those special cases, you should really look into a DOM parser. None of the solutions presented here (so far) will be safe in that case.
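For reference, the DOM approach could look roughly like the sketch below. It assumes PHP's DOM extension (DOMDocument and DOMXPath), UTF-8 input, and a hypothetical helper name linkifyTextNodes(); it only handles URLs with a protocol to keep it short, and the `<?xml encoding="UTF-8">` prefix is a common workaround for libxml's encoding detection rather than anything required by the question.

```php
<?php
// Hypothetical helper: linkify URLs in text nodes only, leaving tags and attributes untouched.
function linkifyTextNodes(string $html): string
{
    $doc = new DOMDocument();
    // Wrap the fragment in a known container so it can be extracted again afterwards.
    // @ silences libxml warnings about sloppy markup.
    @$doc->loadHTML('<?xml encoding="UTF-8"><div id="wrap">' . $html . '</div>');

    $xpath = new DOMXPath($doc);
    $wrap  = $xpath->query('//div[@id="wrap"]')->item(0);

    // Text nodes outside existing links, scripts, and styles. Copy them to an
    // array first, because the tree is modified inside the loop.
    $query = './/text()[not(ancestor::a or ancestor::script or ancestor::style)]';
    $textNodes = iterator_to_array($xpath->query($query, $wrap));

    foreach ($textNodes as $node) {
        // Split at URLs, keeping the URLs themselves (odd indexes) in the result.
        $parts = preg_split('~(https?://[a-z0-9_./?=&#-]+)~i', $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue; // no URL in this text node
        }
        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($i % 2 === 1) {
                $a = $doc->createElement('a');
                $a->setAttribute('href', $part);
                $a->setAttribute('target', '_blank');
                $a->appendChild($doc->createTextNode($part));
                $fragment->appendChild($a);
            } elseif ($part !== '') {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $node->parentNode->replaceChild($fragment, $node);
    }

    // Serialize only the children of the wrapper element.
    $out = '';
    foreach ($wrap->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo linkifyTextNodes('<img src="http://example.com/a.png"> see http://example.com/page');
```

Since the regex only ever sees the content of text nodes, stray < or > characters in comments, scripts, or attributes no longer matter.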

Martin Ender
  • This is exactly what I needed. Thank you! Is it ok to add the + and % to the string like that, or do they need a / – betaman Oct 24 '13 at 21:13
  • 1
    @betaman I suppose you meant a backslash? If you put them inside the character class, then they don't need to be escaped, no. Outside of a character class, the `+` has to be escaped, but the `%` does not. – Martin Ender Oct 25 '13 at 06:54
  • After trying many solutions this one is the one that did it. As I want to keep existing HTML intact and replace only links in the text. And I learned a bit more on regular expression. Thanks m.buettner! – betaman Oct 26 '13 at 08:45
  • This seems to work. An example of this code can be found on http://sandbox.onlinephpfunctions.com/code/ef9875afaebd2845729b523674e574e76606ca38 – user1432181 May 03 '21 at 11:09
17
  1. In my opinion, a URL is everything that starts with https?:// and ends with a space or the end of the line (a newline).
  2. Because of the first point, images, links, etc. will not be replaced, because their URLs are always preceded by " or > (unless the attribute value starts with a space, as in <a href=" http...">, but that is invalid HTML).
  3. The m modifier makes ^ and $ match at the start and end of every line, not just of the whole string (so that the matching described in the first point works).
  4. The nl2br() function should be called after the replacement (because of links that start at the beginning of a line).
  5. A space before or after the link is added only if it originally exists in $content (see $1 and $3 in the second parameter of the preg_replace() function).
  6. This solution supports domain names with special characters, like www.moški.si.

Input:

INPUT

Code:

<?php

$content =
    preg_replace(
        '~(\s|^)(https?://.+?)(\s|$)~im', 
        '$1<a href="$2" target="_blank">$2</a>$3', 
        $content
    );
$content = 
    preg_replace(
        '~(\s|^)(www\..+?)(\s|$)~im', 
        '$1<a href="http://$2" target="_blank">$2</a>$3', 
        $content
    );
$content = nl2br($content);

Output:

Output

Edit:

Example of links without https?:// prefixes + example of single preg_replace() call (patterns & replacements are array):

$content = 
    preg_replace(
        array(
            '~(\s|^)(www\..+?)(\s|$)~im', 
            '~(\s|^)(https?://)(.+?)(\s|$)~im', 
        ),
        array(
            '$1http://$2$3', 
            '$1<a href="$2$3" target="_blank">$3</a>$4', 
        ),
        $content
    );
$content = nl2br($content);
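One edge case with the patterns above: the trailing (\s|$) consumes the separator, so when two URLs are separated by a single space, the second one has no whitespace left to match (\s|^) and is skipped. A possible tweak, shown here as a sketch with made-up sample input, is to turn the trailing group into a lookahead so the separator is checked but not consumed:

```php
<?php
// Hypothetical input: two URLs separated by a single space.
$content = 'http://a.com http://b.com';

// Pattern as in the answer: the trailing whitespace is part of the match.
$before = preg_replace(
    '~(\s|^)(https?://.+?)(\s|$)~im',
    '$1<a href="$2" target="_blank">$2</a>$3',
    $content
);

// Variant: (?=\s|$) only asserts the separator, leaving it for the next match.
$after = preg_replace(
    '~(\s|^)(https?://.+?)(?=\s|$)~im',
    '$1<a href="$2" target="_blank">$2</a>',
    $content
);

echo $before, "\n", $after, "\n";
```

With the original pattern only the first URL is linked; with the lookahead variant both are.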


Glavić
  • The more downvotes, the less chances you getting that bounty if it gets to auto allocate. – oxygen Oct 01 '12 at 10:12
  • 7
    I don't care about the bounty! I care about knowledge. If my answer is incorrect, I would like to know WHY. Is that too much to ask from downvoters? – Glavić Oct 01 '12 at 11:15
  • I just told you what reason downvoters may have had to downvote. – oxygen Oct 01 '12 at 11:38
  • If that is true, I can write only this: OMG and LOL! If this is really the reason, I will never again reply to questions with bounty. – Glavić Oct 01 '12 at 13:19
  • @glavic I upvoted both your answer and m-buettner but note that he answered this correctly before you. I tested both of your answers and they both work albeit yours looks like a smaller (better) regex and doesn't include the restrictive a-z0-9 portion since domains names now can have many more characters and be in different languages – Anthony Hatzopoulos Oct 01 '12 at 14:41
  • @AnthonyHatzopoulos: I upvoted his answer too, but that is not the point. I don't like downvotes without the "backing it up" part. I would like to know where I went wrong... Good point! Domain names with special characters will not be matched by the other examples. I edited the answer and added point 6. Thanks for the tip and the upvote ;-) p.s. even SO doesn't support domain names with special characters ;-) – Glavić Oct 01 '12 at 15:07
  • Great! But how can we avoid catching "www.sample..."? – Robin Carlo Catacutan Jun 26 '14 at 06:16
3

Let me suggest something less straightforward: split the input text into HTML and non-HTML parts, process the non-HTML parts with your regexp, and then combine the text back into one piece. Something like:

  <?php
  $chunks = preg_split('/(<.*>)/Ums', $_POST['content'], -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
  $result = '';
  foreach ($chunks as $chunk) {
    if (substr($chunk,0,1) != '<') {
      /* do your processing on $chunk */
    }
    $result .= $chunk;
  }
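Filled in, the placeholder above might look like the following sketch. The function name linkifyOutsideTags() and the sample input are hypothetical, and only URLs with a protocol are handled to keep it short. Note that text between an existing <a> and </a> is still a non-tag chunk, so it would be linkified as well.

```php
<?php
// Hypothetical helper: linkify only the chunks that are not HTML tags.
function linkifyOutsideTags(string $html): string
{
    // U makes .* lazy, so each <...> tag becomes its own chunk;
    // PREG_SPLIT_DELIM_CAPTURE keeps the tags in the result.
    $chunks = preg_split('/(<.*>)/Us', $html, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
    $result = '';
    foreach ($chunks as $chunk) {
        if (substr($chunk, 0, 1) !== '<') {
            // Plain text: $0 in the replacement is the whole matched URL.
            $chunk = preg_replace(
                '~https?://[a-z0-9_./?=&-]+~i',
                '<a href="$0" target="_blank">$0</a>',
                $chunk
            );
        }
        $result .= $chunk;
    }
    return $result;
}

echo linkifyOutsideTags('<img src="http://example.com/a.png"> see http://example.com/page');
```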

Some additional advice:

  1. Try to save the source text and do the transformation when displaying it. This will allow you to improve or fix your rendering code if you find a new problem or idea in the future.
  2. (https?://)+ shouldn't be in a group and doesn't need the +, because it also matches "https://https://some.com". Just use https?://[a-z0-9_./?=&-]+
  3. The same goes for (www\.)+ :)
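A quick way to check the effect of the repeated group, using the example string from point 2 (the variable names are just for illustration):

```php
<?php
$text = 'https://https://some.com';

// Form from the question: the + lets the protocol group repeat.
preg_match('!((https?://)+[a-z0-9_./?=&-]+)!i', $text, $withPlus);

// Suggested form: exactly one protocol prefix.
preg_match('!(https?://[a-z0-9_./?=&-]+)!i', $text, $plain);

echo $withPlus[1], "\n"; // the doubled protocol is accepted as one URL
echo $plain[1], "\n";    // the match stops at the second colon
```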
disjunction
3

This has been done hundreds of times before. On this page, both m-buettner's and Glavić's answers work fine, although I like Glavić's shorter expression.

Here's a good php resource to do it: http://code.iamcal.com/php/lib_autolink/

Repeats on Stackoverflow:

Decent in-depth article: http://buildinternet.com/2010/05/how-to-automatically-linkify-text-with-php-regular-expressions/

Anthony Hatzopoulos