3

I am trying to convert URLs in a piece of text into hyperlinks using regular expressions. I have managed to do this, but it breaks when the text already contains links.

so

bla bla blah www.google.com bla blah <a href="www.google.com">www.google.com</a>

should result in

bla bla blah <a href="http://www.google.com">www.google.com</a> bla blah <a href="www.google.com">www.google.com</a> 

not

bla bla blah <a href="http://www.google.com">www.google.com</a> bla blah <a href="<a href="http://www.google.com">www.google.com</a></a>"><a href="http://www.google.com">www.google.com</a></a>
Peter Boughton
Ben
  • Have you even *tried* googling for this problem? This has been through here so many times that it's not even funny anymore (sorry if this sounds dismissive, it's just a fact). Look at: http://www.google.com/search?q=url+links+regex+replace+site%3Astackoverflow.com – Tomalak Jun 11 '09 at 13:05
  • 1
    Tomalak, read the question. This problem is more complicated than what you find with that google search – amarillion Jun 11 '09 at 13:26
  • 1
    @amarillion: Bits and parts of the problem have been discussed here to no end. Even this exact question has been here. And every time it boils down to "don't do HTML with regex", and "matching URLs in a text is hard and impossible in the corner cases". This question will without a doubt boil down to that as well. – Tomalak Jun 11 '09 at 13:49
  • @Ben: Don't take it personally, I did not intend to fend off a newbie. Now that I've had breakfast - welcome to Stack Overflow. ;-) – Tomalak Jun 11 '09 at 18:30

4 Answers

3

Finally finished it:

function add_url_links($data)
{
        // Hide existing links behind base64 placeholders so they are not matched again.
        $data = preg_replace_callback('/(<a href=.+?<\/a>)/','guard_url',$data);

        // Link each bare URL that ends at whitespace or at the end of the string.
        $data = preg_replace_callback('/(http:\/\/.+?)([ \\n\\r]|$)/','link_url',$data);

        // Restore the placeholders; base64 output can contain "/" and "=" as well.
        $data = preg_replace_callback('/{{([a-zA-Z0-9+\/=]+?)}}/','unguard_url',$data);

        return $data;
}

function guard_url($arr) { return '{{'.base64_encode($arr[1]).'}}'; }
function unguard_url($arr) { return base64_decode($arr[1]); }
function link_url($arr) { return guard_url(array('','<a href="'.$arr[1].'">'.$arr[1].'</a>')).$arr[2]; }
  • Your solution is innovative but I feel that it could be much simpler and faster if your regex language has look-behinds - simply add `(?<!href=")` to the beginning of your conversion expression. – Nicole Feb 24 '10 at 17:20
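The look-behind idea from the comment above could look roughly like the sketch below. It is only an illustration under some assumptions: the function name `link_bare_urls` is made up, existing links are assumed to use a double-quoted `href`, and a second look-behind for `">` is added (not part of the comment) so that a URL used as the link text directly after the opening tag is skipped too. Links written any other way would still be matched twice.

```php
<?php
// Hypothetical helper, not from the thread: links bare http:// URLs only.
function link_bare_urls(string $text): string
{
    // Skip a URL that sits in an href="..." value, or that directly
    // follows the end of an opening tag (assumed to end with ">).
    $pattern = '~(?<!href=")(?<!">)http://[^\s<"]+~';

    return preg_replace_callback($pattern, function (array $m): string {
        // Escape the URL before placing it into HTML.
        $url = htmlspecialchars($m[0], ENT_QUOTES);
        return '<a href="' . $url . '">' . $url . '</a>';
    }, $text);
}

echo link_bare_urls('test http://foo <a href="http://link">http://link</a> test'), "\n";
```

Only the first URL is wrapped here; the existing link is left untouched. Trailing punctuation (for example a full stop right after the URL) would still end up inside the link, so this is a starting point rather than a complete solution.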
3

This is almost impossible to do with a single regular expression. I would instead recommend a state-machine-based approach, something like this (in pseudo-code):

state = OUTSIDE_LINK
for pos (0 .. length input)
   switch state
   case OUTSIDE_LINK
     if substring at pos matches /<a/
       state = INSIDE_LINK
     else if substring at pos matches /(www\.\S+|\S+\.com|\S+\.org)/
       substitute link
   case INSIDE_LINK
     if substring at pos matches /<\/a>/
       state = OUTSIDE_LINK
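
For anyone who wants something runnable, the pseudo-code above might translate to PHP along these lines. This is a sketch, not the answerer's code: the function name `link_outside_anchors` is invented, the URL test is a deliberately loose adaptation of the pseudo-code's pattern, and the branch that copies other tags untouched is an extra assumption so that URLs inside attributes such as `src="..."` are not linked.

```php
<?php
// Hypothetical function; a sketch of the OUTSIDE_LINK / INSIDE_LINK state machine.
function link_outside_anchors(string $input): string
{
    // Loose URL test adapted from the pseudo-code (www.*, *.com, *.org).
    $url = '~\G(?:www\.[^\s<>"]+|[^\s<>"]+\.(?:com|org)\b[^\s<>"]*)~i';
    $out = '';
    $insideLink = false;
    $pos = 0;
    $len = strlen($input);

    while ($pos < $len) {
        $chunk = $input[$pos];                 // default: copy one character
        if ($insideLink) {
            // INSIDE_LINK: copy verbatim until the closing </a>.
            if (preg_match('~\G</a>~i', $input, $m, 0, $pos)) {
                $insideLink = false;
                $chunk = $m[0];
            }
        } elseif (preg_match('~\G<a\b~i', $input, $m, 0, $pos)) {
            $insideLink = true;                // entering an existing link
            $chunk = $m[0];
        } elseif (preg_match('~\G<[^>]*>~', $input, $m, 0, $pos)) {
            $chunk = $m[0];                    // assumption: skip other tags whole
        } elseif (preg_match($url, $input, $m, 0, $pos)) {
            // Add a scheme only when the matched text has none.
            $href = preg_match('~^https?://~i', $m[0]) ? $m[0] : 'http://' . $m[0];
            $chunk = $m[0];
            $out .= '<a href="' . $href . '">' . $chunk . '</a>';
            $pos += strlen($chunk);
            continue;
        }
        $out .= $chunk;
        $pos += strlen($chunk);
    }
    return $out;
}

echo link_outside_anchors('bla bla blah www.google.com bla blah <a href="www.google.com">www.google.com</a>'), "\n";
```

The `\G` anchor together with the offset argument of `preg_match` makes each test apply exactly at the current position, which mirrors "substring at pos matches" in the pseudo-code. The scan is character by character, so it is simple rather than fast.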
amarillion
  • 1
    @Tomalak - apologies, I did try my best to search for similar questions before - and found similar posts, but none that answered my question. @amarillion Thanks very much, that works. I am sure there must be a way to do it using negative lookbehinds? However, this answer is perfect for what I was trying to do. – Ben Jun 11 '09 at 15:40
2

Another way of doing it (in PHP):

    $strParts = preg_split( '/(<[^>]+>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY );
    foreach( $strParts as $key=>$part ) {

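        /* Skip parts that are tags themselves or that directly follow an opening <a> tag.
           The $key > 0 check avoids reading $strParts[-1] on the first part. */
        if( !(preg_match( '@(<[^>]+>)@', $part ) || ($key > 0 && preg_match( '@(<a[^>]+>)@', $strParts[$key - 1] ))) ) {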
            $strParts[$key] = preg_replace( '@((http(s)?://)?(\S+\.{1}[^\s\,\.\!]+))@', '<a href="http$3://$4">$1</a>', $strParts[$key] );
        }

    }
    $html = implode( $strParts );
Ben
1

Another trick is to guard all the existing links by encoding them, then replace URLs with links, and finally decode the guarded links.

$data = 'test http://foo <a href="http://link">LINK</a> test';

$data = preg_replace_callback('/(<a href=".+?<\/a>)/','guard_url',$data);

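// A URL ends at whitespace or at the end of the string (not at a dot, which would cut off at the first "." in the host).
$data = preg_replace_callback('/(http:\/\/.+?)([ \\n\\r]|$)/','link_url',$data);

// base64 output can also contain "/" and "=" padding.
$data = preg_replace_callback('/{{([a-zA-Z0-9+\/=]+?)}}/','unguard_url',$data);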

print $data;

function guard_url($arr) { return '{{'.base64_encode($arr[1]).'}}'; }
function unguard_url($arr) { return base64_decode($arr[1]); }
function link_url($arr) { return '<a href="'.$arr[1].'">'.$arr[1].'</a>'.$arr[2]; }

The code above is just a proof of concept, and doesn't handle all situations. Still, you can see that the code is pretty straightforward.