0

I have the following code to turn URLs in a message into HTML links:

$message = preg_replace("#(http|https|ftp|ftps)://([.]?[&;%=a-zA-Z0-9_/?-])*#",
    "<a href=\"away?to=\\0\" target=\"_blank\">\\0</a>", $message);

$message = preg_replace("#(^| |\n)(www([.]?[&;%=a-zA-Z0-9_/?-])*)#",
    "\\1<a href=\"away?to=http://\\2\" target=\"_blank\">\\2</a>", $message);

It works well for almost all links, except in the following cases:

1) http://example.com/mediathek#/video/1976914/zoom:-World-Wide

The problem here is the # and the : within the link: only the part before the # is turned into a link.

2) If someone just writes "www" in a message

Example: <a href="http://www">www</a>

So the question is: is there a way to fix these two cases in the code above?

Casimir et Hippolyte
lickmycode
  • possible duplicate of [Replace URLs in text with HTML links](http://stackoverflow.com/questions/1188129/replace-urls-in-text-with-html-links) – hek2mgl Oct 30 '13 at 20:55
  • You will never find a regexp that will match all urls, and only urls. There are just too many different options. That being said, it might be faster to look for a good one online. – aurbano Oct 30 '13 at 20:55
  • @hek2mgl: Nope, don't need a complete other function, just a fix for the code above. – lickmycode Oct 30 '13 at 20:59
  • @Chevi: I'm not trying to find a regexp to match all urls, but I'm sure the code above can be easily extended for the two cases. – lickmycode Oct 30 '13 at 21:00

3 Answers

2

Since you want to include the hash (#) in the regex, you need to either escape it or change the delimiter to a character that does not occur in your pattern, e.g. !. So, your regex should look like this:

$message = preg_replace("!(http|https|ftp|ftps)://([.]?[&;%#:=a-zA-Z0-9_/?-])*!",
"<a href=\"away?to=\\0\" target=\"_blank\">\\0</a>", $message);

Does this help?
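As a quick check, here is a minimal sketch that runs the pattern above against the first URL from the question. The helper name `linkify_urls` is made up for illustration; the pattern and replacement are the ones shown above.

```php
<?php
// Hypothetical helper wrapping the pattern above; the name is illustrative only.
function linkify_urls(string $message): string
{
    // "!" is the delimiter, so "#" can appear unescaped inside the character class.
    return preg_replace(
        "!(http|https|ftp|ftps)://([.]?[&;%#:=a-zA-Z0-9_/?-])*!",
        "<a href=\"away?to=\\0\" target=\"_blank\">\\0</a>",
        $message
    );
}

$url = 'http://example.com/mediathek#/video/1976914/zoom:-World-Wide';

// The whole URL, including the part after "#" and ":", ends up inside the link.
echo linkify_urls("See $url today"), "\n";
```

The match stops at the first character that is not in the class, here the space after the URL.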

Though, if you would like to follow the specification (RFC 1738) more closely, note that % is only valid there as the start of a percent-encoded escape (such as %20), so you might want to exclude it as a bare character. There are also some more allowed characters which you didn't include:

  • $
  • _
  • . (dot)
  • +
  • !
  • *
  • '
  • (
  • )

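If you include these characters, ! can no longer be the delimiter unless you escape it inside the pattern. You would then need another delimiter, such as % (provided you dropped it from the character class).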

matewka
  • You are getting my upvote for your answer and the nice explanation; everything you wrote is correct and works. However, my question was only half answered (the second case is missing), and excluding the `%` is not possible, because Wikipedia uses `%` a lot for foreign languages and many websites use `%20` for spaces within a URL. Anyway, thanks a lot for your explanation; it will help a lot of people understand how it works. – lickmycode Oct 31 '13 at 04:06
1

A couple of minor tweaks: add \# and : to the character class in the first regex, then change the * to + in the second regex so that a bare "www" no longer matches:

$message = preg_replace("#(http|https|ftp|ftps)://([.]?[&;%=a-zA-Z0-9_/?\#:-])*#",
    "<a href=\"away?to=\\0\" target=\"_blank\">\\0</a>", $message);

$message = preg_replace("#(^| |\n)(www([.]?[&;%=a-zA-Z0-9_/?-])+)#",
    "\\1<a href=\"away?to=http://\\2\" target=\"_blank\">\\2</a>", $message);
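To see both cases from the question side by side, the two calls above can be wrapped in a small function. This is only a sketch: the function name `linkify` is hypothetical and the sample messages are made up.

```php
<?php
// Hypothetical wrapper around the two calls above; the function name is illustrative.
function linkify(string $message): string
{
    $message = preg_replace("#(http|https|ftp|ftps)://([.]?[&;%=a-zA-Z0-9_/?\#:-])*#",
        "<a href=\"away?to=\\0\" target=\"_blank\">\\0</a>", $message);

    return preg_replace("#(^| |\n)(www([.]?[&;%=a-zA-Z0-9_/?-])+)#",
        "\\1<a href=\"away?to=http://\\2\" target=\"_blank\">\\2</a>", $message);
}

// Case 1: the "#" and ":" are now part of the link.
echo linkify('http://example.com/mediathek#/video/1976914/zoom:-World-Wide'), "\n";

// Case 2: a bare "www" is left alone, while a real host name is still linked.
echo linkify('www'), "\n";
echo linkify('visit www.example.com now'), "\n";
```

Note that `+` only requires one more character after `www`, so a word such as `wwwx` would still be turned into a link.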
elixenide
  • Even though I like the explanation in matewka's answer, which is also correct, I will mark your answer as the accepted solution because it addresses both cases in my question. Thank you. – lickmycode Oct 31 '13 at 03:59
1

In my opinion, it is futile to tackle this problem with a regex alone. A good alternative is to find what could be a URL via regex (something beginning with a protocol: http, ftp, mail... or with www) and then test it with FILTER_VALIDATE_URL. Keep in mind that this filter is not a watertight check, as the PHP manual says:

"Note that the function will only find ASCII URLs to be valid; internationalized domain names (containing non-ASCII characters) will fail."

Example of code (not tested):

$message = preg_replace_callback(
    '~(?(DEFINE)
          (?<prot> (?>ht|f) tps?+ :// )         # you can add protocols here
      )
      (?>
          <a\b (?> [^<]++ | < (?!/a>) )++ </a>  # avoid links inside "a" tags
        |
          <[^>]++>                              # and tags attributes.
      ) (*SKIP)(?!)                             # makes fail the subpattern.
      |                                         # OR
      \b(?>(\g<prot>)|www\.)(\S++)              # something that begins with
                                                # "http://" or "www."
     ~xi',
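    function ($match) {
        // Group 1 is the (DEFINE)d "prot" group, so the captured protocol
        // is $match[2]; it is empty when the match started with "www.".
        $url = (empty($match[2]) ? 'http://' : '') . $match[0];
        // Validate the complete URL, scheme included.
        if (filter_var($url, FILTER_VALIDATE_URL)) {
            return '<a href="away?to=' . $url . '" target="_blank">'
                 . $url . '</a>';
        }
        return $match[0];
    },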
    $message);
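To get a feel for what the validation step accepts, the filter can be tried on a few candidates in isolation. This is a minimal sketch; the helper name `is_linkable` and the sample strings are assumptions made for illustration, not part of the code above.

```php
<?php
// Hypothetical helper: true when PHP's URL filter accepts the candidate.
function is_linkable(string $candidate): bool
{
    // filter_var() returns the filtered string on success and false on failure.
    return filter_var($candidate, FILTER_VALIDATE_URL) !== false;
}

// Made-up sample candidates, as the regex stage might hand them over.
$candidates = [
    'http://example.com/',   // plain ASCII URL with a scheme
    'www',                   // no scheme, so it is rejected
    'http://exämple.com/',   // non-ASCII host, rejected as the manual warns
];

foreach ($candidates as $candidate) {
    echo $candidate, ' => ', is_linkable($candidate) ? 'valid' : 'invalid', "\n";
}
```

Because the filter insists on a scheme, a `www.` match has to be prefixed with `http://` before it is validated.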
Casimir et Hippolyte