4

I've been working on a small project and need a PHP function that can linkify URLs in my data while letting me specify exceptions for links I don't want linkified. Any idea how to do this?

Hirvesh
  • Are links easily identifiable? Properly formed "http://example.com/someuri"? Or just some random-ish text that could be a hostname and possibly a directory or query? – Marc B Feb 22 '11 at 16:23
  • I need to be able to filter out youtube links - so it might be http://www.youtube.com/watch?v=cqhsunMOWKQ&feature=feedu – Hirvesh Feb 22 '11 at 16:25
  • and many other links of different media sites like flickr, vimeo, etc – Hirvesh Feb 22 '11 at 16:26
  • Are you sure you want to do that in PHP? It's relatively easy to do this in JavaScript using jQuery. – Stofke Feb 22 '11 at 17:02
  • Yep, I need it in PHP, because I need the data to work on. Isn't it possible to create a function which takes an array of top-level domains to filter out and just linkifies the rest? I'm stuck here. – Hirvesh Feb 22 '11 at 17:04

3 Answers

12

I have an open source project on GitHub, LinkifyURL, which you may want to consider. It has a function, linkify(), which plucks URLs from text and converts them to links. Note that this is not a trivial task to do correctly! (See: The Problem With URLs, and be sure to read the thread of comments to grasp all the things that can go wrong.)

If you really need to NOT linkify specific domains (e.g. vimeo and youtube), here is a modified PHP function, linkify_filtered (in the form of a working test script), that does what you need:

<?php // test.php 20110313_1200

function linkify_filtered($text) {
    $url_pattern = '/# Rev:20100913_0900 github.com\/jmrware\/LinkifyURL
    # Match http & ftp URL that is not already linkified.
      # Alternative 1: URL delimited by (parentheses).
      (\()                     # $1  "(" start delimiter.
      ((?:ht|f)tps?:\/\/[a-z0-9\-._~!$&\'()*+,;=:\/?#[\]@%]+)  # $2: URL.
      (\))                     # $3: ")" end delimiter.
    | # Alternative 2: URL delimited by [square brackets].
      (\[)                     # $4: "[" start delimiter.
      ((?:ht|f)tps?:\/\/[a-z0-9\-._~!$&\'()*+,;=:\/?#[\]@%]+)  # $5: URL.
      (\])                     # $6: "]" end delimiter.
    | # Alternative 3: URL delimited by {curly braces}.
      (\{)                     # $7: "{" start delimiter.
      ((?:ht|f)tps?:\/\/[a-z0-9\-._~!$&\'()*+,;=:\/?#[\]@%]+)  # $8: URL.
      (\})                     # $9: "}" end delimiter.
    | # Alternative 4: URL delimited by <angle brackets>.
      (<|&(?:lt|\#60|\#x3c);)  # $10: "<" start delimiter (or HTML entity).
      ((?:ht|f)tps?:\/\/[a-z0-9\-._~!$&\'()*+,;=:\/?#[\]@%]+)  # $11: URL.
      (>|&(?:gt|\#62|\#x3e);)  # $12: ">" end delimiter (or HTML entity).
    | # Alternative 5: URL not delimited by (), [], {} or <>.
      (                        # $13: Prefix proving URL not already linked.
        (?: ^                  # Can be a beginning of line or string, or
        | [^=\s\'"\]]          # a non-"=", non-quote, non-"]", followed by
        ) \s*[\'"]?            # optional whitespace and optional quote;
      | [^=\s]\s+              # or... a non-equals sign followed by whitespace.
      )                        # End $13. Non-prelinkified-proof prefix.
      ( \b                     # $14: Other non-delimited URL.
        (?:ht|f)tps?:\/\/      # Required literal http, https, ftp or ftps prefix.
        [a-z0-9\-._~!$\'()*+,;=:\/?#[\]@%]+ # All URI chars except "&" (normal*).
        (?:                    # Either on a "&" or at the end of URI.
          (?!                  # Allow a "&" char only if not start of an...
            &(?:gt|\#0*62|\#x0*3e);                  # HTML ">" entity, or
          | &(?:amp|apos|quot|\#0*3[49]|\#x0*2[27]); # a [&\'"] entity if
            [.!&\',:?;]?        # followed by optional punctuation then
            (?:[^a-z0-9\-._~!$&\'()*+,;=:\/?#[\]@%]|$)  # a non-URI char or EOS.
          ) &                  # If neg-assertion true, match "&" (special).
          [a-z0-9\-._~!$\'()*+,;=:\/?#[\]@%]* # More non-& URI chars (normal*).
        )*                     # Unroll-the-loop (special normal*)*.
        [a-z0-9\-_~$()*+=\/#[\]@%]  # Last char can\'t be [.!&\',;:?]
      )                        # End $14. Other non-delimited URL.
    /imx';
//    $url_replace = '$1$4$7$10$13<a href="$2$5$8$11$14">$2$5$8$11$14</a>$3$6$9$12';
//    return preg_replace($url_pattern, $url_replace, $text);
    $url_replace = '_linkify_filter_callback';
    return preg_replace_callback($url_pattern, $url_replace, $text);
}
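function _linkify_filter_callback($m)
{ // Filter out youtube and vimeo domains.
    // PHP drops trailing unmatched groups from $m; pad to avoid undefined offsets.
    $m    = array_pad($m, 15, '');
    $pre  = $m[1].$m[4].$m[7].$m[10].$m[13];
    $url  = $m[2].$m[5].$m[8].$m[11].$m[14];
    $post = $m[3].$m[6].$m[9].$m[12];
    // Case-insensitive, since the URL pattern itself uses the "i" flag.
    if (preg_match('/\b(?:youtube|vimeo)\.com\b/i', $url)) {
        return $pre . $url . $post; // Excluded domain: leave as plain text.
    } // else linkify...
    return $pre .'<a href="'. $url .'">' . $url .'</a>' .$post;
}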

// Create some test data.
$data = 'Plain URLs (not delimited):
foo http://example.com bar...
foo http://example.com:80 bar...
foo http://example.com:80/path/ bar...
foo http://example.com:80/path/file.txt bar...
foo http://example.com:80/path/file.txt?query=val&var2=val2 bar...
foo http://example.com:80/path/file.txt?query=val&var2=val2#fragment bar...
foo http://example.com/(file\'s_name.txt) bar... (with \' and (parentheses))
foo http://[2001:0db8:85a3:08d3:1319:8a2e:0370:7348] bar... ([IPv6 literal])
foo http://[2001:0db8:85a3:08d3:1319:8a2e:0370:7348]/file.txt bar... ([IPv6] with path)
foo http://youtube.com bar...
foo http://youtube.com:80 bar...
foo http://youtube.com:80/path/ bar...
foo http://youtube.com:80/path/file.txt bar...
foo http://youtube.com:80/path/file.txt?query=val&var2=val2 bar...
foo http://youtube.com:80/path/file.txt?query=val&var2=val2#fragment bar...
foo http://youtube.com/(file\'s_name.txt) bar... (with \' and (parentheses))
foo http://vimeo.com bar...
foo http://vimeo.com:80 bar...
foo http://vimeo.com:80/path/ bar...
foo http://vimeo.com:80/path/file.txt bar...
foo http://vimeo.com:80/path/file.txt?query=val&var2=val2 bar...
foo http://vimeo.com:80/path/file.txt?query=val&var2=val2#fragment bar...
foo http://vimeo.com/(file\'s_name.txt) bar... (with \' and (parentheses))
';
// Verify it works...
echo(linkify_filtered($data) ."\n");

?>

This employs a callback function to do the filtering. Yes, the regex is complex (but so is the problem, as it turns out!). You can see the interactive JavaScript version of linkify() in action here: URL Linkification (HTTP/FTP).
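The same callback idea can take the list of excluded domains as a parameter, which is what the question asks for. The following is only a minimal sketch: the names linkify_except and url_is_excluded are hypothetical, and the short pattern is a simplified stand-in for illustration, not the full LinkifyURL regex. Like the function above, it inserts the matched URL verbatim, so escape it as your data requires.

```php
<?php
// Hypothetical helper: true when $url's host is one of the excluded
// domains, or a subdomain of one (e.g. www.youtube.com).
function url_is_excluded($url, array $excluded_domains) {
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return false; // Unparseable URL: treat as not excluded.
    }
    $host = strtolower($host);
    foreach ($excluded_domains as $domain) {
        $domain = strtolower($domain);
        // Exact match, or host ends with ".domain".
        if ($host === $domain
            || substr($host, -strlen($domain) - 1) === '.' . $domain) {
            return true;
        }
    }
    return false;
}

// Hypothetical wrapper. The pattern is a simplified assumption that only
// handles http/https URLs and trims common trailing punctuation.
function linkify_except($text, array $excluded_domains) {
    $pattern = '~\bhttps?://[^\s<>"]+[^\s<>".,;:!?)]~i';
    return preg_replace_callback(
        $pattern,
        function ($m) use ($excluded_domains) {
            $url = $m[0];
            if (url_is_excluded($url, $excluded_domains)) {
                return $url; // Leave excluded URLs as plain text.
            }
            return '<a href="' . $url . '">' . $url . '</a>';
        },
        $text
    );
}

echo linkify_except(
    'See http://example.com/a and http://www.youtube.com/watch?v=abc',
    array('youtube.com', 'vimeo.com')
), "\n";
```

Matching on the parsed host rather than on the whole URL means a URL that merely mentions an excluded domain in its path or query string is still linkified.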

Also, John Gruber has a pretty good regex to do linkification. See: An Improved Liberal, Accurate Regex Pattern for Matching URLs. However, his regex suffers catastrophic backtracking under certain circumstances. (I've written to him about this, but he has yet to respond.)

Hope this helps! :)

ridgerunner
  • 1
    @ridgerunner How would you modify this to allow for user-entered URLs without a prefix, e.g. just www.something.com? – stco Feb 06 '13 at 19:59
  • 1
    @stco - No time right at the moment, but I've had other people ask this same question. Another developer has provided a solution with [this pull request patch on my LinkifyURL github project](https://github.com/jmrware/LinkifyURL/pull/4), but I have not had time to check it out yet. (I'm not sure if it works or not but you may want to take a look.) – ridgerunner Feb 08 '13 at 15:38
  • +1 for posting the regex of the matrix... I looked at it for hours while scrolling up and down then took the red pill. :) – zx81 Jun 09 '14 at 10:11
2

Well, this is quite OK, but there is still a huge problem: existing anchors within the string. I would like to linkify text without mangling the anchors that are already there.
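One possible way to protect existing anchors is to split the string on them and linkify only the pieces in between. This is a rough sketch under stated assumptions: linkify_outside_anchors and $simple are hypothetical names, the linkifier passed in is a deliberately naive stand-in, and matching HTML with a regex only covers simple, well-formed <a> elements (URLs inside other tags' attributes are not protected).

```php
<?php
// Hypothetical wrapper: applies $linkify (any callable that turns plain
// URLs into anchors) only to text outside existing <a>...</a> elements.
function linkify_outside_anchors($html, $linkify) {
    // With PREG_SPLIT_DELIM_CAPTURE the result alternates between
    // surrounding text (even indexes) and captured anchors (odd indexes).
    $parts = preg_split('~(<a\b[^>]*>.*?</a>)~is', $html, -1,
                        PREG_SPLIT_DELIM_CAPTURE);
    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            $parts[$i] = call_user_func($linkify, $part);
        }
    }
    return implode('', $parts);
}

// Naive stand-in linkifier, for illustration only.
$simple = function ($text) {
    return preg_replace('~\bhttps?://[^\s<>"]+~i',
                        '<a href="$0">$0</a>', $text);
};

echo linkify_outside_anchors(
    'Old <a href="http://a.example/">http://a.example/</a> new http://b.example/x',
    $simple
), "\n";
```

Any linkifier can be passed in place of $simple, so the anchor protection stays separate from the URL-matching pattern.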

Galvani
1
// ereg_replace() was removed in PHP 7; preg_replace() accepts the same POSIX classes.
$string = "some text and a link http://www.google.com";
$new_string = preg_replace('~[[:alpha:]]+://[^<>[:space:]]+[[:alnum:]/]~', '<a href="$0">$0</a>', $string);

or use:

http://code.iamcal.com/php/lib_autolink/

Stofke
  • what about filtering out some kind of links (like youtube/vimeo) so that they are not linkified? – Hirvesh Feb 22 '11 at 17:10
  • I'd filter those out first, then do the regex to linkify. What needs to be filtered out? – Stofke Feb 22 '11 at 17:12
  • not literally remove them from content, but prevent them from being linkified. – Hirvesh Feb 22 '11 at 17:14
  • so you have an array with links to start from? – Stofke Feb 22 '11 at 17:20
  • Nope, it's just links of a specific type. Say I don't want youtube links to get linkified: I need a way to specify that to the script so that youtube links don't get linkified. Similarly, if I want vimeo links not to get linkified along with youtube links, there should be a way to specify that... – Hirvesh Feb 22 '11 at 17:23
  • I don't really understand how PHP is supposed to get access to those links. Do they come from a database, do you process an HTML page, are they in a buffer? This does impact the way you process the links. Is there a fixed list of links to be excluded, or can there be other new links in the future? If the latter, a regex will become difficult, as it has to include more and more exceptions. – Stofke Feb 22 '11 at 18:50
  • The links will be user-entered and mixed in with text, stored in a variable. We then need to process the variable to linkify the links that are allowed to be linkified, and not linkify those of the youtube type. – Hirvesh Feb 23 '11 at 02:46
  • Have a look here : http://stackoverflow.com/questions/1872933/php-regex-for-filtering-out-urls-from-specific-domains-for-use-in-a-vbulletin-plu maybe this helps – Stofke Feb 23 '11 at 23:00