0

I receive text formatted as HTML. I want to restrict the URLs in anchor tags to my own domain, replacing any other links with "xxx" (or something else).
Input: "<a href='otherdomain'>text</a>"
Output: "xxx"
I am using a regular expression to achieve this, but I'm stuck:

$pattern ='/<a.*href=[\'|\"]http.?:\/\/[^mydomain.*\"\']*[\'|\"].*<\/a>/i';
$replace ='xxx';
echo preg_replace($pattern, $replace, $string); 

What is wrong here?

lvil

4 Answers

2

When you write [^mydomain.*\"\'], you are saying "match any single character except a literal 'm', 'y', 'd', 'o', ..., '.', '*', etc."
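To make the point about negated character classes concrete, here is a minimal sketch. The sample strings are made up for illustration; the class is a shortened version of the one in the question.

```php
<?php
// [^mydomain] is a negated character class: it matches ONE character
// that is not 'm', 'y', 'd', 'o', 'a', 'i' or 'n'. It knows nothing
// about the word "mydomain".
$class = '/^[^mydomain]+$/';

// Hypothetical external host: rejected only because it contains
// 'a', 'm' and 'o', which are in the excluded set.
$rejected = preg_match($class, 'example.org');

// A string with none of the excluded letters passes.
$accepted = preg_match($class, 'butter.test');

echo $rejected, ' ', $accepted, PHP_EOL;
```

So the class filters on individual letters, which is why it cannot express "any host other than mydomain".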

Try something like:

#<a [^>]*\bhref=(['"])http.?://((?!mydomain)[^'"])+\1 *>.*?</a>#i

Notes:

  • I turned your a.*href to a [^>]*\bhref to make sure that the 'a' and 'href' are whole words and that the regex doesn't match over multiple tags.
  • I changed the regex delimiter character to '#' instead of '/' so you don't have to escape the / any more
  • Note the ((?!mydomain)[^'"])+. This means "match one or more [^'"] characters, none of which is the start of the text mydomain". The (?! ... ) construct is called a negative look-ahead.
  • Note the \1. This makes sure that the closing quote mark for the URL is the same as the opening quote mark (see how the first set of brackets captures the ['"]?). You'd probably be fine without it if you preferred.

For PHP (updated because I always mix up when backslashes need to be escaped in PHP -- see @GlitchMr's comment below):

$pattern = '#<a [^>]*\bhref=([\'"])http.?://((?!mydomain)[^\'"])+\1 *>.*?</a>#i';

See it in action here, where you can tweak it to your purposes.
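As a quick check of the pattern above, the following sketch runs it over a made-up input string. The three links and their hosts are assumptions for illustration only.

```php
<?php
// Pattern from the answer above; "mydomain" stands in for the real host.
$pattern = '#<a [^>]*\bhref=([\'"])http.?://((?!mydomain)[^\'"])+\1 *>.*?</a>#i';

// Hypothetical sample input: one external link and two links to mydomain.
$html = '<a href="http://otherdomain.com/x">out</a> and '
      . "<a href='http://mydomain.com/page'>in</a> and "
      . '<a href="https://sub.mydomain.com/p">sub</a>';

// Only the external link is replaced; the other two are left alone.
$result = preg_replace($pattern, 'xxx', $html);
echo $result, PHP_EOL;
```

Because the look-ahead is tested at every character of the URL, any URL that contains the text mydomain anywhere (including in its path) is left in place, so this is a filter on the text rather than a real host check.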

mathematical.coffee
  • Thank you. It works perfectly. Could you please explain how you achieve matching of characters before and after mydomain? Like "sub.mydomain.com/page1"? – lvil Feb 12 '12 at 12:38
  • If you try changing one of the URLs to "sub.mydomain.com/page1" in the link to the interactive example I posted, you will see that that is also not matched. Or do you *want* to match 'sub.mydomain.com/page1' but not 'mydomain.com' ? – mathematical.coffee Feb 12 '12 at 12:40
  • You don't need to escape backslashes in PHP unless they come before another backslash or a quote. In single-quoted strings, only ``\\`` and ``\'`` are interpreted. So if your regular expression looks like `'/^\s*$/'`, PHP sends `/^\s*$/` to the regex engine, because that ``\`` does not precede ``\`` or `'`. The only character that is awkward in a regex is ``\`` itself: to match it you have to write ``\\\\``. PHP converts ``\\\\`` to ``\\``, and the regex engine then reads that as a literal ``\``. As for ``\'``, `'` is an ordinary literal character in a regex, so PHP can safely pass it to the regex engine without a backslash. – Konrad Borowski Feb 12 '12 at 13:00
  • Ahh thanks @GlitchMr -- I always get these confused (am used to python's `r"whateveryoulike"` format) – mathematical.coffee Feb 12 '12 at 23:17
  • Instead of "http.?://", wouldn't "[a-z]+://" be much better? There are more schemes than "http://" and "https://"; "ftp://", for example, is also an external request, and there may be others. – user706420 May 15 '18 at 09:00
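The escaping rules from the comments above can be checked directly. This is a small sketch; the sample path is made up.

```php
<?php
// In single-quoted strings PHP rewrites only \\ and \' ;
// every other backslash is kept as-is.
$a = '\s';    // two characters: backslash, s
$b = '\\\\';  // two characters: two backslashes
$c = '\'';    // one character: a single quote

echo strlen($a), ' ', strlen($b), ' ', strlen($c), PHP_EOL;

// To match one literal backslash the regex engine must receive \\,
// so the PHP source needs four backslashes.
// 'C:\temp' is single-quoted, so \t stays a backslash followed by t.
$hasBackslash = preg_match('/\\\\/', 'C:\temp');
echo $hasBackslash, PHP_EOL;
```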
2

Here's part of the code I'm using. It uses a callback function to rewrite the text matched by the regex. Good luck :)

class RedirectLinks {
    /**
     * Callback used by convertExternalLinksToInternal() on each URL found
     *
     * @param array $matches
     * @return string
     */
    public static function urlMatchCallback($matches)
    {
        if (stripos($matches[1], 'http://') === false ||
            stripos($matches[1], 'example.com') !== false
            ) {
            return $matches[0]; // do not modify
        }
        // hand the external URL to the JavaScript showmessage() handler instead of linking to it
        $sURL = $matches[1];
        return "href=\"#\" onclick=\"showmessage('$sURL');\"";
    }

    /**
     * Converts external links in text to internal ones
     *
     * @param string $str - text
     * @return string the processed text
     */
    public static function convertExternalLinksToInternal($str) {
        // convert external links to internal redirections
        $str = preg_replace_callback("/href=\"([^\"]*)\"/is", 'RedirectLinks::urlMatchCallback', $str);

        return $str;    
    }
}
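The same callback idea can be tried out with a standalone closure. This is only a sketch under assumptions: the allowed host example.com, the sample HTML, and the plain href="#" replacement are all illustrative, and it compares hosts via parse_url() rather than the substring test used in the class above.

```php
<?php
// Hypothetical allowed host; replace with your own domain.
$allowedHost = 'example.com';

$rewrite = function (array $m) use ($allowedHost): string {
    $host = parse_url($m[1], PHP_URL_HOST);
    // Relative or unparsable URLs have no host: leave them untouched.
    if ($host === null || $host === false) {
        return $m[0];
    }
    $host   = strtolower($host);
    $suffix = '.' . $allowedHost;
    // Keep links to the allowed host and to its subdomains.
    if ($host === $allowedHost || substr($host, -strlen($suffix)) === $suffix) {
        return $m[0];
    }
    // Everything else is treated as external and neutralised.
    return 'href="#"';
};

$html = '<a href="/about">a</a> '
      . '<a href="http://example.com/x">b</a> '
      . '<a href="https://other.test/y">c</a>';

$out = preg_replace_callback('/href="([^"]*)"/i', $rewrite, $html);
echo $out, PHP_EOL;
```

Doing the decision in PHP code keeps the regex trivial, at the cost of only handling double-quoted href attributes here.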
Collector
1

(Albeit that's no reason not to explain something.)

If you want to match 'anything but' then you usually want to use an assertion; a negative lookahead assertion in your case:

 (?!mydomain\.com).*?

This matches anything (.*?), provided the text at that position does not begin with the disallowed value named in the assertion in front of it.

Also take note that:

  • It should be [\"\'] and not [\'|\"]. The alternation sign | has no special meaning inside a character class; there it matches a literal |.
  • .* should usually be .*? to not match too broadly.
  • And [^>]* is the common idiom to match within tags.
  • You can use other delimiters #<a...*>#i in place of / to avoid escaping.
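Two of the notes above are easy to verify in isolation. The snippet below is a sketch with made-up input, not part of the original pattern.

```php
<?php
// Inside a character class, | is an ordinary character,
// so ['|"] also accepts a literal pipe.
$pipeMatches = preg_match('/^[\'|"]$/', '|');

// Hypothetical input with two separate links.
$html = '<a href="x">one</a> <b>bold</b> <a href="y">two</a>';

// Greedy .* runs on to the LAST </a>; lazy .*? stops at the first one.
preg_match('#<a [^>]*>.*</a>#', $html, $greedy);
preg_match('#<a [^>]*>.*?</a>#', $html, $lazy);

echo $pipeMatches, PHP_EOL;
echo $greedy[0], PHP_EOL;
echo $lazy[0], PHP_EOL;
```

With the greedy version a single replacement would swallow both links and everything between them, which is why .*? is the safer default here.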
mario
0

[] is the character-in-set operator. Your pattern would be a lot more understandable as

$pattern ='!<a\s.*?\shref\s*=\s*([\'"])https?://mydomain.*?\1.*?</a>!is';

Note:

  • I've separated the tokens with explicit whitespace matches (\s)
  • Swapped the regexp delimiter character to avoid the \/
  • Used a back-reference to match the quotes.
TerryE