
For example, <a href="http://msdn.microsoft.com/art029nr/">remove links to here but keep text</a> but <a href="http://herpyderp.com">leave all other links alone</a>

I've been trying to solve this using preg_replace. I've searched through here and found answers that solve pieces of the problem.

The answer at PHP: Remove all hyperlinks of specific domain from text removes links to a specific URL, but it removes the link text as well.

The site at http://php-opensource-help.blogspot.ie/2010/10/how-to-remove-hyperlink-from-string.html shows how to remove a hyperlink from a string, but I can't seem to modify the pattern so that it applies only to a specific website.

  • [Do not parse HTML with regexes.](http://stackoverflow.com/a/1732454/344643) Use [an XML parser](http://us2.php.net/manual/en/class.domdocument.php) instead. – Waleed Khan Feb 11 '13 at 00:13

1 Answer

$html = '...I can haz HTML?...';
$whitelist = array('herpyderp.com', 'google.com');

$dom = new DOMDocument();
$dom->loadHTML($html);

// copy the live NodeList into an array so that removing nodes
// below doesn't disturb the iteration
$links = iterator_to_array($dom->getElementsByTagName('a'));

foreach ($links as $link) {
  $host = parse_url($link->getAttribute('href'), PHP_URL_HOST);

  if ($host && !in_array($host, $whitelist)) {

    // create a text node with the contents of the blacklisted link
    $text = new DOMText($link->nodeValue);

    // insert it before the link
    $link->parentNode->insertBefore($text, $link);

    // and remove the link
    $link->parentNode->removeChild($link);
  }
}

// remove the doctype and the <html>/<body> wrappers added by the
// parser (this keeps only the first top-level node of the fragment)
$dom->removeChild($dom->firstChild);
$dom->replaceChild($dom->firstChild->firstChild->firstChild, $dom->firstChild);

$html = $dom->saveHTML();

For those scared to use DOMDocument instead of preg_replace for performance reasons: I did a quick test between this and the code linked in the question (the one that completely removes the links), and DOMDocument is only about 4 times slower.
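For reference, a minimal sketch of that kind of timing comparison. The sample input, iteration count, and the regex are my own illustrations, not taken from the answer; the regex stands in for the linked approach and drops the whole link, text included:

function stripLinksWithDom($html, array $whitelist) {
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // ignore parser warnings during the benchmark
    $dom->loadHTML($html);
    libxml_clear_errors();
    foreach (iterator_to_array($dom->getElementsByTagName('a')) as $link) {
        $host = parse_url($link->getAttribute('href'), PHP_URL_HOST);
        if ($host && !in_array($host, $whitelist)) {
            $link->parentNode->insertBefore(new DOMText($link->nodeValue), $link);
            $link->parentNode->removeChild($link);
        }
    }
    return $dom->saveHTML();
}

$html = str_repeat('<p><a href="http://msdn.microsoft.com/x">text</a></p>', 100);

$start = microtime(true);
for ($i = 0; $i < 100; $i++) {
    stripLinksWithDom($html, array('herpyderp.com'));
}
printf("DOMDocument:  %.3fs\n", microtime(true) - $start);

$start = microtime(true);
for ($i = 0; $i < 100; $i++) {
    // simplistic stand-in regex: removes links (and their text) to one host
    preg_replace('#<a\b[^>]*href="https?://msdn\.microsoft\.com[^"]*"[^>]*>.*?</a>#is', '', $html);
}
printf("preg_replace: %.3fs\n", microtime(true) - $start);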

  • Thanks very much. The URL being a subdomain seemed to be causing a problem, but I can get around this by entering just the first part. The only hyperlinks that aren't removed are URLs with a comma, which throws Warning: DOMDocument::loadHTML(): htmlParseEntityRef. Do you know of a way around this? Thanks again. – Danny Feb 11 '13 at 02:23
  • If the HTML is malformed, disable libxml errors (see [this answer](http://stackoverflow.com/a/7082487/1058140)). I only did a host check here; if you want to check against the entire URL, paths and so on, read the documentation page for `parse_url()`. – nice ass Feb 11 '13 at 02:39 (a combined sketch follows below)
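A hedged sketch pulling those two comments together: libxml_use_internal_errors() suppresses warnings like htmlParseEntityRef on malformed input, and the hostIsWhitelisted() helper (my own, hypothetical, not part of the answer) treats subdomains of a whitelisted domain as whitelisted:

$html = '<p><a href="http://sub.herpyderp.com/x">kept</a> <a href="http://msdn.microsoft.com/y">unlinked</a> stray & ampersand</p>';
$whitelist = array('herpyderp.com', 'google.com');

// hypothetical helper: a host matches if it equals a whitelisted
// domain or is a subdomain of one
function hostIsWhitelisted($host, array $whitelist) {
    foreach ($whitelist as $domain) {
        if ($host === $domain || substr($host, -strlen(".$domain")) === ".$domain") {
            return true;
        }
    }
    return false;
}

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // silence warnings such as htmlParseEntityRef
$dom->loadHTML($html);
libxml_clear_errors();

foreach (iterator_to_array($dom->getElementsByTagName('a')) as $link) {
    $host = parse_url($link->getAttribute('href'), PHP_URL_HOST);
    if ($host && !hostIsWhitelisted($host, $whitelist)) {
        $link->parentNode->insertBefore(new DOMText($link->nodeValue), $link);
        $link->parentNode->removeChild($link);
    }
}

echo $dom->saveHTML();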