I have some links like this:

<a href="http://illegallink.com"><img src="something.jpg" /><a href="http://legallink.com">legal</a></a>

I want to remove all links that do not have "legallink.com" in their href, but still keep their content. So the above input would output:

<img src="something.jpg" /><a href="http://legallink.com">legal</a>

It should work recursively through the links.

I found this regex that removes all links: /<\/?a(\s+.*?>|>)/, but I want it to keep links whose href is legallink.com.

Can this be done with regex? Or should I use a DOM parser?

hakre
Trolley

1 Answer

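// Show all errors while testing.
error_reporting(~0); ini_set('display_errors', 1);

$code = '<a href="http://illegallink.com"><img src="something.jpg" /><a href="http://legallink.com">legal</a></a>';

$document = new DOMDocument();
$document->loadHTML($code);
$parser = new DOMXPath($document);

foreach ($parser->query("//a") as $node)
{
  // Keep links whose href starts with http://legallink.com (dot escaped so it matches literally).
  if (!preg_match("/^http:\/\/legallink\.com/i", $node->getAttribute("href")))
  {
    // replaceChild() needs a DOMNode, not the nodeValue string, and nodeValue
    // would drop child elements such as <img>. Move the children out instead.
    while ($node->firstChild)
    {
      $node->parentNode->insertBefore($node->firstChild, $node);
    }
    // The link is empty now, so remove it.
    $node->parentNode->removeChild($node);
  }
}
// Prints the whole document, including the <html> and <body> wrappers added by loadHTML().
echo $document->saveXML();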
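
If you only want the processed fragment back (as in the question's expected output) rather than a full document, something along these lines might work. This is a minimal sketch, not a definitive implementation: the function name `unwrapDisallowedLinks`, the wrapper `<div>` and its `fragment-root` id are all made up for illustration, and it compares the href's host via `parse_url()` instead of using the regex above. Note that the HTML serializer may write tags slightly differently from the input (for example `<img>` without the trailing slash).

```php
<?php
// Hypothetical helper, not part of the original answer.
function unwrapDisallowedLinks(string $html, string $allowedHost): string
{
    $document = new DOMDocument();

    // Collect libxml parse warnings (nested <a> is invalid HTML) instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // Assumption: a wrapper <div> with a made-up id, used to find the fragment again.
    $document->loadHTML('<div id="fragment-root">' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($document);

    // Copy the matches into an array so the loop is unaffected by DOM changes.
    foreach (iterator_to_array($xpath->query('//a')) as $node) {
        $host = parse_url($node->getAttribute('href'), PHP_URL_HOST);
        if (is_string($host) && strcasecmp($host, $allowedHost) === 0) {
            continue; // allowed link, keep it as-is
        }
        // Move the children in front of the link, then drop the empty <a>.
        while ($node->firstChild) {
            $node->parentNode->insertBefore($node->firstChild, $node);
        }
        $node->parentNode->removeChild($node);
    }

    // Serialize only the children of the wrapper, not the whole document.
    $root = $xpath->query('//div[@id="fragment-root"]')->item(0);
    $result = '';
    foreach ($root->childNodes as $child) {
        $result .= $document->saveHTML($child);
    }
    return $result;
}

$input = '<a href="http://illegallink.com"><img src="something.jpg" /><a href="http://legallink.com">legal</a></a>';
echo unwrapDisallowedLinks($input, 'legallink.com'), "\n";
```

Comparing the host exactly also sidesteps the fact that "illegallink.com" contains "legallink.com" as a substring, so a plain "contains" check would keep both links.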
hakre
Ωmega
  • I'm not the downvoter, but I believe he wants to find nested links, not links with a specific href. He was just using the href as an example to say which link should be kept. – Jonathan Kuhn Apr 18 '12 at 22:37
  • @JonathanKuhn - I should not be downvoted for an unclear question from the OP. Besides that, nobody else posted alternative answers. – Ωmega Apr 18 '12 at 22:41
  • That's why I didn't downvote. The question needs some clarification. – Jonathan Kuhn Apr 18 '12 at 22:42
  • @Elias: Please see the updated code, run it and tell us which error message you get. – hakre May 13 '12 at 16:12