2

I have a collection of text that I am processing dynamically with PHP (the data comes from an XML file). However, I want to strip each link entirely: both the a tag and the text it wraps.

PHP's strip_tags takes out the <a etc...> and </a> but not the text in between.

I am currently trying to use the Regex preg_replace('#(<a.*?>).*?(</a>)#', '', $content);

Another thing to note is the links have styles, classes, href and titles.

Does anyone know the solution?

Pez Cuckow
  • *(related)* [Best Methods to parse HTML](http://stackoverflow.com/questions/3577641/best-methods-to-parse-html/3577662#3577662) – Gordon Dec 10 '10 at 16:18
  • 1
    For reference, you've grouped the anchor tags but not the content, which is where the problem lies. preg_replace replaces the grouped element (those included in parenthesis). You can try the following though: `#(<a[^>]*?>.*?</a>)#i` (i flag for a case insensitive compare) – Brad Christie Dec 10 '10 at 16:26
  • 1
    briefly tested shorter regex version, just for fun :) `preg_replace ('/<(?:a|\/)[^>]*>/', '', $data);` – cyber-guard Dec 10 '10 at 16:47

5 Answers

3

try this:

$content=preg_replace('/<a[^>]*>(.*)<\/a>/iU','',$content);
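A minimal sketch of that pattern applied to a sample string, with two tweaks that are assumptions rather than part of the answer: `\b` after `a` so tags such as `<abbr>` are left alone, and the `s` flag so link text spanning several lines still matches. The sample input and the helper name `stripLinks` are hypothetical.

```php
<?php
// Hypothetical helper: removes each <a ...>...</a> pair together with its text.
function stripLinks(string $content): string
{
    // \b  : "a" must be the whole tag name, so <abbr> or <article> is not matched
    // .*? : lazy, stops at the first closing </a>
    // i   : case-insensitive, s : "." also matches newlines
    // preg_replace returns null on a PCRE error; fall back to the input then.
    return preg_replace('#<a\b[^>]*>.*?</a>#is', '', $content) ?? $content;
}

$content = 'See <a href="/docs" class="ext" style="color:red" title="Docs">the docs</a> and <abbr title="x">HTML</abbr>.';
echo stripLinks($content), "\n";
```

Attributes such as style, class, href and title are covered by `[^>]*`. Nested or malformed markup is where a regex like this starts to break down.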
profitphp
  • Awesome! Now I see the reason for learning regular expressions well! And how do I strip tags but not the ones with " – Adam Arold Mar 22 '11 at 19:45
2

You can use DOMDocument, for example (untested!):

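$doc = new DOMDocument();
$doc->loadHTMLFile('foo.php');
$anchors = $doc->getElementsByTagName('a');
// The node list is live: removing a node shrinks it, so a forward loop
// would skip every other link. Walk it from the end instead.
for ($i = $anchors->length - 1; $i >= 0; $i--) {
    $anchor = $anchors->item($i);
    $anchor->parentNode->removeChild($anchor);
}
$doc->saveHTMLFile('output.html');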
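Since the question's content arrives as a string rather than a file, here is a rough sketch of the same DOMDocument idea applied to an HTML fragment. The helper name `removeLinks` and the wrapper `div` are assumptions made for illustration, and character-encoding handling is deliberately left out.

```php
<?php
// Hypothetical helper: removes every <a> element (tag and text) from an HTML fragment.
function removeLinks(string $html): string
{
    $doc = new DOMDocument();

    // Collect parser warnings about sloppy markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // Wrap the fragment in a known container so it can be extracted again later.
    $doc->loadHTML('<div id="wrapper">' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The node list is live, so remove from the end to avoid skipping nodes.
    $anchors = $doc->getElementsByTagName('a');
    for ($i = $anchors->length - 1; $i >= 0; $i--) {
        $anchor = $anchors->item($i);
        $anchor->parentNode->removeChild($anchor);
    }

    // The wrapper is the first <div> in document order; serialise its children only.
    $wrapper = $doc->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo removeLinks('<p>Hello <a href="x" class="c" title="t">click</a> world</p>'), "\n";
```

Wrapping the fragment avoids relying on how the parser handles input without a single root element.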

Or using Simple HTML DOM Parser:

$html = file_get_html('http://www.example.com/');
foreach($html->find('a') as $element) { 
   $element->outertext = '';
}
$html->save('output.html');
karim79
  • @Cyber-Guard Design - I don't think it is overly complicated. And it will certainly be more reliable than a regular expression. – karim79 Dec 10 '10 at 16:56
0

Because the a element is not the only one that can break your page, you should use a whitelist approach, like strip_tags().

KingCrunch
  • 1
    Sorry really have no idea what you mean...? – Pez Cuckow Dec 10 '10 at 16:10
  • I don't know exactly what you want, but usually you should specify which tags are allowed, not which are disallowed. If you want to remove the tags because of security issues, think of _iframe_, _img_, or _link_. – KingCrunch Dec 10 '10 at 16:14
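As a small illustration of the whitelist idea, strip_tags accepts a second argument listing the tags to keep. The sample markup below is hypothetical. Note that this does not solve the original question on its own, because strip_tags removes tags but leaves their text.

```php
<?php
// Sample input (hypothetical) mixing harmless and risky tags.
$html = '<p>Hi <a href="#">link</a> <iframe src="x"></iframe><b>bold</b></p>';

// Only <p> and <b> survive; every other tag is removed.
// strip_tags drops the tags only, so the link text "link" stays.
$clean = strip_tags($html, '<p><b>');
echo $clean, "\n";
```

Attributes on the allowed tags are left untouched by strip_tags, so a whitelist of tag names alone is not a complete security filter.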
0

I used the solution(s) posted as comments, they seemed to work best and were exactly what I was looking for!

"For reference, you've grouped the anchor tags but not the content, which is where the problem lies. preg_replace replaces the grouped element (those included in parenthesis). You can try the following though: #(<a[^>]*?>.*?</a>)#i (i flag for a case insensitive compare)" – Brad Christie

"briefly tested shorter regex version, just for fun :) preg_replace ('/<(?:a|\/)[^>]*>/', '', $data);" – Cyber-Guard Design yesterday

Pez Cuckow
-1

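With regex, but not thoroughly tested. Note that replacing with $2 keeps the link text and strips only the tags; use an empty replacement to drop the text too: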

echo preg_replace('#(<a.*?>)(.*?)(<\/a>)#','$2', $str);

Also, preg_replace's optional fourth argument, limit, defaults to -1, which means no limit on the number of replacements.
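To make the difference concrete, here is a short sketch comparing a backreference replacement, an empty replacement, and an explicit limit. The sample string and the slightly tightened pattern (one capture group, `\b` after `a`) are assumptions for illustration, not the pattern from the answer.

```php
<?php
$str = '<a href="a">one</a> <a href="b">two</a>';

// One capture group around the link text; \b keeps <abbr> and similar tags out.
$pattern = '#<a\b[^>]*>(.*?)</a>#is';

// Backreference: tags removed, text kept.
$unwrapped = preg_replace($pattern, '$1', $str);

// Empty replacement: tags and text removed.
$removed = preg_replace($pattern, '', $str);

// Limit of 1: only the first link is processed (the default -1 means no limit).
$firstOnly = preg_replace($pattern, '$1', $str, 1);

var_dump($unwrapped, $removed, $firstOnly);
```

Here `$unwrapped` is `one two`, `$removed` is a single space, and `$firstOnly` is `one <a href="b">two</a>`.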

John Giotta