Strip entire html link (including text) with PHP

Question

I have a collection of text that I am processing dynamically with PHP (the data comes from an XML file). I want to strip each <a> link together with the text that is linked.

PHP's strip_tags takes out the <a etc...> and </a> but not the text in between.

I am currently trying to use the Regex preg_replace('#(<a.*?>).*?(</a>)#', '', $content);

Another thing to note is the links have styles, classes, href and titles.

Does anyone know the solution?

*(related)* [Best Methods to parse HTML](http://stackoverflow.com/questions/3577641/best-methods-to-parse-html/3577662#3577662) — Gordon, Dec 10 '10 at 16:18
For reference, you've grouped the anchor tags but not the content, which is where the problem lies. preg_replace replaces the grouped element (those included in parenthesis). You can try the following though: `#(<a[^>]*?>.*?</a>)#i` (i flag for a case insensitive compare) – Brad Christie, Dec 10 '10 at 16:26
briefly tested shorter regex version, just for fun :) `preg_replace ('/<(?:a|\/)[^>]*>/', '', $data);` — cyber-guard, Dec 10 '10 at 16:47

profitphp · Answer 1 · 2010-12-10T16:37:47.457

3

try this:

$content=preg_replace('/<a[^>]*>(.*)<\/a>/iU','',$content);
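As a rough sketch of how the one-liner above could be tightened: `<a[^>]*>` also matches tags such as `<abbr>`, and without the `s` flag `.` stops at newlines, so a link whose text spans several lines is left alone. The helper name `removeAnchors` and the sample string are made up for illustration, and the pattern is a variation on the answer's regex rather than the answer itself. Nested or malformed markup is still out of reach for a regex like this.

```php
<?php
// Hypothetical helper; the pattern is a variation on the answer's regex.
function removeAnchors(string $content): string
{
    // \b keeps "<a" from matching <abbr> or <article>.
    // The s flag lets "." cross newlines; ".*?" is lazy, so each link is matched on its own.
    $result = preg_replace('/<a\b[^>]*>.*?<\/a>/is', '', $content);

    // preg_replace returns null on a regex engine error; fall back to the input.
    return $result ?? $content;
}

$sample = "Keep <abbr>abbr</abbr> and <a href=\"/x\"\nclass=\"c\">multi\nline</a> text <A HREF='/y'>second</A>!";
echo removeAnchors($sample), "\n";
// prints: Keep <abbr>abbr</abbr> and  text !
```

The lazy `.*?` plays the same role as the `U` modifier in the original: each match ends at the first `</a>` it can reach.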

edited Dec 10 '10 at 16:37

answered Dec 10 '10 at 16:16

profitphp


Awesome! Now I see the reason for learning regular expressions well! And how do I strip tags but not the ones with " – Adam Arold Mar 22 '11 at 19:45

karim79 · Answer 2 · 2010-12-10T16:20:55.383

2

You can use DOMDocument, for example (untested!):

$doc = new DOMDocument();
$doc->loadHTMLFile('foo.php');
$domNodeList = $doc->getElementsByTagName('a');
// The node list is live: removing a node shrinks it, so walk it backwards.
for ($i = $domNodeList->length - 1; $i >= 0; $i--) {
    $node = $domNodeList->item($i);
    $node->parentNode->removeChild($node);
}
$doc->saveHTMLFile('output.html');

Or using Simple HTML DOM Parser:

$html = file_get_html('http://www.example.com/');
foreach($html->find('a') as $element) { 
   $element->outertext = '';
}
$html->save('output.html');
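Since the question's content arrives as strings rather than files, here is a minimal sketch of the DOMDocument approach applied to an in-memory fragment. The function name `stripLinks` and the wrapper `<div>` are assumptions made for this example; the wrapper only exists so the fragment's own markup can be read back out without the `<html>` and `<body>` elements the parser adds. Character-encoding handling is left out.

```php
<?php
// Hypothetical helper: removes every <a> element (tags and text) from an HTML fragment.
function stripLinks(string $html): string
{
    if ($html === '') {
        return '';
    }

    $doc = new DOMDocument();

    // Collect parser warnings about sloppy markup instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The node list is live, so remove from the end towards the start.
    $links = $doc->getElementsByTagName('a');
    for ($i = $links->length - 1; $i >= 0; $i--) {
        $node = $links->item($i);
        $node->parentNode->removeChild($node);
    }

    // The wrapper is the first <div> in document order; serialize its children only.
    $wrapper = $doc->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo stripLinks('<p>Hello <a href="/x" class="c" title="t">link text</a> world</p>'), "\n";
```

Attributes such as style, class, href, and title need no special treatment here, because the whole element node is removed.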

edited Dec 10 '10 at 16:20

answered Dec 10 '10 at 16:15

karim79


@Cyber-Guard Design - I don't think it is overly complicated. And it will certainly be more reliable than a regular expression. – karim79 Dec 10 '10 at 16:56

score 0 · Answer 3 · answered Dec 10 '10 at 16:09


Because the a element is not the only one that can break your page, you should use a whitelist approach, such as strip_tags().
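To illustrate the whitelist idea, here is a small sketch using the second argument of strip_tags(), which lists the tags allowed to stay. The sample markup is invented for the example. Note that this keeps the link text, so on its own it does not do what the question asks.

```php
<?php
$html = '<p>Hi <a href="/x">link</a> <iframe src="/y"></iframe><strong>bold</strong></p>';

// Second argument: the tags that are allowed to stay; everything else is stripped.
$clean = strip_tags($html, '<p><strong>');

echo $clean, "\n";
// prints: <p>Hi link <strong>bold</strong></p>
```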

answered Dec 10 '10 at 16:09

KingCrunch


Sorry really have no idea what you mean...? – Pez Cuckow Dec 10 '10 at 16:10
I don't know exactly what you want, but usually you should specify which tags are allowed rather than which are not. If you want to remove the tags for security reasons, think of _iframe_, _img_, or _link_ as well. – KingCrunch Dec 10 '10 at 16:14

score 0 · Accepted Answer · answered Dec 12 '10 at 12:13

I used the solutions posted as comments; they seemed to work best and were exactly what I was looking for!

"For reference, you've grouped the anchor tags but not the content, which is where the problem lies. preg_replace replaces the grouped element (those included in parenthesis). You can try the following though: #(<a[^>]*?>.*?</a>)#i (i flag for a case insensitive compare)" – Brad Christie

"briefly tested shorter regex version, just for fun :) preg_replace ('/<(?:a|\/)[^>]*>/', '', $data);" – Cyber-Guard Design yesterday
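A quick way to compare the two quoted patterns is to run both on the same input. The sample string below is made up. As far as preg_replace itself is concerned, the replacement is applied to the whole match; capture groups only matter when the replacement refers to them (for example `$1`). The second pattern matches opening `<a ...>` tags and every closing tag, so it leaves the link text in place and also removes `</p>`.

```php
<?php
$data = '<p>Before <a href="/x" class="c" title="t">link text</a> after</p>';

// First quoted pattern: the whole match (tags and text) is replaced.
$whole = preg_replace('#(<a[^>]*?>.*?</a>)#i', '', $data);

// Second quoted pattern: matches <a ...> and any closing tag, but not the text between.
$tagsOnly = preg_replace('/<(?:a|\/)[^>]*>/', '', $data);

echo $whole, "\n";
// prints: <p>Before  after</p>
echo $tagsOnly, "\n";
// prints: <p>Before link text after
```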

score -1 · Answer 5 · answered Dec 10 '10 at 16:14


With regex, but not thoroughly tested

echo preg_replace('#(<a.*?>)(.*?)(<\/a>)#','$2', $str);

Also, setting preg_replace's limit argument to -1 (the default) means there is no limit on the number of replacements.
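A short sketch of how the replacement string and the limit argument change the result, using the pattern from this answer on an invented sample string. Replacing with `$2` keeps the link text (much like strip_tags), while an empty replacement removes the link together with its text.

```php
<?php
$str = 'A <a href="/1">one</a> B <a href="/2">two</a> C';
$pattern = '#(<a.*?>)(.*?)(<\/a>)#';

// '$2' puts the second group (the link text) back in place of each match.
$kept = preg_replace($pattern, '$2', $str);

// An empty replacement drops the tags and the text.
$removed = preg_replace($pattern, '', $str);

// A limit of 1 stops after the first match; -1 (the default) means no limit.
$firstOnly = preg_replace($pattern, '', $str, 1);

echo $kept, "\n";      // prints: A one B two C
echo $removed, "\n";   // prints: A  B  C
echo $firstOnly, "\n"; // prints: A  B <a href="/2">two</a> C
```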

answered Dec 10 '10 at 16:14

John Giotta
