10

I need to remove anchor tags from some text, and can't seem to be able to do it using regex.
Just the anchor tags, not their content.
For instance, <a href="http://www.google.com/" target="_blank">google</a> would become google.

Lior

7 Answers

14

Exactly, it cannot be done properly using a regular expression.

Here is an example using DOM :

$xml = new DOMDocument(); 
$xml->loadHTML($html); 

$links = $xml->getElementsByTagName('a');

// Loop through the <a> tags (last to first) and replace each one with its text content
for ($i = $links->length - 1; $i >= 0; $i--) {
    $linkNode = $links->item($i);
    $lnkText = $linkNode->textContent;
    $newTxtNode = $xml->createTextNode($lnkText);
    $linkNode->parentNode->replaceChild($newTxtNode, $linkNode);
}

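It's important to loop backward here: `getElementsByTagName()` returns a live list, so replacing a node while iterating forward shifts the remaining items and some links get skipped.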
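For readers unsure how to get a string back out, here is a minimal sketch of the same idea wrapped in a function. It is a variation, not the exact code above: it moves each link's child nodes out of the `<a>` instead of replacing the link with its text content, so nested markup such as an `<img>` survives. The function name `unwrapAnchors` and the wrapper `<div>` are assumptions made for illustration, and character-encoding handling is left out.

```php
<?php
// Hypothetical helper: removes <a> tags from an HTML fragment but keeps their content.
function unwrapAnchors(string $html): string
{
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of emitting them for imperfect markup.
    $previous = libxml_use_internal_errors(true);
    // Wrap the fragment in one <div> so it can be found again after parsing.
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The node list is live, so walk it from the end.
    $links = $doc->getElementsByTagName('a');
    for ($i = $links->length - 1; $i >= 0; $i--) {
        $link = $links->item($i);
        // Move every child in front of the link, then drop the now-empty link.
        while ($link->firstChild !== null) {
            $link->parentNode->insertBefore($link->firstChild, $link);
        }
        $link->parentNode->removeChild($link);
    }

    // Serialize only the children of the wrapper <div>, not the implied <html>/<body>.
    $wrapper = $doc->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

// Usage: the paragraph is kept, the link text stays, the <a> tag is gone.
echo unwrapAnchors('<p>Visit <a href="http://www.google.com/" target="_blank">google</a> now</p>'), "\n";
```

Calling `$xml->saveHTML()` with no argument, as in the original snippet, returns the whole document including the `<html>` and `<body>` elements the parser adds; passing a node limits the output to that node.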

Yann Milin
  • nice answer but how do i use it?..not really clear on usage. do i just echo out $newTxtNode? or lnkText??? – jcobhams Sep 25 '13 at 02:20
  • @VyrenMedia Op asked how to replace links by their text content, so at the end of this loop, you have a `DOMDocument` object with no links. You can use `$xml->saveHTML();` to get the whole html result. $lnkText contains the current link text as string, and you might want to [trim](http://php.net/trim) it. – Yann Milin Sep 25 '13 at 14:09
  • thanks a lot for your reply @Yann-Milin I however found a regex solution for this problem. – jcobhams Sep 25 '13 at 16:18
  • See below for the regular expression, the statement "_it cannot be done properly using a regular expression_." seems not to be true. – LarS Jul 08 '15 at 14:17
  • What I was trying to say is that any regular expression solution for this is not a good solution. You obviously _can_ run a regular expression query against html text, but it doesn't mean that you _should_ :) interesting read on the subject : [here](http://blog.codinghorror.com/parsing-html-the-cthulhu-way/) and [here](http://stackoverflow.com/a/1732454/165969) – Yann Milin Jul 09 '15 at 18:06
11

Then you can try

preg_replace('/<\/?a[^>]*>/','',$Source);

I tried it online here on rubular
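As a quick check of how this pattern behaves, the sketch below runs it on a made-up sample string. Because `a[^>]*` matches any tag whose name starts with `a`, other tags such as `<abbr>` are stripped too. The second pattern, with a `\b` word boundary after the `a`, is an assumed tweak that is not part of the answer.

```php
<?php
// Hypothetical sample input containing an <abbr> tag and a link.
$source = '<abbr title="x">HTML</abbr> and <a href="http://www.google.com/">google</a>';

// Pattern from the answer: also strips <abbr>, because its name starts with "a".
$loose = preg_replace('/<\/?a[^>]*>/', '', $source);

// Assumed variant: \b requires the tag name to end right after the "a".
$strict = preg_replace('/<\/?a\b[^>]*>/i', '', $source);

echo $loose, "\n";  // HTML and google
echo $strict, "\n"; // <abbr title="x">HTML</abbr> and google
```

Even the stricter pattern can misfire on input such as an attribute value that contains `>`, so treat it as a sketch for simple, trusted markup.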

stema
6

This question has been answered already, but I thought I would add my solution to the mix. I like this better than the accepted solution because it's a bit more to the point.

$content = 
    preg_replace(array('"<a href(.*?)>"', '"</a>"'), array('',''), $content);
Vikdor
user1491929
6

You are looking for strip_tags().

<?php

// outputs 'google'
echo strip_tags('<a href="http://www.google.com/" target="_blank">google</a>');
user229044
Pekka
  • I need to maintain other tags, I only need to remove anchors. – Lior May 03 '11 at 13:35
  • @Lior ah, I see. `strip_tags` does indeed not do that. There is an implementation in the user contributed notes that may help you: http://php.net/manual/en/function.strip-tags.php#100054 – Pekka May 03 '11 at 13:36
  • @Pekka You can pass a second argument to `strip_tags()` that is a string of "allowable_tags": http://php.net/manual/en/function.strip-tags.php. – Jasper Sep 17 '12 at 16:40
  • @Jasper but that won't help here, will it? He would have to specify all tags that exist in `$allowable_tags` – Pekka Sep 17 '12 at 16:41
  • @Pekka It is unfortunate that you have to whitelist the tags to keep rather than being able to blacklist the tags you want to remove, but using some knowledge of what type of content is being parsed you can probably get that whitelist down to a small list. – Jasper Sep 17 '12 at 16:46
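The approach discussed in these comments could look like the sketch below: the second argument of `strip_tags()` lists the tags to keep, and everything else, including `<a>`, is removed. The sample markup and the chosen list of kept tags are assumptions for illustration; real content would need its own list.

```php
<?php
// Hypothetical sample input with a link and a few other tags.
$html = '<p>Hi <a href="http://www.google.com/">google</a><br><b>bold</b></p>';

// Keep <p>, <br> and <b>; any tag not listed here (such as <a>) is stripped.
$kept = strip_tags($html, '<p><br><b>');

echo $kept, "\n"; // <p>Hi google<br><b>bold</b></p>
```

The drawback raised above still applies: any tag missing from the list is removed along with the anchors.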
5

Using regex:

preg_replace('/<a[^>]+>([^<]+)<\/a>/i','\1',$html);

CSᵠ
0

Most of the regexes here did not work for me. Some remove the content inside the anchor (which is not what the OP asked for), and only part of it at that; others match any tag whose name begins with `a`, and so on.

This is what I created for my needs at work. We had an issue where passing HTML with anchor tags (carrying many data attributes and other attributes) to wkhtmltopdf would sometimes prevent the PDF from being generated, so I wanted to remove those tags while keeping the text.

Regex:

/<\/?a( [^>]*)?>/ig

In PHP you can do:

$text = "<a href='http://www.google.com/'>Google1</a><br>" .
        "<a>Google2</a><br>" .
        "<afaketag href='http://www.google.com'>Google2</afaketag><br>" .
        "<afaketag>Google4</afaketag><br>" . 
        "<a href='http://www.google.com'><img src='someimage.jpg'></a>";
echo preg_replace("/<\/?a( [^>]*)?>/i", "", $text);

Outputs:

Google1<br>Google2<br><afaketag href='http://www.google.com'>Google2</afaketag><br><afaketag>Google4</afaketag><br><img src='someimage.jpg'>
Chrysus
0

Have a try with:

$str = '<p>paragraph</p><a href="http://www.google.com/" target="_blank" title="<>">google -> foo</a><div>In the div</div>';
// first, extract anchor tag
preg_match("~<a .*?</a>~", $str, $match);
// then strip the HTML tags
echo strip_tags($match[0]),"\n";

output:

google -> foo
Toto