I'm trying to remove links to one specific domain (example.com here) from the content using preg_replace, while keeping links to other domains.

Some text here and here and then this <a href="http://www.example.com/s/product/B0057WCGJ4/ref=as_li_tl?ie=UTF8&camp=1789&creative=390957&creativeA=B0057WCGJ4&linkCode=as2&id=12&linkId=NZMCDXCODEBYMK3H">Example Anchor Text</a> and so on now the next html <a href="other.com">Other domain to be retained</a>.

I want to remove the HTML tags and the URL for one specific domain, in this case example.com including its subdomains, but retain links to other domains.

I tried to achieve this with the following code. The problem is that it doesn't seem to work when the URL contains special characters. Without special characters it works well.

Here is the code I tried:

$txt=preg_replace('/<a href=\"(.*?)example(.*?)\">(.*?)<\/a>/', "\\3", $txt);

It works fine for simple URLs, but not for URLs with special characters like #. Any help is welcome. I'm new to this and still learning, but I'm not going to give up.

Appreciate your help.

chris85
Karun
  • Possible duplicate of [using preg\_replace to remove html tag](http://stackoverflow.com/questions/20208582/using-preg-replace-to-remove-html-tag) –  Dec 29 '15 at 18:20
  • You could use the domdocument pretty easily, and much more reliably. – chris85 Dec 29 '15 at 18:38
  • If you just want to remove the HTML, use strip_tags. Or do you just want to remove example.com? – devpro Dec 29 '15 at 19:03
  • Hi robot joe, how do you find duplicate questions? Is there a tool, or do you search Google to find the answer to this question? – Ypages Onine Dec 29 '15 at 19:54

1 Answer

You should consider using DOMDocument for manipulating HTML, because regular expressions will become incredibly complex if they have to deal with every possible situation.

Still, I will provide here a regular expression that improves on the following points:

  • allows any white space to occur at several positions in the anchor element.
  • allows other attributes to appear in the anchor, before and after href
  • allows upper/lower case (using the i modifier)
  • allows the url to be wrapped in single quotes instead of double
  • does not count it as a match when "example" occurs after a ? or # in a URL
  • allows line feeds in the anchor text (using the s modifier)
  • requires "example" to be surrounded by dots.

Here it is:

$txt = preg_replace(
'/<a\s+(?:[^>]+\s+)*?href\s*=\s*["\'][^"\'#?]*?\.example\..*?[\"\']\s*>(.*?)<\/a\s*>/si',
"\\1", $txt);
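As a quick check, the pattern can be run against input shaped like the question's sample. This is only a sketch: the shortened URL, the trailing #top fragment and the variable names are assumptions made for illustration, not taken from the original post.

```php
<?php
// Sample input adapted from the question (URL shortened; the #top fragment is assumed).
$txt = 'Some text <a href="http://www.example.com/s/product/B0057WCGJ4/ref=as_li_tl?ie=UTF8&camp=1789#top">Example Anchor Text</a>'
     . ' and <a href="other.com">Other domain to be retained</a>.';

// Same pattern as in the answer, kept in a variable for readability.
$pattern = '/<a\s+(?:[^>]+\s+)*?href\s*=\s*["\'][^"\'#?]*?\.example\..*?[\"\']\s*>(.*?)<\/a\s*>/si';

// Replace each matching anchor with its anchor text (capture group 1).
$result = preg_replace($pattern, '$1', $txt);

echo $result, PHP_EOL;
```

The example.com anchor should be reduced to its anchor text, while the other.com anchor stays untouched because its URL never contains ".example." before a quote, ? or #.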

But it has limits. For instance, it fails if a URL for some reason contains a quote.

DOMDocument solution

Here is how to properly do such things. The code is longer, but will give more reliable results:

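// function to remove links when URL address has given pattern
function DOMremoveLinks($dom, $url_pattern) {
    // copy the live node list first: replacing nodes while iterating
    // over it directly can cause some links to be skipped
    foreach (iterator_to_array($dom->getElementsByTagName('a')) as $a) {
        $href = $a->getAttribute('href');
        if (preg_match($url_pattern, $href)) {
            // add the text as a text node so characters like & are escaped
            $span = $dom->createElement("span");
            $span->appendChild($dom->createTextNode($a->textContent));
            $a->parentNode->replaceChild($span, $a);
        }
    }
}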

// utility function to get innerHTML of an element
function DOMinnerHTML($element) { 
    $innerHTML = ""; 
    foreach ($element->childNodes as $child) { 
        $innerHTML .= $element->ownerDocument->saveHTML($child);
    }
    return $innerHTML; 
}

// test data
$html = '<a name="this" 
  href = \'http://www.example.com/s/product/B00GJ4/ref=as_li_tl?ie=UTF8\'
        target="_blank" >
Hello</a>
<a href="fdf.abc.com?example=fsdf">World</a>';


// create DOM for given HTML
$dom = new DOMDocument();
// ignore warnings
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_use_internal_errors(false);

// call our function to make the replacement(s)
DOMremoveLinks($dom, "/^[^#?]*\.example\./");

// convert back to HTML
$html = DOMinnerHTML($dom->getElementsByTagName('body')->item(0));
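The function above replaces each matching link with a span holding only its text. If the goal is to drop the tag entirely and keep any markup nested inside the link, one possible variant is to unwrap the link instead. The sketch below assumes a hypothetical helper named DOMunwrapLinks and made-up test data; neither is part of the original answer.

```php
<?php
// Hypothetical helper (not from the answer): unwrap matching links instead of
// replacing them with a <span>, so nested markup inside the link survives.
function DOMunwrapLinks(DOMDocument $dom, string $url_pattern): void {
    // Snapshot the live node list before modifying the tree.
    foreach (iterator_to_array($dom->getElementsByTagName('a')) as $a) {
        if (!preg_match($url_pattern, $a->getAttribute('href'))) {
            continue;
        }
        // Move every child of the link in front of it, then drop the empty link.
        while ($a->firstChild !== null) {
            $a->parentNode->insertBefore($a->firstChild, $a);
        }
        $a->parentNode->removeChild($a);
    }
}

// Assumed test data: one example.com link with nested markup, one other link.
$html = '<p><a href="http://www.example.com/s/?ie=UTF8#x"><b>Hello</b></a> '
      . '<a href="fdf.abc.com?example=fsdf">World</a></p>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // ignore parser warnings
$dom->loadHTML($html);
libxml_use_internal_errors(false);

DOMunwrapLinks($dom, '/^[^#?]*\.example\./');

// Serialize the children of <body> back to an HTML string.
$body = $dom->getElementsByTagName('body')->item(0);
$out = '';
foreach ($body->childNodes as $child) {
    $out .= $dom->saveHTML($child);
}
echo $out, PHP_EOL;
```

Here the bold "Hello" should remain without its surrounding link, and the second link is kept because "example" only appears after the ? in its URL.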
trincot