Regex / DOMDocument - match and replace text not in a link

Question

I need to find and replace all text matches in a case insensitive way, unless the text is within an anchor tag - for example:

<p>Match this text and replace it</p>
<p>Don't <a href="/">match this text</a></p>
<p>We still need to match this text and replace it</p>

Searching for 'match this text' should replace only the first and last instances.

[Edit] As per Gordon's comment, it may be preferable to use DOMDocument in this instance. I'm not at all familiar with the DOMDocument extension, and would really appreciate some basic examples of this functionality.

Use DOM [as shown](http://stackoverflow.com/questions/4003031/how-to-replace-text-urls-and-exclude-urls-in-html-tags/4037753#4037753) here and adapt — Gordon, Oct 28 '10 at 16:11
What is your preferred behavior with nested tags within the anchor, like `<p>This is <a href="#">a link <span>with <strong>don't match this text</strong> content</span></a></p>`? — István Ujj-Mészáros, Nov 18 '10 at 08:49

score 18 · Accepted Answer · edited May 23 '17 at 12:10

Here is a UTF-8-safe solution, which works not only with properly formatted documents but also with document fragments.

`mb_convert_encoding` is needed because `loadHtml()` seems to have a bug with UTF-8 encoding (see the references below).

`mb_substr` trims the `<body>` tag from the output, so you get back your original content without any additional markup.

<?php
$html = '<p>Match this text and replace it</p>
<p>Don\'t <a href="/">match this text</a></p>
<p>We still need to match this text and replace itŐŰ</p>
<p>This is <a href="#">a link <span>with <strong>don\'t match this text</strong> content</span></a></p>';

$dom = new DOMDocument();
// loadXml needs properly formatted documents, so it's better to use loadHtml, but it needs a hack to properly handle UTF-8 encoding
$dom->loadHtml(mb_convert_encoding($html, 'HTML-ENTITIES', "UTF-8"));

$xpath = new DOMXPath($dom);

foreach($xpath->query('//text()[not(ancestor::a)]') as $node)
{
    $replaced = str_ireplace('match this text', 'MATCH', $node->wholeText);
    $newNode  = $dom->createDocumentFragment();
    $newNode->appendXML($replaced);
    $node->parentNode->replaceChild($newNode, $node);
}

// get only the body tag with its contents, then trim the body tag itself to get only the original content
echo mb_substr($dom->saveXML($xpath->query('//body')->item(0)), 6, -7, "UTF-8");

References:
1. find and replace keywords by hyperlinks in an html fragment, via php dom
2. Regex / DOMDocument - match and replace text not in a link
3. php problem with russian language
4. Why Does DOM Change Encoding?

I read dozens of answers on the subject, so I am sorry if I forgot somebody (please leave a comment and I will add yours as well).

Thanks to Gordon and stillstanding for commenting on my other answer.

answered Nov 17 '10 at 22:43 · István Ujj-Mészáros

+1 for giving DOM a try :) This doesn't consider inline elements inside the `<a>` element's text node though. An XPath of `//text()[not(ancestor::a)]` will only return `DOMText` nodes outside of an `<a>` tree. Actually, I think none of the answers so far take that into account. – Gordon Nov 17 '10 at 22:57
@Gordon Could you please provide a text string for this case? – István Ujj-Mészáros Nov 17 '10 at 23:04
@styu `This is a link with inline content` - When you iterate over the result of `//text` you will get all text nodes in the document. You only single out those with a direct parent `<a>` element, but not those with an `<a>` element above that. – Gordon Nov 17 '10 at 23:06
@Gordon I have edited my answer according to your suggestion. – István Ujj-Mészáros Nov 17 '10 at 23:56
@styu I like your solution but I've encountered a deal breaker of a problem - any ampersands (whether as `&`, `&amp;` or `&#38;`) cause the function to fail with 'xmlParseEntityRef: no name' ...any ideas on how to fix? Thanks! – BrynJ Jan 06 '11 at 16:39
@BrynJ I have no idea, but maybe [this](http://stackoverflow.com/questions/2261530/fix-malformed-xml-in-php-before-processing-using-domdocument-functions/2267283#2267283) answer and the comments helps. – István Ujj-Mészáros Jan 18 '11 at 19:19
@styu I was able to resolve this issue in the end by adding `$replaced = str_replace('&','&amp;',$replaced);` - this effectively replaced the ampersand with the XML entity – BrynJ Jan 27 '11 at 10:45
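
The ampersand failure discussed in these comments comes from `appendXML()`, which expects well-formed XML, while the text read from a node has already been decoded. The following is a minimal sketch of a variation on BrynJ's fix: it escapes the node text with `htmlspecialchars()` before replacing, so a decoded `<` is covered as well. The helper name `replaceOutsideLinks` and its signature are hypothetical, and the sketch assumes ASCII input, so it leaves out the UTF-8 workaround from the answer.

```php
<?php
// Sketch only: replaceOutsideLinks() is a hypothetical helper, not part of the answer.
function replaceOutsideLinks(string $html, string $search, string $replacementXml): string
{
    libxml_use_internal_errors(true);  // collect parser warnings instead of printing them
    $dom = new DOMDocument();
    $dom->loadHTML($html);             // assumption: ASCII input, no encoding workaround
    $xpath = new DOMXPath($dom);

    foreach ($xpath->query('//text()[not(ancestor::a)]') as $node) {
        // Re-escape &, < and > so the text is valid XML again before appendXML()
        $escaped  = htmlspecialchars($node->nodeValue, ENT_NOQUOTES, 'UTF-8');
        $replaced = str_ireplace($search, $replacementXml, $escaped);
        if ($replaced === $escaped) {
            continue;                  // nothing matched, leave this node untouched
        }
        $fragment = $dom->createDocumentFragment();
        $fragment->appendXML($replaced);
        $node->parentNode->replaceChild($fragment, $node);
    }

    // Serialize only the children of <body> to get the fragment back
    $out = '';
    foreach ($dom->getElementsByTagName('body')->item(0)->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo replaceOutsideLinks('<p>Fish & chips: match this text</p>', 'match this text', 'MATCH');
```

Because the text is escaped before the search runs, a search string that itself contains `&`, `<` or `>` would need the same escaping to keep matching.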

score 6 · Answer 2 · answered Nov 11 '10 at 16:20

Try this one:

$dom = new DOMDocument;
$dom->loadHTML($html_content);

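// Recursively apply a regex replacement to every text node, skipping the
// whole subtree of any element whose tag name is listed in $excludeParents.
function preg_replace_dom($regex, $replacement, DOMNode $dom, array $excludeParents = array()) {
  if (!$dom->hasChildNodes()) {
    return;
  }
  foreach ($dom->childNodes as $node) {
    if ($node instanceof DOMText) {
      $node->nodeValue = preg_replace($regex, $replacement, $node->nodeValue);
    } elseif (!in_array($node->nodeName, $excludeParents)) {
      // Only descend into elements that are not excluded, so text nested
      // deeper inside an <a> (e.g. <a><span>...</span></a>) is left alone too.
      preg_replace_dom($regex, $replacement, $node, $excludeParents);
    }
  }
}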

preg_replace_dom('/match this text/i', 'IT WORKS', $dom->documentElement, array('a'));

But this fails when the replacement contains a hyperlink: for example, if IT WORKS is wrapped in a link and the final output is echoed to the browser, the link is shown as raw text and is not clickable. — i need help, Jul 28 '22 at 01:29
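
The raw-text link described in this comment appears because assigning to `nodeValue` stores the replacement as plain text, so any markup in it is escaped on output. Here is a minimal sketch of one way around that, which builds a real `<a>` element and uses `DOMText::splitText()` to cut the match out of the text node. The helper name `linkifyTextNode` and the `/target` URL are hypothetical and not part of the answer.

```php
<?php
// Hypothetical helper: wraps each case-insensitive match in a real <a> element.
function linkifyTextNode(DOMText $node, string $search, string $href): void
{
    $length = mb_strlen($search, 'UTF-8');
    while (($pos = mb_stripos($node->nodeValue, $search, 0, 'UTF-8')) !== false) {
        $match = $node->splitText($pos);     // $match now starts at the hit
        $rest  = $match->splitText($length); // $rest is whatever follows the hit
        $link  = $node->ownerDocument->createElement('a');
        $link->setAttribute('href', $href);
        $match->parentNode->replaceChild($link, $match);
        $link->appendChild($match);          // keeps the original casing of the text
        $node = $rest;                       // keep searching after the new link
    }
}

$dom = new DOMDocument();
$dom->loadHTML('<p>Match this text now</p><p><a href="/">match this text</a></p>');
$xpath = new DOMXPath($dom);
// The XPath result is a snapshot, so text moved into the new links is not revisited
foreach ($xpath->query('//text()[not(ancestor::a)]') as $textNode) {
    linkifyTextNode($textNode, 'match this text', '/target');
}
echo $dom->saveHTML();
```

Since the link is created through the DOM API rather than parsed from a string, the matched text needs no escaping.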

bcosca · Answer 3 · 2010-11-17T09:19:12.283

This is the stackless non-recursive approach using pre-order traversal of the DOM tree.

  libxml_use_internal_errors(TRUE);
  $dom=new DOMDocument('1.0','UTF-8');

  $dom->substituteEntities=FALSE;
  $dom->recover=TRUE;
  $dom->strictErrorChecking=FALSE;

  $dom->loadHTMLFile($file);
  $root=$dom->documentElement;
  $node=$root;
  $flag=FALSE; // TRUE means the subtree of $node has already been visited
  for (;;) {
      if (!$flag) {
          if ($node->nodeType==XML_TEXT_NODE) {
              $node->nodeValue=preg_replace(
                  '/match this text/is',
                  $replacement, $node->nodeValue
              );
          }
          // Descend, but never into an <a>, so nested link text is skipped too
          if ($node->firstChild && $node->nodeName!='a') {
              $node=$node->firstChild;
              continue;
          }
      }
      if ($node->isSameNode($root)) break;
      if ($node->nextSibling) {
          $node=$node->nextSibling;
          $flag=FALSE; // a fresh node: visit it on the next pass
      } else {
          $node=$node->parentNode;
          $flag=TRUE;  // back at the parent: do not descend into it again
      }
  }
  echo $dom->saveHTML();

libxml_use_internal_errors(TRUE); and the 3 lines of code after $dom=new DOMDocument; should be able to handle any malformed HTML.

score 2 · Answer 4 · answered Nov 16 '10 at 00:45

$a='<p>Match this text and replace it</p>
<p>Don\'t <a href="/">match this text</a></p>
<p>We still need to match this text and replace it</p>';

echo preg_replace('~match this text(?![^<]*</a>)~i','replacement',$a);

The negative lookahead ensures the replacement happens only if the next tag is not a closing `</a>` tag. It works fine with your example, though it won't work if you happen to use other tags inside your links.
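
To make that limitation concrete, here is a small sketch that runs the same pattern against a plain link and against a link with a nested tag. The two input strings are made up for illustration.

```php
<?php
$pattern = '~match this text(?![^<]*</a>)~i';

// Plain link text: the lookahead finds "</a>" before any other "<", so nothing is replaced.
$plain = preg_replace($pattern, 'replacement', '<a href="/">match this text</a>');
echo $plain, "\n";   // <a href="/">match this text</a>

// Nested tag: the next "<" belongs to "</strong>", so the lookahead does not block the match.
$nested = preg_replace($pattern, 'replacement', '<a href="/"><strong>match this text</strong></a>');
echo $nested, "\n";  // <a href="/"><strong>replacement</strong></a>
```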

score 1 · Answer 5 · edited May 23 '17 at 10:29

You can use PHP Simple HTML DOM Parser. It is similar to DOMDocument, but in my opinion it's simpler to use. Here is the alternative in parallel with Netcoder's DomDocument solution:

function replaceWithSimpleHtmlDom($html_content, $search, $replace, $excludedParents = array()) {
    require_once('simple_html_dom.php');
    $html = str_get_html($html_content);
    foreach ($html->find('text') as $element) {
        if (!in_array($element->parent()->tag, $excludedParents))
            $element->innertext = str_ireplace($search, $replace, $element->innertext);
    }
    return (string)$html;
}

I have just profiled this code against my DomDocument solution (which prints the exact same output), and DomDocument is, not surprisingly, much faster (~4ms against ~77ms).

answered Nov 16 '10 at 08:06 · István Ujj-Mészáros

Suggested third party alternatives to [SimpleHtmlDom](http://simplehtmldom.sourceforge.net/) that actually use [DOM](http://php.net/manual/en/book.dom.php) instead of String Parsing: [phpQuery](http://code.google.com/p/phpquery/), [Zend_Dom](http://framework.zend.com/manual/en/zend.dom.html), [QueryPath](http://querypath.org/) and [FluentDom](http://www.fluentdom.org). – Gordon Nov 16 '10 at 10:58
@Gordon: I think all of them build the DOM by parsing strings (including DOMDocument). The question is how they do it (do they mess up the document with unwanted entities, for example, or do they just do their work). And speed is not a real issue here, because you only want to process the document when it gets modified. Anyway, thanks for the suggestions, I will investigate them further. – István Ujj-Mészáros Nov 16 '10 at 13:06
@styu all of these are based on DOM and DOM uses libxml. – Gordon Nov 16 '10 at 13:22
@Gordon Maybe there is a bug in libxml, but if all of them use DOM, then all of them have the same issues (they are just different wrappers for the same library). phpQuery and Zend_Dom work fine without the DocType declaration, but none of them can handle UTF-8 encoding. They transform ÁÍŰŐ into ÃÃÅ°Å. If you know a proper solution with DOM, please describe it, and I will happily use it. – István Ujj-Mészáros Nov 16 '10 at 18:08
@styu DOM works fine with UTF-8 and does not transform anything unless you tell it to. If you need help using DOM, feel free to make it into a question and I might be inclined to answer it. [Some of my many previous answers on DOM usage might help you too](http://stackoverflow.com/search?q=user%3A208809+dom), as might [Best methods to parse HTML](http://stackoverflow.com/questions/3577641/best-methods-to-parse-html) – Gordon Nov 16 '10 at 18:18
@styu: see how my solution handles utf-8 – bcosca Nov 17 '10 at 09:24
@stillstanding: Your rev5 version works with [this](http://pastie.org/1305199) HTML code, but rev6 fails with a fatal error: Maximum execution time of 30 seconds exceeded. Is it possible to load only [this](http://pastie.org/1305222) part of the HTML, and save it without the full DOM tree? Simple HTML DOM does this without any further configuration (but I am still interested in the DOMDocument solution). – István Ujj-Mészáros Nov 17 '10 at 11:12
There's DOMDocumentFragment for handling partial HTML/XML documents: http://php.net/manual/en/class.domdocumentfragment.php – bcosca Nov 17 '10 at 12:04
@Gordon, @stillstanding: I have just posted another answer with DomDocument, based on my experience. Thanks for your comments. – István Ujj-Mészáros Nov 17 '10 at 22:49
@Gordon Please review my other, [DomDocument related answer](http://stackoverflow.com/questions/2735291/php-domdocument-class-unable-access-domnode/4230447#4230447) to a quite old question, where I compared two solutions, one with DomDocument and the same with Simple Html DOM Parser. – István Ujj-Mészáros Nov 20 '10 at 00:13

MnomrAKostelAni · Answer 6 · 2010-11-11T12:57:47.577

<?php
$a = '<p>Match this text and replace it</p>
<p>Don\'t <a href="/">match this text</a></p>
<p>We still need to match this text and replace it</p>
';
$res = preg_replace("#[^<a.*>]match this text#",'replacement',$a);
echo $res;
?>

This works for the example. I assume you really want a case-sensitive match, so it only matches the lower-case text.

answered Nov 11 '10 at 10:28 · edited Nov 11 '10 at 12:57

I'm sorry, but this is not going to work in many cases. Right now, you're looking for "match this text", preceded by any character except `<`, `.`, `*` or `>`... – Tim Pietzcker Nov 11 '10 at 11:07
This code really isn't going to do the job. There are a dozen scenarios where this would fail to do its job. – Caleb Nov 13 '10 at 09:52

Nathan MacInnes · Answer 7 · 2010-11-16T09:20:36.937

HTML parsing with regexes is a huge challenge, and they can very easily end up getting too complex and taking up loads of memory. I would say the best way is to do this:

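// $html holds the markup to process. Replace everywhere first, then put the original text back inside links.
$html = preg_replace('/match this text/i', 'replacement text', $html);
$html = preg_replace('/(<a[^>]*>[^<]*)replacement text(.*?<\/a)/is', '${1}match this text$2', $html);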

If your replacement text is something which might occur otherwise, you might want to add an intermediate step with some unique identifier.
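
The intermediate step could look roughly like the sketch below. The marker `@@MATCH_TOKEN@@` and the sample markup are hypothetical, and the sketch assumes the marker never occurs in the real input. Like the two-step version, it restores the lower-case search text and does not handle tags nested inside links.

```php
<?php
$html = '<p>Match this text</p>'
      . '<p><a href="/">match this text</a></p>'
      . '<p><a href="/x">replacement text</a></p>';

$token = '@@MATCH_TOKEN@@'; // hypothetical marker, assumed never to occur in the input

// 1. Mark every occurrence with the token instead of the final replacement.
$marked = preg_replace('/match this text/i', $token, $html);

// 2. Put the original text back wherever the token sits directly inside a link.
//    Each pass restores one token per link, so repeat until nothing changes.
$restore = '/(<a\b[^>]*>[^<]*)' . preg_quote($token, '/') . '([^<]*<\/a>)/i';
do {
    $marked = preg_replace($restore, '${1}match this text$2', $marked, -1, $count);
} while ($count > 0);

// 3. Swap the remaining tokens for the real replacement.
$result = str_replace($token, 'replacement text', $marked);
echo $result;
```

The link that already contained "replacement text" is left alone, which is the case the plain two-step version would get wrong.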

answered Nov 11 '10 at 10:43 · edited Nov 16 '10 at 09:20

Huge challenge is a nice way of putting it :) – Tim Pietzcker Nov 11 '10 at 11:12
Bit of an understatement, eh? :) For some things, it's pretty much impossible. This little task is just about manageable though. – Nathan MacInnes Nov 11 '10 at 11:16
Nice try, the "replace back" does avoid several potential pitfalls of this operation, but I think your solution will still fail on nested tags, tags that span multiple lines, and several other scenarios. The only way to do this right is going to be using something that actually parses the DOM. – Caleb Nov 13 '10 at 09:54
@Caleb - agreed. (Although I've added the s modifier to make it work for tags over multiple lines.) I figured it's not all that common to nest tags inside tags. It depends how robust the OP needs it to be based on where it's used. – Nathan MacInnes Nov 16 '10 at 09:23
