Search and replace a string of HTML using the PHP DOM Parser

Question

How can I search and replace a specific string (text + html tags) in a web page using the native PHP DOM Parser?

For example, search for

<p> <a href="google.com"> Check this site </a> </p>

This string is somewhere inside an HTML tree.

I would like to find it and replace it with another string. For example,

<span class="highligher"><p> <a href="google.com"> Check this site </a> </p></span>

Bear in mind that there is no ID to the <p> or <a> nodes. There can be many of those identical nodes, holding different pieces of text.

I tried str_replace, but it fails with complex HTML markup, so I have now turned to HTML parsers.

EDIT:

The string to be found and replaced might contain a variety of HTML tags, such as divs, headings, bold text, etc. So, I am looking for a solution that can construct a regex or DOM XPath query depending on the contents of the string being searched.

Thanks!

Aren't you better off using JavaScript and adding an id / class to `<p>`? – Audite Marlow, Nov 11 '15 at 10:48
I have no control over the HTML document being parsed, so I cannot add any attributes. I read about Simple HTML DOM, but people say it is inferior to the native PHP DOM Parser – user3857924, Nov 11 '15 at 10:53
`getElementsByTagName(..)`, then filter with `getAttribute(..)` on them? – Sumurai8, Nov 11 '15 at 10:56
This can return 20+ different `<p>` elements. How do you identify the right one and replace it? – user3857924, Nov 11 '15 at 11:06
Possible duplicate of [RegEx with preg_match to find and replace a SIMILAR string](http://stackoverflow.com/questions/33671497/regex-with-preg-match-to-find-and-replace-a-similar-string) – Madivad, Nov 12 '15 at 14:56

score 4 · Answer 1 · answered Nov 11 '15 at 13:35

Is this what you wanted:

<?php
// load
$doc = new DOMDocument();
$doc->loadHTMLFile("filename.html");

// search p elements
$p_elements = $doc->getElementsByTagName('p');

// process these elements, if available
if (!is_null($p_elements)) 
{
    foreach ($p_elements as $p_element) 
    {
        // get p element nodes
        $nodes = $p_element->childNodes;

        // check for "a" nodes in these nodes
        foreach ($nodes as $node) {

            // found an a node - check must be defined better!
            if(strtolower($node->nodeName) === 'a')
            {
                // create the new span element
                $span_element = $doc->createElement('span');
                $span_element->setAttribute('class', 'highlighter');

                // replace the "p" element with the span
                $p_element->parentNode->replaceChild($span_element, $p_element);
                // append the "p" element to the span
                $span_element->appendChild($p_element);
                // stop after the first "a" so the same "p" is not wrapped twice
                break;
            }
        }
    }
}

// output
echo '<pre>';
echo htmlentities($doc->saveHTML());
echo '</pre>';

This HTML is the basis for conversion:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head><title>Your Title Here</title></head><body bgcolor="FFFFFF">
<hr><a href="http://somegreatsite.com">Link Name</a>
is a link to another nifty site
<h1>This is a Header</h1>
<h2>This is a Medium Header</h2>
<p> <a href="amazon.com"> Check this site </a> </p>
Send me mail at <a href="mailto:support@yourcompany.com">
support@yourcompany.com</a>.
<p> This is a new paragraph!
</p><hr><p> <a href="google.com"> Check this site </a> </p>
</body></html>

The output looks like this; it wraps the elements you mentioned:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head><title>Your Title Here</title></head><body bgcolor="FFFFFF">
<hr><a href="http://somegreatsite.com">Link Name</a>
is a link to another nifty site
<h1>This is a Header</h1>
<h2>This is a Medium Header</h2>
<span class="highlighter"><p> <a href="amazon.com"> Check this site </a> </p></span>
Send me mail at <a href="mailto:support@yourcompany.com">
support@yourcompany.com</a>.
<p> This is a new paragraph!
</p><hr><span class="highlighter"><p> <a href="google.com"> Check this site </a> </p></span>
</body></html>

Thanks, however that will not be able to identify the exact node which contains the "check this site" text. It seems it will pick up the first `<p>` element containing an `<a>`. There might be 20 other strings that meet this criterion but have different text inside. Additionally, the HTML to be replaced is dynamic. It might contain DIVs, bolds, header tags, etc. – user3857924, Nov 11 '15 at 16:11
You can use `trim($node->textContent) === 'Check this site'` to check for specific content. What do you mean by "the html to be replaced is dynamic"? Can you give more examples? I thought you wanted to wrap a `<p>` element that has an `<a>` element inside with the text "check this site" in a `<span>` element. – Philipp Palmtag, Nov 12 '15 at 07:16
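
The `textContent` check suggested in the comment above could be combined with the wrapping code roughly as follows. This is only a minimal sketch: the helper name `wrapMatchingParagraphs()` and the sample HTML are made up for illustration, and it assumes the text to match sits directly inside an `<a>` that is a child of the `<p>`.

```php
<?php
// Hypothetical helper: wrap every <p> that has a child <a> whose trimmed
// text equals $text in <span class="highlighter">. Returns the number wrapped.
function wrapMatchingParagraphs(DOMDocument $doc, string $text): int
{
    $xpath = new DOMXPath($doc);

    // Collect the matching <p> elements first, then modify the tree.
    $matches = [];
    foreach ($xpath->query('//p[a]') as $p) {
        foreach ($p->childNodes as $child) {
            if ($child->nodeName === 'a' && trim($child->textContent) === $text) {
                $matches[] = $p;
                break; // one matching <a> per <p> is enough
            }
        }
    }

    foreach ($matches as $p) {
        $span = $doc->createElement('span');
        $span->setAttribute('class', 'highlighter');
        // Put the span where the <p> was, then move the <p> inside it.
        $p->parentNode->replaceChild($span, $p);
        $span->appendChild($p);
    }

    return count($matches);
}

// Made-up sample input: two similar paragraphs with different link text.
$html = '<html><body>'
      . '<p> <a href="amazon.com"> Other site </a> </p>'
      . '<p> <a href="google.com"> Check this site </a> </p>'
      . '</body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$wrapped = wrapMatchingParagraphs($doc, 'Check this site');
$result  = $doc->saveHTML();
echo $result;
```

Only the second paragraph is wrapped here, because the comparison is done in PHP on the trimmed link text rather than on the tag names alone.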

score 0 · Answer 2 · Richard Merchant · 2015-11-11T12:30:01.177

You could use a regular expression with preg_replace.

 preg_replace("/<\s*p[^>]*>(.*?)<\s*\/\s*p>/", '<span class="highligher"><p>$1</p></span>', '<p><a href="google.com"> Check this site</a></p>');

The fourth parameter of preg_replace ($limit) can be used to restrict the number of replacements.
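
As a small illustration of limiting replacements: in the `preg_replace` signature the limit is the optional fourth argument, after the subject string, and an optional fifth argument receives the number of replacements made. The input string and the simplified pattern below are made up for this sketch.

```php
<?php
// Made-up sample input with three paragraphs.
$html        = '<p>one</p><p>two</p><p>three</p>';
$pattern     = '/<p[^>]*>(.*?)<\/p>/s';
$replacement = '<span class="highlighter"><p>$1</p></span>';

// Fourth argument: replace at most 1 match.
// Fifth argument: $count is filled with the number of replacements done.
$limited = preg_replace($pattern, $replacement, $html, 1, $count);

echo $limited, "\n"; // prints <span class="highlighter"><p>one</p></span><p>two</p><p>three</p>
echo $count, "\n";   // prints 1
```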

See http://php.net/manual/en/function.preg-replace.php for the function reference, and http://www.pagecolumn.com/tool/all_about_html_tags.htm for more examples of regular expressions for HTML.

You will need to edit the regular expression so that it captures only the p tags with the google href.

EDIT

preg_replace("/<\s*\w.*?><a href\s*=\s*\"?\s*(.*)(google\.com)\s*\">(.*?)<\/a>\s*<\/\s*\w.*?>/", '<span class="highligher"><p><a href="$1$2">$3</a></p></span>', $string);

edited Nov 11 '15 at 12:30

answered Nov 11 '15 at 11:06

Richard Merchant


Thanks, it seems I will have to use regular expressions. However, the strings being searched and replaced can vary. It might be "Check this site" wrapped in different markup. So, I am looking for a more universal solution. Maybe a dynamic expression to handle all cases? – user3857924 Nov 11 '15 at 11:11
Also, does that mean that using the DOM parser for this task is not possible? It must be possible to load some HTML string and search for it in the already parsed file? – user3857924 Nov 11 '15 at 11:26
I'm not too familiar with the DOM parser, but I think it will be difficult if there's no class or id – Richard Merchant Nov 11 '15 at 12:27
If you down vote, can you at least leave a comment? It's pretty pointless down voting without an explanation, isn't it – Richard Merchant Nov 16 '15 at 17:54
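
On the question raised in the comments of loading an HTML string and searching for it in the already parsed file: one possible approach with the native DOM extension is to parse the search string into its own DOMDocument and compare a normalized "signature" of its root element with each same-named element in the page. The sketch below is not taken from either answer. The function names `signature()` and `wrapFragment()` and the sample markup are hypothetical, and it assumes the search string has a single root element and that whitespace and attribute order should be ignored.

```php
<?php
libxml_use_internal_errors(true); // keep parser warnings out of the output

// Hypothetical helper: build a comparable string from tag names, sorted
// attributes and whitespace-collapsed text, so formatting does not matter.
function signature(DOMNode $node): string
{
    if ($node->nodeType === XML_TEXT_NODE) {
        return trim(preg_replace('/\s+/', ' ', $node->nodeValue));
    }
    if ($node->nodeType !== XML_ELEMENT_NODE) {
        return ''; // ignore comments and other node types
    }
    $attrs = [];
    foreach ($node->attributes as $attr) {
        $attrs[] = $attr->nodeName . '=' . $attr->nodeValue;
    }
    sort($attrs);
    $children = '';
    foreach ($node->childNodes as $child) {
        $children .= signature($child);
    }
    return '<' . $node->nodeName . ' ' . implode(' ', $attrs) . '>'
         . $children . '</' . $node->nodeName . '>';
}

// Hypothetical helper: wrap every element equivalent to $needleHtml in a
// <span> with the given class. Returns the number of elements wrapped.
function wrapFragment(DOMDocument $doc, string $needleHtml, string $class): int
{
    $needleDoc = new DOMDocument();
    $needleDoc->loadHTML('<html><body>' . $needleHtml . '</body></html>');

    // Use the first element inside <body> as the root of the search string.
    $needle = null;
    foreach ($needleDoc->getElementsByTagName('body')->item(0)->childNodes as $child) {
        if ($child->nodeType === XML_ELEMENT_NODE) {
            $needle = $child;
            break;
        }
    }
    if ($needle === null) {
        return 0;
    }
    $wanted = signature($needle);

    // Copy the live node list into an array before changing the tree.
    $candidates = iterator_to_array($doc->getElementsByTagName($needle->nodeName));
    $count = 0;
    foreach ($candidates as $candidate) {
        if (signature($candidate) !== $wanted) {
            continue;
        }
        $span = $doc->createElement('span');
        $span->setAttribute('class', $class);
        $candidate->parentNode->replaceChild($span, $candidate);
        $span->appendChild($candidate);
        $count++;
    }
    return $count;
}

// Made-up page: same link text twice, but different href and spacing.
$doc = new DOMDocument();
$doc->loadHTML('<html><body>'
    . '<p> <a href="amazon.com"> Check this site </a> </p>'
    . '<p><a href="google.com">Check   this site</a></p>'
    . '</body></html>');

$wrapped = wrapFragment($doc, '<p> <a href="google.com"> Check this site </a> </p>', 'highlighter');
echo $wrapped, "\n"; // prints 1
```

Because the comparison runs on parsed nodes rather than raw text, differences in spacing between the search string and the page do not prevent a match, which is where a plain `str_replace` tends to fail. The trade-off is that any difference in tags, attributes or text makes the signatures unequal, so only structurally identical fragments are wrapped.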
