-1

How can I search and replace a specific string (text + html tags) in a web page using the native PHP DOM Parser?

For example, search for

<p> <a href="google.com"> Check this site </a> </p>

This string is somewhere inside an HTML tree.

I would like to find it and replace it with another string. For example,

<span class="highlighter"><p> <a href="google.com"> Check this site </a> </p></span>

Bear in mind that there is no ID to the <p> or <a> nodes. There can be many of those identical nodes, holding different pieces of text.

I tried str_replace, but it fails on complex HTML markup, so I have turned to HTML parsers.

EDIT:

The string to be found and replaced might contain a variety of HTML tags, such as divs, headings, bold text, etc. So I am looking for a solution that can construct a regex or a DOM XPath query depending on the contents of the string being searched for.

Thanks!

user3857924
  • Aren't you better off using JavaScript and adding an id / class to `<p>`? – Audite Marlow Nov 11 '15 at 10:48
  • Have you tried: http://simplehtmldom.sourceforge.net/ – skywalker Nov 11 '15 at 10:51
  • I have no control over the HTML document being parsed, so I cannot add any attributes. I read about Simple HTML DOM; however, people say it is inferior to the native PHP DOM Parser – user3857924 Nov 11 '15 at 10:53
  • `getElementsByTagName(..)`, then filter with `getAttribute(..)` on them? – Sumurai8 Nov 11 '15 at 10:56
  • This can return 20+ different elements; how do you identify the right one and replace it? – user3857924 Nov 11 '15 at 11:06
  • Possible duplicate of [RegEx with preg\_match to find and replace a SIMILAR string](http://stackoverflow.com/questions/33671497/regex-with-preg-match-to-find-and-replace-a-similar-string) – Madivad Nov 12 '15 at 14:56

2 Answers

4

Is this what you wanted?

<?php
// load
$doc = new DOMDocument();
$doc->loadHTMLFile("filename.html");

// search p elements
$p_elements = $doc->getElementsByTagName('p');

// snapshot the live node list, because the loop below modifies the tree
foreach (iterator_to_array($p_elements) as $p_element)
{
    // check the direct children of the p element for "a" nodes
    foreach ($p_element->childNodes as $node) {

        // found an a node - check must be defined better!
        if (strtolower($node->nodeName) === 'a')
        {
            // create the new span element
            $span_element = $doc->createElement('span');
            $span_element->setAttribute('class', 'highlighter');

            // put the span where the "p" element was
            $p_element->parentNode->replaceChild($span_element, $p_element);
            // move the "p" element into the span
            $span_element->appendChild($p_element);

            // wrap each "p" only once, even if it holds several links
            break;
        }
    }
}

// output
echo '<pre>';
echo htmlentities($doc->saveHTML());
echo '</pre>';

This HTML is the basis for conversion:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head><title>Your Title Here</title></head><body bgcolor="FFFFFF">
<hr><a href="http://somegreatsite.com">Link Name</a>
is a link to another nifty site
<h1>This is a Header</h1>
<h2>This is a Medium Header</h2>
<p> <a href="amazon.com"> Check this site </a> </p>
Send me mail at <a href="mailto:support@yourcompany.com">
support@yourcompany.com</a>.
<p> This is a new paragraph!
</p><hr><p> <a href="google.com"> Check this site </a> </p>
</body></html>

The output looks like this; it wraps the elements you mentioned:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head><title>Your Title Here</title></head><body bgcolor="FFFFFF">
<hr><a href="http://somegreatsite.com">Link Name</a>
is a link to another nifty site
<h1>This is a Header</h1>
<h2>This is a Medium Header</h2>
<span class="highlighter"><p> <a href="amazon.com"> Check this site </a> </p></span>
Send me mail at <a href="mailto:support@yourcompany.com">
support@yourcompany.com</a>.
<p> This is a new paragraph!
</p><hr><span class="highlighter"><p> <a href="google.com"> Check this site </a> </p></span>
</body></html>
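
The check on the `a` node can be made stricter by comparing each candidate element against the markup being searched for. The following is only a sketch under some assumptions: the search string is a single element that belongs in `<body>`, attributes appear in the same order on both sides, and the helper names `normalize_html` and `wrap_matching_elements` are made up for this example. It parses the search string with the same parser, serializes both sides with `saveHTML($node)`, and ignores whitespace differences around tags.

```php
<?php
// Hypothetical helper: make two serializations comparable by collapsing
// runs of whitespace and dropping whitespace directly around tags.
function normalize_html(string $html): string
{
    $html = preg_replace('/\s+/', ' ', $html);
    return trim(preg_replace('/\s*(<[^>]+>)\s*/', '$1', $html));
}

// Hypothetical helper: wrap every element of $doc whose markup equals $needle
// in <span class="$class"> and return the number of wrapped elements.
function wrap_matching_elements(DOMDocument $doc, string $needle, string $class): int
{
    // Parse the needle with the same parser so both sides serialize alike.
    $previous = libxml_use_internal_errors(true);
    $needleDoc = new DOMDocument();
    $needleDoc->loadHTML('<html><body>' . $needle . '</body></html>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Assumption: the needle is a single element; take the first one in <body>.
    $body = $needleDoc->getElementsByTagName('body')->item(0);
    $needleElement = null;
    foreach ($body ? $body->childNodes : [] as $child) {
        if ($child instanceof DOMElement) {
            $needleElement = $child;
            break;
        }
    }
    if ($needleElement === null) {
        return 0;
    }
    $wanted = normalize_html($needleDoc->saveHTML($needleElement));

    // Snapshot the live node list, because wrapping modifies the tree.
    $candidates = iterator_to_array($doc->getElementsByTagName($needleElement->nodeName));
    $wrapped = 0;
    foreach ($candidates as $candidate) {
        if (normalize_html($doc->saveHTML($candidate)) !== $wanted) {
            continue;
        }
        $span = $doc->createElement('span');
        $span->setAttribute('class', $class);
        $candidate->parentNode->replaceChild($span, $candidate);
        $span->appendChild($candidate);
        $wrapped++;
    }
    return $wrapped;
}
```

With the document loaded as above, a call such as `wrap_matching_elements($doc, '<p> <a href="google.com"> Check this site </a> </p>', 'highlighter')` would wrap only the paragraphs whose link points to google.com and leave the amazon.com one alone.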
Philipp Palmtag
0

You could use a regular expression with preg_replace.

 preg_replace("/<\s*p[^>]*>(.*?)<\s*\/\s*p>/", '<span class="highlighter"><p>$1</p></span>', '<p><a href="google.com"> Check this site</a></p>');

The fourth parameter of preg_replace ($limit) can be used to restrict the number of replacements.
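
As a small illustration of limiting replacements, the sketch below passes `1` as the `$limit` argument and also reads back the optional `$count` argument. The input string is made up for the example, and the pattern is simplified to plain `<p>` tags.

```php
<?php
$html = '<p>first</p><p>second</p>';

// Fourth argument ($limit): stop after one replacement.
// Fifth argument ($count): receives the number of replacements made.
$result = preg_replace(
    '/<p[^>]*>(.*?)<\/p>/',
    '<span class="highlighter"><p>$1</p></span>',
    $html,
    1,
    $count
);

echo $result;
// prints: <span class="highlighter"><p>first</p></span><p>second</p>
```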

http://php.net/manual/en/function.preg-replace.php

http://www.pagecolumn.com/tool/all_about_html_tags.htm - for more examples on regular expressions for HTML

You will need to edit the regular expression so that it only captures the p tags with the google href.

EDIT

preg_replace("/<\s*\w.*?>\s*<a href\s*=\s*\"?\s*(.*)(google\.com)\s*\">(.*?)<\/a>\s*<\/\s*\w.*?>/", '<span class="highlighter"><p><a href="$1$2">$3</a></p></span>', $string);
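
If the string to search for varies, the pattern could be generated from it rather than written by hand. The sketch below is one possible way to do that, not a general HTML matcher: the helper name `build_snippet_pattern` is invented for this example, and it assumes the markup in the page is textually the same as the search string apart from whitespace and letter case (same attribute order, same quoting). It escapes every piece with `preg_quote` and allows flexible whitespace between and inside the pieces.

```php
<?php
// Hypothetical helper: turn a literal HTML snippet into a regex that
// tolerates differing amounts of whitespace.
function build_snippet_pattern(string $snippet): string
{
    // Split the snippet into tags and the text between them.
    $parts = preg_split('/(<[^>]+>)/', $snippet, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

    $quoted = [];
    foreach ($parts as $part) {
        $part = trim($part);
        if ($part === '') {
            continue; // whitespace-only text between two tags
        }
        // Escape each word literally; any run of whitespace inside the piece must match \s+.
        $words = array_map(
            function ($word) {
                return preg_quote($word, '/');
            },
            preg_split('/\s+/', $part)
        );
        $quoted[] = implode('\s+', $words);
    }

    // Whitespace between tags and text is optional.
    return '/' . implode('\s*', $quoted) . '/i';
}

$pattern = build_snippet_pattern('<p> <a href="google.com"> Check this site </a> </p>');

// $0 is the whole match, so the original markup is kept inside the span.
$string = "<div><p>\n <a href=\"google.com\">Check this  site</a> </p><p> <a href=\"amazon.com\"> Check this site </a> </p></div>";
$replaced = preg_replace($pattern, '<span class="highlighter">$0</span>', $string, -1, $count);
```

Here only the google.com paragraph is wrapped. Because this works on text rather than on the parsed tree, it will miss markup that is equivalent but written differently.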
Richard Merchant
  • Thanks, it seems I will have to use regular expressions. However, the strings being searched and replaced can vary. It might be . So, I am looking for a more universal solution. Maybe a dynamic expression to handle all cases? – user3857924 Nov 11 '15 at 11:11
  • Also, does that mean that using the DOM parser for this task is not possible? It must be possible to load some HTML string and search for it in the already parsed file? – user3857924 Nov 11 '15 at 11:26
  • I'm not too familiar with the DOM parser, but I think it will be difficult if there's no class or id – Richard Merchant Nov 11 '15 at 12:27
  • If you downvote, can you at least leave a comment? It's pretty pointless downvoting without an explanation, isn't it? – Richard Merchant Nov 16 '15 at 17:54