
I am trying to replace every character (including newlines, tabs, and other whitespace) between nodes that have the same tag name. The problem is that the regex treats the separate nodes as a single match, running from the first opening tag to the last closing tag, and so outputs a single result.

For example:

$html_string = "


<div> Below are object Node with the html code </div>

<script> alert('i want this to be replaced. it has no newline'); </script>

<div> I don't want this to be replaced </div>

<script> 
    console.log('i also want this to be replaced. It has newline'); 
</script>

<div> This is a div tag and not a script, so it should not be replaced </div>

<script> console.warn('Finally, this should be replaced, it also has newline'); 
</script>

<div> The above is the final result of the replacements </div> ";


$regex = '/(?:\<script\>)(.*)?(?:\<\/script\>)/ims';
$result = preg_replace($regex, '<!-- THIS SCRIPT CONTENT HERE HAS BEEN ALTERED -->', $html_string);
echo $result;

Expected Result:

<div> Below are object Node with the html code </div>

<!-- THIS SCRIPT CONTENT HERE HAS BEEN ALTERED -->

<div> I don't want this to be replaced </div>

<!-- THIS SCRIPT CONTENT HERE HAS BEEN ALTERED -->

<div> This is a div tag and not a script, so it should not be replaced </div>

<!-- THIS SCRIPT CONTENT HERE HAS BEEN ALTERED -->

<div> The above is the final result of the replacements </div>

Actual Output:

<div> Below are object Node with the html code </div>

<!-- THIS SCRIPT CONTENT HERE HAS BEEN ALTERED -->

<div> The above is the final result of the replacements </div>

How can I sort this out? Thanks in advance.

Uchenna Ajah
  • Thanks for the downvote. Now can I have an answer? – Uchenna Ajah Apr 06 '19 at 11:07
  • While I didn't downvote it, regex is a poor choice for HTML. HTML is a hierarchical language (nested tags), which regex does not handle well. Consider using DOM (PHP core), PHPQuery (a third-party library), or another DOM parser. – ArtisticPhoenix Apr 06 '19 at 11:11
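
The nesting problem raised in the comment above can be seen in a minimal sketch. The input string and variable names here are made up for illustration and are not from the question; the point is only that a regex pairs tags by position, while a DOM parser pairs them by structure.

```php
<?php
// Made-up input: one div nested inside another.
$html = '<div>outer <div>inner</div> tail</div>';

// A lazy regex pairs the outer <div> with the inner </div>,
// so the match stops before " tail".
preg_match('/<div>(.*?)<\/div>/s', $html, $m);
$regexMatch = $m[0];
echo $regexMatch, "\n";

// A DOM parser tracks nesting, so the first div keeps all of its content.
$doc = new DOMDocument();
$doc->loadHTML("<html>$html</html>", LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$outerText = $doc->getElementsByTagName('div')->item(0)->textContent;
echo $outerText, "\n";
```

Script elements do not nest, which is why the regex approach can still be made to work for this particular question.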

1 Answer

Using DOMDocument is generally preferable to trying to parse HTML with regex. Based on your question, this will give you the results you want. It finds each script node in the HTML and replaces it with the comment you specified:

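// Parse the fragment; the temporary <html> wrapper gives it a single root element
$doc = new DOMDocument();
$doc->loadHTML("<html>$html_string</html>", LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
// Select every <script> element, wherever it appears in the document
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//script') as $script) {
    // Swap the whole script node for a comment node
    $comment = $doc->createComment('THIS SCRIPT CONTENT HERE HAS BEEN ALTERED');
    $script->parentNode->replaceChild($comment, $script);
}
// Strip the leading "<html>" (6 characters) and the trailing "</html>\n" (8 characters)
echo substr($doc->saveHTML(), 6, -8);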

Note that because you don't have a top-level element in the HTML, one (<html>) has to be added on read and then removed on output (using substr).

Output:

<div> Below are object Node with the html code </div> 
<!--THIS SCRIPT CONTENT HERE HAS BEEN ALTERED--> 
<div> I don't want this to be replaced </div> 
<!--THIS SCRIPT CONTENT HERE HAS BEEN ALTERED--> 
<div> This is a div tag and not a script, so it should not be replaced </div> 
<!--THIS SCRIPT CONTENT HERE HAS BEEN ALTERED--> 
<div> The above is the final result of the replacements </div> 

Demo on 3v4l.org

If you insist on using regex (but you should read this before you do), the problem with your regex lies in this part:

(.*)?

This looks for an optional string of as many characters as possible, leading up to </script>. Because * is greedy and . matches every character (including newlines, given the s flag), it absorbs everything between the first <script> and the last </script>; the intermediate </script> tags are just more characters that match .. What you actually wanted was (.*?), which is non-greedy and so matches only up to the first </script>, i.e.

$regex = '/(?:\<script\>)(.*?)(?:\<\/script\>)/ims';
$result = preg_replace($regex, '<!-- THIS SCRIPT CONTENT HERE HAS BEEN ALTERED -->', $html_string);
echo $result;

The output from this is as you require.

Demo on 3v4l.org
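
To see the two quantifiers side by side, here is a minimal sketch that counts matches on a small made-up input (the $html value and variable names are illustrative only). The third pattern, which tolerates attributes on the opening tag, goes beyond the original regex; it is an assumption about what real pages tend to contain, and it is still no substitute for a parser.

```php
<?php
// Made-up sample input: two plain script blocks and one with an attribute.
$html = "<script>a();</script>\n<div>keep</div>\n<script>\nb();\n</script>\n"
      . "<script type=\"text/javascript\">c();</script>";

// Greedy: (.*) runs on to the LAST </script>, producing one big match.
$greedy = preg_match_all('/<script>(.*)<\/script>/is', $html);

// Lazy: (.*?) stops at the FIRST </script> after each plain <script>.
$lazy = preg_match_all('/<script>(.*?)<\/script>/is', $html);

// Lazy, with [^>]* allowing attributes inside the opening tag.
$relaxed = preg_match_all('/<script\b[^>]*>(.*?)<\/script>/is', $html);

echo "greedy: $greedy, lazy: $lazy, relaxed: $relaxed\n";
// prints "greedy: 1, lazy: 2, relaxed: 3"
```

The plain lazy pattern misses the third block because its opening tag is not literally <script>, which is one more reason the DOMDocument version is the safer choice.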

Nick
  • Thank you, Nick. I'm not used to DOMDocument, which I'll begin to use henceforth. I have not fully understood `DOMXPath`, though. While reading the manual at https://www.php.net/manual/en/domdocument.construct.php , I see that I can get the elements with DOMDocument::getElementsByTagName('script'). While I'm still on the basics of this DOM parser, can I replace the ` – Uchenna Ajah Apr 06 '19 at 13:26
  • @UchennaAjah you are absolutely right about using `DOMDocument::getElementsByTagName('script')` but the problem with using that is that when you try and use a `foreach` over it, each replacement with a comment causes the iterator to change, and you end up skipping every second script. See https://3v4l.org/Y9rAe – Nick Apr 06 '19 at 21:25
  • OK. I've not tested this though, but I think it can alternatively work out. Say: `$elem = DOMDocument::getElementsByTagName('script'); $length = $elem->length; for($x = 0; $x < $length; $x++) { $elem->item($x); }`. Hopefully, I guess that would return all the ` – Uchenna Ajah Apr 06 '19 at 22:02
  • @UchennaAjah unfortunately that doesn't work either... https://3v4l.org/TUMUq. You cannot easily manipulate `innerHTML` and `outerHTML` in `DOM`, but [this question](https://stackoverflow.com/questions/2087103/how-to-get-innerhtml-of-domnode) might give you some ideas – Nick Apr 06 '19 at 22:15
  • @UchennaAjah You can make the `getElementsByTagName` version work by going through the list backwards - that way the changes only affect the DOM *after* the element you replace and so you get to see all the script nodes. https://3v4l.org/P5IEM – Nick Apr 06 '19 at 22:40
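
The backwards walk described in the last comment can be sketched roughly as follows. This is not the code behind the linked demo; the input string and variable names are made up, and it reuses the <html> wrapper and substr trick from the answer.

```php
<?php
// Made-up sample input with three script elements.
$html = '<div>a</div><script>one();</script><div>b</div>'
      . '<script>two();</script><script>three();</script>';

$doc = new DOMDocument();
$doc->loadHTML("<html>$html</html>", LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// getElementsByTagName returns a live list: replacing a script removes it
// from the list and shifts every later item down by one.
$scripts = $doc->getElementsByTagName('script');

// Walk from the last index to the first, so each replacement only
// affects positions that have already been visited.
for ($i = $scripts->length - 1; $i >= 0; $i--) {
    $script = $scripts->item($i);
    $comment = $doc->createComment('THIS SCRIPT CONTENT HERE HAS BEEN ALTERED');
    $script->parentNode->replaceChild($comment, $script);
}

// Remove the temporary <html> wrapper, as in the answer.
$result = substr($doc->saveHTML(), 6, -8);
echo $result, "\n";
```

A forward loop over the same live list would skip elements for the reason given in the comments, whereas the XPath query in the answer returns a list that is not updated as nodes are replaced.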