I'm having an issue while parsing HTML with PHP's DOMDocument.
The HTML I'm parsing has the following script tag:
<script type="text/javascript">
var showShareBarUI_params_e81 =
{
buttonWithCountTemplate: '<div class="sBtnWrap"><a href="#" onclick="$onClick"><div class="sBtn">$text<img src="$iconImg" /></div><div class="sCountBox">$count</div></a></div>',
}
</script>
This snippet has two problems:
1) The HTML inside the buttonWithCountTemplate var is not escaped. DOMDocument handles this correctly, escaping the characters when it parses them, so this is not a problem.
2) Near the end, there's an img tag with an unescaped closing tag:
<img src="$iconImg" />
The /> makes DOMDocument think that the script has finished, even though the script's own closing tag has not been reached yet. If you extract the script using getElementsByTagName, you'll get an element that ends at this img tag, and the rest will appear as text in the HTML.
My goal is to remove all scripts from this page, so if I call removeChild() on this tag, the tag is removed but the following part appears as text when the page is rendered:
</div><div class="sCountBox">$count</div></a></div>',
}
</script>
Fixing the HTML is not a solution because I'm developing a generic parser that needs to handle all kinds of HTML.
My question is whether I should do any sanitization before feeding the HTML to DOMDocument, whether there's an option I can enable on DOMDocument to avoid triggering this issue, or whether I can strip all tags before loading the HTML.
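To make the pre-sanitization option concrete, this is roughly what I have in mind. It is only a sketch: stripScriptBlocks() is a hypothetical helper of mine (not part of DOMDocument), and the surrounding p tags are made-up markup added so that something is left to load.

```php
<?php
// Hypothetical helper: drop <script>...</script> blocks with a regex
// before the HTML parser ever sees them.
function stripScriptBlocks(string $html): string
{
    // ~is: case-insensitive, and "." also matches newlines.
    // ".*?" is non-greedy, so each block ends at the nearest </script>.
    $result = preg_replace('~<script\b[^>]*>.*?</script\s*>~is', '', $html);

    // preg_replace() returns null on failure (e.g. backtrack limit reached);
    // fall back to the untouched input in that case.
    return $result ?? $html;
}

// Nowdoc: nothing is interpolated, so $onClick etc. stay literal.
$html = <<<'EOT'
<p>Before</p>
<script type="text/javascript">
var showShareBarUI_params_e81 =
{
buttonWithCountTemplate: '<div class="sBtnWrap"><a href="#" onclick="$onClick"><div class="sBtn">$text<img src="$iconImg" /></div><div class="sCountBox">$count</div></a></div>',
}
</script>
<p>After</p>
EOT;

$clean = stripScriptBlocks($html);

// Only the script-free markup reaches DOMDocument.
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($clean);
echo $dom->saveHTML();
```

I'm not sure this is robust enough for a generic parser, though: a literal </script> inside a JS string, or a script tag inside an HTML comment, would make the regex cut in the wrong place.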
Any ideas?
EDIT
After some research, I found the real problem with the DOMDocument parser. Consider the following HTML:
<div> <!-- Offending div without closing tag -->
<script type="text/javascript">
var test = '</div>';
// I should not appear on the result
</script>
I'm using the following PHP code to remove the script tags (based on Gholizadeh's answer):
<?php
error_reporting(E_ALL);
ini_set('display_errors', 1);

$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
// Collect libxml parse warnings internally instead of emitting them
libxml_use_internal_errors(true);
$dom->loadHTML(file_get_contents('js.html'), LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
//@$dom->loadHTMLFile('script.html'); //fix tags if not exist

// The node list is live, so keep removing the first script until none remain
while ($nodes = $dom->getElementsByTagName("script")) {
    if ($nodes->length == 0) break;
    $script = $nodes->item(0);
    $script->parentNode->removeChild($script);
}

//return $dom->saveHTML();
$final = $dom->saveHTML();
echo $final;
The result is the following:
<div> <!-- Offending div without closing tag -->
<p>';
// I should not appear on the result
</p></div>
The problem is that the first div tag is not closed, and it seems that DOMDocument treats the div tag inside the JS string as HTML instead of as part of a plain JS string.
What can I do to solve this? Remember that modifying the HTML is not an option, since I'm developing a generic parser.
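One workaround I'm considering is to escape every "</" inside script bodies as "<\/" before loading, so the parser has nothing inside the script that looks like an end tag. This is only a rough sketch: escapeEndTagsInScripts() is a hypothetical helper of mine, and it assumes that each script block really ends at its first </script>.

```php
<?php
// Hypothetical helper: rewrite "</" as "<\/" inside script bodies so the
// parser cannot mistake it for an end tag. In a JavaScript string literal,
// '<\/div>' evaluates to the same text as '</div>'.
function escapeEndTagsInScripts(string $html): string
{
    $result = preg_replace_callback(
        // 1: opening tag, 2: script body (non-greedy), 3: closing tag
        '~(<script\b[^>]*>)(.*?)(</script\s*>)~is',
        static function (array $m): string {
            // Only the body is touched; the real </script> stays intact.
            return $m[1] . str_replace('</', '<\/', $m[2]) . $m[3];
        },
        $html
    );

    // preg_replace_callback() returns null on failure; keep the input then.
    return $result ?? $html;
}

$html = <<<'EOT'
<div> <!-- Offending div without closing tag -->
<script type="text/javascript">
var test = '</div>';
// I should not appear on the result
</script>
EOT;

$escaped = escapeEndTagsInScripts($html);
echo $escaped;
```

Since the scripts get removed afterwards anyway, it wouldn't matter much if the rewritten body differs slightly from the original, but I don't know whether this is safe for every page I might run into.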