13

I'm having an issue while parsing HTML with PHP's DOMDocument.

The HTML I'm parsing has the following script tag:

<script type="text/javascript">
    var showShareBarUI_params_e81 =
    {
        buttonWithCountTemplate: '<div class="sBtnWrap"><a href="#" onclick="$onClick"><div class="sBtn">$text<img src="$iconImg" /></div><div class="sCountBox">$count</div></a></div>',
    }
</script>

This snippet has two problems:

1) The HTML inside the buttonWithCountTemplate var is not escaped. DOMDocument manages this correctly, escaping the characters when parsing it. Not a problem.

2) Near the end, there's an img tag with an unescaped closing tag:

<img src="$iconImg" />

The /> makes DOMDocument think that the script has ended even though its closing tag is missing. If you extract the script using getElementsByTagName, the element ends at this img tag, and the rest appears as text in the HTML.

My goal is to remove all scripts in this page, so if I do a removeChild() over this tag, the tag is removed but the following part appears as text when rendering the page:

</div><div class="sCountBox">$count</div></a></div>',
        }
    </script>

Fixing the HTML is not a solution because I'm developing a generic parser that needs to handle all kinds of HTML.

My question is whether I should sanitize the HTML before feeding it to DOMDocument, whether there is a DOMDocument option that avoids this issue, or whether I can strip all tags before loading the HTML.

Any ideas?


EDIT

After some research, I found the real problem with the DOMDocument parser. Consider the following HTML:

<div> <!-- Offending div without closing tag -->
<script type="text/javascript">
       var test = '</div>';
       // I should not appear on the result
</script>

Using the following php code to remove script tags (based on Gholizadeh's answer):

<?php
error_reporting(E_ALL);
ini_set('display_errors', 1);

$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
libxml_use_internal_errors(true);
$dom->loadHTML(file_get_contents('js.html'), LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
//@$dom->loadHTMLFile('script.html'); //fix tags if not exist

while($nodes = $dom->getElementsByTagName("script")) {
    if($nodes->length == 0) break;
    $script = $nodes->item(0);
    $script->parentNode->removeChild($script);
}

//return $dom->saveHTML();
$final = $dom->saveHTML();
echo $final;

The result will be the following:

<div> <!-- Offending div without closing tag -->
<p>';
       // I should not appear on the result
</p></div>

The problem is that the first div tag is not closed, and DOMDocument seems to treat the div tag inside the JS string as HTML instead of as part of a plain JS string.

What can I do to solve this? Remember that modifying the HTML is not an option, since I'm developing a generic parser.

Andres
  • Interesting question. I got tired of advocates against old `` syntax claiming that "all browsers understand JavaScript", which is at most a half true. – Álvaro González Nov 23 '16 at 10:13
  • Is it really the img element that’s to blame here? My guess would rather be the following `</div>`, because the first occurrence of `</` implicitly ends a script element’s content and closes it. // IMHO you cannot just let a DOM parser loose on any broken HTML code and expect correct results. If you really need to parse messed-up HTML like this, you might need to do some “pre-processing” on it before you feed it to a DOM parser, perhaps something like http://htmlpurifier.org/ – CBroe Nov 23 '16 at 10:19
  • Try loading the document as XML rather than HTML: http://stackoverflow.com/questions/19788017/how-to-combine-phps-domdocument-with-a-javascript-template – Rafał R Nov 23 '16 at 10:42
  • @RafałR Using loadXML is not the solution. If your HTML isn't 100% valid, no nodes will be loaded. Try loading my edit and you'll see that the result is empty. – Andres Nov 24 '16 at 19:26

4 Answers

6

I tested the following code on an HTML file like this:

<p>some text 1</p>
<img src="http://www.example.com/images/some_image_1.jpg">
<p>some text 2</p>
<p>some text 3</p>
<img src="http://www.example.com/images/some_image_2.jpg">

<script type="text/javascript">
    var showShareBarUI_params_e81 =
    {
        buttonWithCountTemplate: '<div class="sBtnWrap"><a href="#" onclick="$onClick"><div class="sBtn">$text<img src="$iconImg" /></div><div class="sCountBox">$count</div></a></div>',
    }
</script>

<p>some text 4</p>
<p>some text 5</p>
<img src="http://www.example.com/images/some_image_3.jpg">

The PHP code is:

<?php
error_reporting(E_ALL);
ini_set('display_errors', 1);

$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
// @ silences the warnings libxml raises for invalid markup
@$dom->loadHTML(file_get_contents('script.html'), LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
//@$dom->loadHTMLFile('script.html'); //fix tags if not exist

// getElementsByTagName() returns a live list, so copy it to an array
// before removing nodes; otherwise some script elements are skipped
$nodes = iterator_to_array($dom->getElementsByTagName("script"));

foreach ($nodes as $script) {
    $script->parentNode->removeChild($script);
}

//return $dom->saveHTML();
$dom->saveHTMLFile('script.html');

It works on the given example. I think you should use the same options I used when loading the HTML.

Edited according to last question updates:

Actually, you can't parse [X]HTML with regex (read this link for more information), but if your only purpose is to remove script tags, and you can make sure that no </script> tag appears as a string inside them, you can use this regex:

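// Note: the 'HTML-ENTITIES' target encoding is deprecated as of PHP 8.2
$html = mb_convert_encoding(file_get_contents('script2.html'), 'HTML-ENTITIES', 'UTF-8');
// s: dot also matches newlines, i: case-insensitive
$new_html = preg_replace('/<script(.*?)>(.*?)<\/script>/si', '', $html);
file_put_contents('script-result.html', $new_html);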

Frankly, the problem is that your HTML code may not be standard. I think it's better to try the other libraries linked here.

Otherwise, I guess you will have to write a special parser that removes script tags and takes care of single and double quotes inside them.
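
The "special parser" idea above could look roughly like the following. This is only a minimal sketch, not a complete JavaScript tokenizer: `stripScriptElements` is a hypothetical name, and the code assumes that tracking single-quoted, double-quoted, and backtick strings is enough. It does not understand JS comments or regex literals, and it deliberately differs from browsers, which end a script at the first literal </script> even inside a string. If no closing tag is found outside a string, it falls back to the first literal one.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: drop <script> elements from raw HTML while ignoring
 * any "</script" that sits inside a JavaScript string literal.
 */
function stripScriptElements(string $html): string
{
    $out = '';
    $pos = 0;
    $len = strlen($html);

    // Locate each opening <script ...> tag, case-insensitively.
    while (preg_match('~<script\b[^>]*>~i', $html, $m, PREG_OFFSET_CAPTURE, $pos)) {
        $tagStart  = $m[0][1];
        $bodyStart = $tagStart + strlen($m[0][0]);
        $out .= substr($html, $pos, $tagStart - $pos); // keep text before the script

        $quote = null; // current JS string delimiter, or null when outside a string
        $close = null; // offset of the closing "</script"
        for ($i = $bodyStart; $i < $len; $i++) {
            $c = $html[$i];
            if ($quote !== null) {
                if ($c === '\\') {
                    $i++; // skip the escaped character
                } elseif ($c === $quote || ($c === "\n" && $quote !== '`')) {
                    // ' and " strings cannot span lines, so a newline resets the state
                    $quote = null;
                }
            } elseif ($c === '"' || $c === "'" || $c === '`') {
                $quote = $c;
            } elseif ($c === '<' && strncasecmp(substr($html, $i, 8), '</script', 8) === 0) {
                $close = $i;
                break;
            }
        }

        if ($close === null) {
            // Fallback: take the first literal closing tag, or drop the rest of the input.
            $lit   = stripos($html, '</script', $bodyStart);
            $close = $lit === false ? $len : $lit;
        }
        $gt  = $close < $len ? strpos($html, '>', $close) : false;
        $pos = $gt === false ? $len : $gt + 1;
    }

    return $out . substr($html, $pos);
}

// The unclosed <div> example from the question's edit.
$html = <<<'HTML'
<div> <!-- Offending div without closing tag -->
<script type="text/javascript">
       var test = '</div>';
       // I should not appear on the result
</script>
HTML;

$clean = stripScriptElements($html);
echo $clean;
```

The cleaned string can then be passed to DOMDocument::loadHTML() as usual.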

Saeed.Gh
  • Please check the edit, I found out the real issue and your solution doesn't apply anymore. Thanks! – Andres Nov 24 '16 at 19:17
3

I am offering a different approach to your problem:

My goal is to remove all scripts in this page

Then you can remove them with the preg_replace_callback function and parse the HTML as a DOM afterwards. Here is a working demo:

$htmlWithScript = "<html><body><div>something></div><script type=\"text/javascript\">
var showShareBarUI_params_e81 =
{
    buttonWithCountTemplate: '<div class=\"sBtnWrap\"><a href=\"#\" onclick=\"\$onClick\"><div class=\"sBtn\">\$text<img src=\"\$iconImg\" /></div><div class=\"sCountBox\">\$count</div></a></div>',
}
</script></body></html>";



$htmlWithoutScript = preg_replace_callback('~<script.*>.*</script>~Uis', function($matches){
return '';
}, $htmlWithScript);

EDIT

But how do I do this without summoning Cthulhu?

Nice comment, but I don't know what you are asking :) If it is about loading the HTML, you can load it with file_get_contents().

In case it is unclear how this removes the tags: preg_replace_callback lets you search for matches of a regular expression and transform them, in this case by removing them (return '';). The regexp looks for an opening <script> tag with any attributes (.*) and any content up to the closing </script> tag.

Modifiers:

U -> ungreedy (shortest match possible)

i -> case-insensitive (<SCRIPT> will be matched as well)

s -> the . (dot) character also matches newlines (a newline will not break the match)

I hope this clarifies it a bit..
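
Putting the two steps together (strip the scripts with the regex, then parse), a minimal sketch on the unclosed-div example from the question's edit might look like this. It reuses the pattern from above unchanged; the variable names are illustrative, and the libxml calls are only there to keep parser warnings quiet.

```php
<?php
// The unclosed <div> example from the question's edit.
$htmlWithScript = <<<'HTML'
<div> <!-- Offending div without closing tag -->
<script type="text/javascript">
       var test = '</div>';
       // I should not appear on the result
</script>
HTML;

// Step 1: remove every <script>...</script> block before any DOM parsing happens.
$htmlWithoutScript = preg_replace_callback('~<script.*>.*</script>~Uis', function ($matches) {
    return '';
}, $htmlWithScript);

// Step 2: parse what is left. libxml warnings are collected instead of printed.
$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($htmlWithoutScript, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$result = $dom->saveHTML();
echo $result;
```

Because the script is gone before DOMDocument sees the markup, the </div> inside the JS string can no longer be mistaken for a closing tag. The usual caveat still applies: a literal </script> inside a JS string would end the match early.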

Jimmmy
  • But how do I do this without summoning Cthulhu? – Andres Nov 24 '16 at 18:09
  • This happens on the back-end. You're not executing anything. At best, you're looking at Cthulhu and removing the teeth. – smcjones Nov 27 '16 at 18:27
  • @Jimmmy http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Tschallacka Nov 29 '16 at 12:03
  • @Tschallacka this is useful but not necessarily relevant. No one is suggesting removing the parser. Instead, this is supplementing the parser with a limited regex. The link above deals with using regex AS a parser, which is different. The other answers seem to reflect this confusion, as many say it is possible to build a parser with help from regex (compare to making a regex to parse) – smcjones Nov 29 '16 at 16:08
2

Have you tried setting libxml to use internal errors?

$use_errors = libxml_use_internal_errors(true);
// your parsing code here
libxml_clear_errors();
libxml_use_internal_errors($use_errors);

It might allow DOMDocument to continue parsing (maybe).
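
To see what libxml actually complains about, rather than only silencing it, the collected errors can be read back with libxml_get_errors(). A minimal sketch, assuming a small inline sample in place of the real page; it changes how errors are reported, not how the markup is parsed.

```php
<?php
// Small inline sample standing in for the real page (an assumption for this sketch).
$html = "<div>\n<script>var test = '</div>';</script>";

// Collect libxml errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Each entry is a LibXMLError object with level, code, line, column and message.
$errors = libxml_get_errors();
foreach ($errors as $error) {
    printf("line %d, column %d: %s\n", $error->line, $error->column, trim($error->message));
}

// Empty the error buffer and restore the previous setting.
libxml_clear_errors();
libxml_use_internal_errors($previous);
```

The resulting DOM tree is the same as it would be with the @ operator.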

Tschallacka
  • This is not different from `@`. It won't prevent HTML from being incorrectly parsed. – Álvaro González Nov 23 '16 at 10:16
  • True, but i've had cases where domdocument could parse when internal errors was enabled. that's why i'm putting it up here. – Tschallacka Nov 23 '16 at 10:23
  • As @Tschallacka said, that just hides the errors, the parsing problems are still there. – Andres Nov 23 '16 at 15:48
  • @ÁlvaroGonzález Using `libxml_use_internal_errors` **is** different from `@` in that it will only suppress errors stemming from the underlying libxml instead of *any* error. However, it doesn't help with the OP's problem. – Gordon Nov 29 '16 at 10:28
  • @Gordon Certainly, I only meant that it was merely a way to hide error messages (though `libxml_use_internal_errors()` does not really suppress them). – Álvaro González Nov 29 '16 at 11:01
1

Parsing HTML documents is mostly about their content, not their scripts. Using those scripts without knowing their behaviour and origin can be especially dangerous and risky.

So when it comes to HTML content, you can omit scripts with this approach (which I already pointed out in a comment): How to combine PHP's DOMDocument with a JavaScript template

To be specific with your example:

<?php
$html = <<<END
<!DOCTYPE html>
<html><body><h1>Hey now</h1>
<script type="text/javascript">
    var showShareBarUI_params_e81 =
    {
        buttonWithCountTemplate: '<div class="sBtnWrap"><a href="#" onclick="onClick"><div class="sBtn">text<img src="iconImg" /></div><div class="sCountBox">count</div></a></div>'
    }
</script>
</body></html>
END;

$dom = new DOMDocument();
$dom->preserveWhiteSpace = true; // needs to be before loading, to have any effect
$dom->loadXML($html);
// getElementsByTagName() returns a live list, so re-query until no script is left
while (($r = $dom->getElementsByTagName("script")) && $r->length) {
    $r->item(0)->parentNode->removeChild($r->item(0));
}
$dom->formatOutput = false;
print $dom->saveHTML();

//Outputs
//<!DOCTYPE html><html><head></head><body><h1>Hey now</h1></body></html>

You can also try using regular expressions to remove script tags before loading the HTML into DOMDocument, or check other HTML parsing libraries. Finally, you have to realize that in some cases even a perfect expression will break, and that the DOMDocument parser is not as good as a true browser engine. It all comes down to the purpose of your parsing and finding the best solution for it.

PHP Simple HTML DOM Parser Example:

http://simplehtmldom.sourceforge.net/manual.htm

require_once 'libs/simplehtmldom_1_5/simple_html_dom.php';
$html = <<<END
<div> <!-- Offending div without closing tag -->
<script type="text/javascript">
       var test = '</div>';
       // I should not appear on the result
</script>
END;

$dom = str_get_html($html);
echo $dom;

//outputs with no error or warnings
//<div> <!-- Offending div without closing tag --><script type="text/javascript">var test = '</div>';// I should not appear on the result  </script>
Rafał R
  • Using loadXML is not the solution. If your HTML isn't 100% valid, no nodes will be loaded. Try loading my edit and you'll see that the result is empty. – Andres Nov 24 '16 at 19:26
  • I tried PHP Simple HTML DOM Parser from http://simplehtmldom.sourceforge.net/ and the output looks good. – Rafał R Nov 24 '16 at 21:41
  • And it's really easy to use: require_once 'libs/simplehtmldom_1_5/simple_html_dom.php'; $html = << END; $dom = str_get_html($html); echo $dom; – Rafał R Nov 24 '16 at 21:42
  • Sorry for the terrible comment formatting. I was doing it for the first time :) Proper link: http://simplehtmldom.sourceforge.net/ – Rafał R Nov 24 '16 at 21:46
  • XHTML was supposed to be XML (in practice, nobody cared). Regular HTML is not XML, even if it's perfectly valid HTML. – Álvaro González Nov 25 '16 at 09:35