
I am trying to load an HTML page from a remote server into a PHP script that manipulates the HTML with the DOMDocument class. However, I have noticed that DOMDocument removes parts of the JavaScript that comes with the HTML page. The page contains things like:

<script type="text/javascript">
//...
function printJSPage() {
    var printwin=window.open('','haha','top=100,left=100,width=800,height=600');
    printwin.document.writeln(' <table border="0" cellspacing="5" cellpadding="0" width="100%">');
    printwin.document.writeln(' <tr>');
    printwin.document.writeln(' <td align="left" valign="bottom">');
    //...
    printwin.document.writeln('</td>');
    //...
}
</script>

But DOMDocument changes, for example, the line

printwin.document.writeln('</td>');

to

printwin.document.writeln(' ');

and many other things as well (for example, the last script tag is no longer there). As a result I get a completely broken page, which I cannot pass on.

So I think DOMDocument has problems with the HTML tags inside the JavaScript code and tries to correct the code in order to produce a well-formed document. Can I prevent DOMDocument from parsing the JavaScript?

The PHP code fragment is:

$stdin = file_get_contents('php://stdin');
$dom = new \DOMDocument();
@$dom->loadHTML($stdin);
return $dom->saveHTML();   // will produce wrong HTML
//return $stdin;           // will produce correct HTML

I have stored both HTML versions and compared them with Meld.

I have also tested

@$dom->loadXML($stdin);
return $dom->saveHTML();

but I don't get anything back from the object.

witchi
  • Can reproduce https://3v4l.org/O0iEf – Gordon Aug 17 '18 at 09:44
  • Possible duplicate of [DOMDocument removes HTML tags in JavaScript string](https://stackoverflow.com/questions/24575136/domdocument-removes-html-tags-in-javascript-string) – iainn Aug 17 '18 at 09:48
  • Initially I thought this was a duplicate of https://stackoverflow.com/questions/4029341/dom-parser-that-allows-html5-style-in-script-tag but there doesn't seem to be a sensible solution to this problem in that question so I'm going to go with DomDocument can't deal with script tags properly, which when saying it out loud sounds ridiculous. I've even tried wrapping the script contents in `<![CDATA[...]]>` but that still does not work – apokryfos Aug 17 '18 at 10:02
  • I also cannot wrap the script tags or do anything else to the page before parsing. I get the page from an external system (I don't have access there), but I need some PHP to post-process the pages before they are delivered to the browser. – witchi Aug 17 '18 at 11:40
  • Maybe it is not a problem of DOMDocument, but of the underlying libxml2. I have tested my page with `xmllint --html --htmlout /tmp/mypage.html` and I get a lot of parser errors, exactly on the positions where DOMDocument removes tags. – witchi Aug 17 '18 at 12:17
  • I have looked into the libxml code and found a possible solution: the recover mode. The function htmlParseScript() supports it, and `xmllint --html --htmlout --recover /tmp/mypage.html` now returns the last TD tag of the example. The PHP equivalent, DOMDocument->recover=TRUE, doesn't work, and I cannot find a matching option for loadHTML(). Is the source of DOMDocument available? – witchi Aug 20 '18 at 08:52
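
The parser errors mentioned in the comments above can also be inspected from PHP itself instead of through xmllint. The following is only a minimal sketch: the sample page is made up, and which errors are reported (if any) depends on the libxml2 version PHP is linked against.

```php
<?php
// Sample input modelled on the question; the surrounding markup is an assumption.
$html = <<<'HTML'
<html><body>
<script type="text/javascript">
document.writeln('<td>');
document.writeln('</td>');
</script>
</body></html>
HTML;

// Buffer libxml errors instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Fetch whatever the HTML parser complained about, then reset the buffer.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

foreach ($errors as $error) {
    // Each entry is a LibXMLError with line, column and message properties.
    printf("line %d, column %d: %s\n", $error->line, $error->column, trim($error->message));
}
```

Comparing the reported line numbers with the places where tags vanish from the saveHTML() output should show whether the two coincide, as they did with xmllint.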

2 Answers


Here's a hack that might be helpful. The idea is to replace the script contents with a string that is unique and guaranteed to be valid HTML, let DOMDocument do its work, and then swap the original contents back in.

It replaces the contents of every script tag with the MD5 hash of those contents and restores them afterwards.

$scriptContainer = [];
// Swap each script body for its MD5 hash and remember the original under that hash.
$str = preg_replace_callback("#<script([^>]*)>(.*?)</script>#s", function ($matches) use (&$scriptContainer) {
    $hash = md5($matches[2]);
    $scriptContainer[$hash] = $matches[2];
    return "<script" . $matches[1] . ">" . $hash . "</script>";
}, $str);
$dom = new \DOMDocument();
@$dom->loadHTML($str);
// Put the original script bodies back into the serialized document.
$final = strtr($dom->saveHTML(), $scriptContainer);

Here strtr is just convenient because the array is already laid out as hash => original contents; str_replace(array_keys($scriptContainer), $scriptContainer, $dom->saveHTML()) would also work.
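
To illustrate that the two calls are interchangeable here, below is a small sketch with a made-up placeholder map and a made-up serialized string (neither comes from a real page). One difference worth knowing: strtr never re-scans text it has already replaced, whereas str_replace applies the pairs one after another, so the results could only diverge if a restored script body happened to contain another hash.

```php
<?php
// Hypothetical placeholder map in the same shape as $scriptContainer:
// MD5 hash => original script body.
$scriptContainer = [
    md5("alert('</td>');") => "alert('</td>');",
    md5("var x = 1;")      => "var x = 1;",
];

// Stand-in for $dom->saveHTML(): the script bodies are still masked by their hashes.
$hashes = array_keys($scriptContainer);
$serialized = "<script>{$hashes[0]}</script><p>text</p><script>{$hashes[1]}</script>";

// Variant 1: strtr takes the hash => body map directly.
$viaStrtr = strtr($serialized, $scriptContainer);

// Variant 2: str_replace needs the keys and the values as separate arrays.
$viaStrReplace = str_replace(
    array_keys($scriptContainer),
    array_values($scriptContainer),
    $serialized
);

var_dump($viaStrtr === $viaStrReplace); // bool(true)
```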

I find it very surprising that PHP does not parse HTML content properly. It seems to be parsing it as XML instead (and wrongly at that, because CDATA content is parsed rather than treated literally). However, it is what it is, and if you want a real document parser you should probably look into a Node.js solution with jsdom.

apokryfos

If you have a <script> within a <script>, the following (not so smart) solution will handle it. One problem remains: if the <script> tags are not balanced, the solution will not work. This can happen, for example, if your JavaScript uses String.fromCharCode to print the string </script>.

$scriptContainer = array();

function getPosition($tag) {
    return $tag[0][1];
}

function getContent($tag) {
    return $tag[0][0];
}

function isStart($tag) {
    $x = getContent($tag);
    return ($x[0].$x[1] === "<s");
}

function isEnd($tag) {
    $x = getContent($tag);
    return ($x[0].$x[1] === "</");
}

function mask($str, $scripts) {
    global $scriptContainer;

    $res = "";
    $start = null;
    $stop = null;
    $idx = 0;

    $count = 0;
    foreach ($scripts as $tag) {
        // Remember the outermost opening tag.
        if (isStart($tag)) {
            $count++;
            $start = ($start === null) ? $tag : $start;
        }

        // The closing tag that brings the nesting depth back to zero ends the block.
        if (isEnd($tag)) {
            $count--;
            $stop = ($count == 0) ? $tag : $stop;
        }

        if ($start !== null && $stop !== null) {
            // Copy everything up to and including the opening tag.
            $res .= substr($str, $idx, getPosition($start) - $idx);
            $res .= getContent($start);
            // Replace the script body with its MD5 hash and keep the original for later.
            $code = substr($str, getPosition($start) + strlen(getContent($start)), getPosition($stop) - getPosition($start) - strlen(getContent($start)));
            $hash = md5($code);
            $res .= $hash;
            $res .= getContent($stop);

            $scriptContainer[$hash] = $code;

            $idx = getPosition($stop) + strlen(getContent($stop));
            $start = null;
            $stop = null;
        }
    }

    $res .= substr($str, $idx);
    return $res;
}

preg_match_all("#<script[^>]*>|</script>#", $html, $scripts, PREG_OFFSET_CAPTURE|PREG_SET_ORDER);
$html = mask($html, $scripts);

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_use_internal_errors(false);

// handle some things within DOM

echo strtr($dom->saveHTML(), $scriptContainer);

If you replace the string "script" in the preg_match_all pattern with "style", you can also mask the CSS styles, which can contain tag names too (e.g. within comments).
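
As a rough sketch of masking both element types in one pass: the helper name maskElements and the sample page below are made up for illustration, and this variant uses a single regular expression instead of the tag counting above, so it does not handle a <script> inside a <script>; each element simply ends at its first matching closing tag.

```php
<?php
/**
 * Hypothetical helper: replace the body of every listed element with the
 * MD5 hash of that body and collect hash => original body in $map.
 */
function maskElements(string $html, array $tags, array &$map): string
{
    $names = implode('|', array_map(function (string $tag): string {
        return preg_quote($tag, '#');
    }, $tags));

    // Group 2 captures the tag name; \2 forces the closing tag to match it.
    // "i" ignores case, "s" lets "." span line breaks.
    $pattern = '#(<(' . $names . ')\b[^>]*>)(.*?)(</\2\s*>)#is';

    $masked = preg_replace_callback($pattern, function (array $m) use (&$map): string {
        $hash = md5($m[3]);
        $map[$hash] = $m[3];
        return $m[1] . $hash . $m[4];
    }, $html);

    if ($masked === null) {
        throw new RuntimeException('preg_replace_callback failed, code ' . preg_last_error());
    }
    return $masked;
}

// Made-up sample page with tag names inside a CSS comment and a JS string.
$html = <<<'HTML'
<html><head>
<style type="text/css">/* applies to <td> and </td> cells */ td { color: red; }</style>
</head><body>
<script type="text/javascript">document.writeln('</td>');</script>
</body></html>
HTML;

$map = [];
$masked = maskElements($html, ['script', 'style'], $map);

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($masked);
libxml_clear_errors();
libxml_use_internal_errors(false);

// ... DOM manipulation would go here ...

// Restore the original script and style bodies in the serialized output.
$final = strtr($dom->saveHTML(), $map);
```

Because the parser only ever sees the hashes, the tag names inside the CSS comment and the JavaScript string should come back untouched in $final.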

witchi