I use DOMDocument with PHP 7 to manipulate HTML. The problem is that the text (Cyrillic) displays correctly on the page, but when I go to "See HTML page source", it does not. It looks like this: Здесь осн

What might be wrong? The <meta> charset is utf-8. My code:

$dom = new DOMDocument();
if (@$dom->loadHTML(mb_convert_encoding("<div>$body</div>", 'HTML-ENTITIES', 'UTF-8'), LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD)) {

    // https://stackoverflow.com/questions/29493678/loadhtml-libxml-html-noimplied-on-an-html-fragment-generates-incorrect-tags

    $container = $dom->getElementsByTagName('div')->item(0);
    $container = $container->parentNode->removeChild($container);

    while ($dom->firstChild)
        $dom->removeChild($dom->firstChild);

    while ($container->firstChild )
        $dom->appendChild($container->firstChild);

    $xpath = new DOMXPath($dom); 
    $headlines = $xpath->query("//h2");
    // some code..

    return $dom->saveHTML();
}
sirjay

1 Answer

The problem is the call to $dom->saveHTML(): you need to pass the root node as an argument, like this:

return $dom->saveHTML((new \DOMXPath($dom))->query('/')->item(0));

Then it suddenly renders the page differently, with the entities substituted. If it does not, double-check the values of $dom->encoding and $dom->substituteEntities; they should read UTF-8 and TRUE.
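
To compare the two calls side by side, here is a minimal sketch. It makes a few assumptions that are not in the question: the sample $body string is made up, mb_encode_numericentity() stands in for mb_convert_encoding(..., 'HTML-ENTITIES', ...) (that encoding name is deprecated as of PHP 8.2), and the wrapper <div> is left in place so that $dom->documentElement is the node passed to saveHTML(). The exact output may differ between PHP and libxml versions, so the script prints both results instead of promising one.

```php
<?php
// Hypothetical sample content; any UTF-8 HTML fragment would do.
$body = '<h2>Здесь</h2><p>основной текст</p>';

// Turn every non-ASCII character into a numeric entity before loading,
// which is what the 'HTML-ENTITIES' conversion in the question does.
$encoded = mb_encode_numericentity("<div>$body</div>", [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

$dom = new DOMDocument();
$dom->loadHTML($encoded, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// Serialize the whole document without a node argument, as in the question.
$withoutNode = $dom->saveHTML();

// Serialize starting from an explicit node, as suggested in the answer.
$withNode = $dom->saveHTML($dom->documentElement);

echo "saveHTML():\n", $withoutNode, "\n";
echo "saveHTML(\$dom->documentElement):\n", $withNode, "\n";
```

If the second output shows the Cyrillic characters themselves while the first shows entities such as &#1047;, you are seeing the behaviour described above.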

Code4R7
  • How did you know that, my friend? I read a lot of sources and nobody wrote about this solution – sirjay Nov 23 '17 at 14:11
  • From memory, I had the same problem with my own framework years ago. The shorter syntax would be `$dom->saveHTML($dom->documentElement);` – Code4R7 Nov 23 '17 at 14:50
  • @sirjay [Others found the solution](https://stackoverflow.com/questions/8218230/php-domdocument-loadhtml-not-encoding-utf-8-correctly) as well. The behaviour is [not documented on php.net](http://php.net/manual/en/domdocument.savehtml.php). Google also has [no results](https://www.google.nl/search?q=%2Blibxml+%2Bsavehtml+-php) about this, so it must be something in how the saveHTML function passes parameters to libxml. I doubt the PHP team knows about this; there is [no bug report](https://bugs.php.net/search.php?search_for=savehtml). It's something users found out for themselves. – Code4R7 Nov 23 '17 at 15:05