How to safely read HTML with DOMDocument (without characters being garbled)?

Question

When I load HTML into a DOMDocument, the non-ASCII characters come out garbled.

In my project, the HTML source is supplied by the user, so its content may vary greatly.

I'd like to find a secure way of parsing HTML content from various sources.

By secure I mainly mean:

- keeping strings consistent with the original
- being protected against invalid-encoding attacks

unless you think I should have additional concerns.

Using nodeValue instead of textContent gives the same result in this case.

I created this simplified function to clarify the issue:

<?php

function print_content($html)
{
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $div = $dom->getElementById('cyrillic_bit');
    $content = $div->textContent;

    print(mb_internal_encoding().' '.$html."\n");
    print(mb_detect_encoding($content, 'Windows-1251, UTF-8', true)." ");
    print($content."\n");
}

$html = '<div id="cyrillic_bit">Дядо Коледа<br>Error</div>';
print_content($html);

?>

The output is:

UTF-8 <div id="cyrillic_bit">Дядо Коледа<br>Error</div>
UTF-8 ÐÑÐ´Ð¾ ÐÐ¾Ð»ÐµÐ´Ð°Error

I'd like it to be:

UTF-8 <div id="cyrillic_bit">Дядо Коледа<br>Error</div>
UTF-8 Дядо КоледаError

$dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8')); — Stoyan Georgiev, Oct 06 '19 at 14:38
@ChrisDekker my question is not database-related, but the question you suggested made me look into the `utf8mb4` encoding, and the answer to [PHP DOMDocument loadHTML not encoding UTF-8 correctly](https://stackoverflow.com/questions/8218230/php-domdocument-loadhtml-not-encoding-utf-8-correctly) works for me. — Stoyan Georgiev, Oct 06 '19 at 14:46
@ChrisDekker the duplicate you flagged was about databases and would provide no help for XML processing; note that if you make an incorrect flag you can always retract it. — miken32, Oct 09 '19 at 13:15
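
For reference, here is a minimal sketch of the approach suggested in the comments, under a few assumptions. `loadHTML()` falls back to ISO-8859-1 when the markup declares no charset, which is why the UTF-8 bytes come out garbled. The `HTML-ENTITIES` pseudo-encoding of `mb_convert_encoding()` is deprecated as of PHP 8.2, so this sketch uses `mb_encode_numericentity()` instead. The helper name `extract_text_utf8` is hypothetical, the input is assumed to be UTF-8, and rejecting malformed input with `mb_check_encoding()` is only one possible reading of "protected from invalid encoding attacks".

```php
<?php

// Hypothetical helper (not from the original post): returns the text content
// of the element with the given id, or null when the input is not valid
// UTF-8 or no such element exists. Assumes the caller's HTML is UTF-8.
function extract_text_utf8(string $html, string $id): ?string
{
    // Reject byte sequences that are not well-formed UTF-8.
    if (!mb_check_encoding($html, 'UTF-8')) {
        return null;
    }

    // Replace every non-ASCII character with a numeric entity so the parser
    // only sees ASCII and its ISO-8859-1 fallback cannot garble anything.
    $ascii = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

    $dom = new DOMDocument();

    // User-supplied markup is often malformed; collect parser warnings
    // internally instead of emitting them, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($ascii);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $node = $dom->getElementById($id);

    return $node === null ? null : $node->textContent;
}

$html = '<div id="cyrillic_bit">Дядо Коледа<br>Error</div>';
echo extract_text_utf8($html, 'cyrillic_bit'), "\n";
```

One caveat with this sketch: entities are not decoded inside raw-text elements such as `<script>` and `<style>`, so non-ASCII text there would stay in entity form. If that matters, prepending a charset declaration to the markup before calling `loadHTML()` is the other workaround commonly used for this problem.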
