
When I load HTML into a DOMDocument, the non-ASCII characters come out garbled.

In my project, the HTML source is supplied by the user, so its content may vary greatly.

I'd like to find a secure way of parsing HTML content from various sources.

By secure I mainly mean:

  • keeping strings consistent with the original

  • protecting against invalid-encoding attacks

Please tell me if you think I should have additional concerns.
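
To make the second point concrete, here is a minimal sketch of the kind of check I have in mind, assuming the input is expected to be UTF-8. The helper name assert_valid_utf8 is made up for illustration; mb_check_encoding comes from the mbstring extension. It only rejects malformed byte sequences and does nothing else to sanitize the markup.

```php
<?php

// Hypothetical helper (name invented for this example): refuse input that
// is not well-formed UTF-8 before it ever reaches the HTML parser.
function assert_valid_utf8(string $html): string
{
    if (!mb_check_encoding($html, 'UTF-8')) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $html;
}
```

A caller would wrap the user-supplied string, for example assert_valid_utf8($html), and handle the exception before calling loadHTML.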

Using nodeValue instead of textContent in the example below gives the same result.

I created this simplified function to clarify the issue:

<?php

function print_content($html)
{
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $div = $dom->getElementById('cyrillic_bit');
    $content = $div->textContent;

    print(mb_internal_encoding().' '.$html."\n");
    print(mb_detect_encoding($content, 'Windows-1251, UTF-8', true)." ");
    print($content."\n");
}

$html = '<div id="cyrillic_bit">Дядо Коледа<br>Error</div>';
print_content($html);

?>

The output is:

UTF-8 <div id="cyrillic_bit">Дядо Коледа<br>Error</div>
UTF-8 ÐÑдо ÐоледаError

I'd like it to be:

UTF-8 <div id="cyrillic_bit">Дядо Коледа<br>Error</div>
UTF-8 Дядо КоледаError
Stoyan Georgiev
  • `$dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));` – Stoyan Georgiev Oct 06 '19 at 14:38
  • @ChrisDekker my question is not database-related, but the question you suggested made me look into `utf8mb4` encoding, and the answer to [PHP DOMDocument loadHTML not encoding UTF-8 correctly](https://stackoverflow.com/questions/8218230/php-domdocument-loadhtml-not-encoding-utf-8-correctly) works for me. – Stoyan Georgiev Oct 06 '19 at 14:46
  • @ChrisDekker the duplicate you flagged was about databases and would provide no help for XML processing; note that if you make an incorrect flag you can always retract it. – miken32 Oct 09 '19 at 13:15
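
The workaround from the first comment can be sketched as follows. This is a variation rather than the exact call from the comment: it swaps in mb_encode_numericentity, because the 'HTML-ENTITIES' pseudo-encoding of mb_convert_encoding is deprecated as of PHP 8.2. The helper name load_utf8_html is hypothetical, and the sketch assumes the input is already valid UTF-8.

```php
<?php

// Hypothetical helper (not from the original post): load a UTF-8 HTML
// string so that DOMDocument does not reinterpret its bytes as ISO-8859-1.
function load_utf8_html(string $html): DOMDocument
{
    // Turn every non-ASCII character into a numeric entity (e.g. &#1044;).
    // The result is pure ASCII, so the charset the parser assumes no longer matters.
    $ascii = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

    $dom = new DOMDocument();
    $dom->loadHTML($ascii);
    return $dom;
}

$dom = load_utf8_html('<div id="cyrillic_bit">Дядо Коледа<br>Error</div>');
$content = $dom->getElementById('cyrillic_bit')->textContent;
print($content . "\n");
```

The design choice here is to sidestep charset detection entirely instead of trying to declare the charset to the parser. The text read back from the DOM is UTF-8 again, because the parser decodes the entities itself.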
