When I load HTML into a DOMDocument, it garbles the non-ASCII characters.
In my project, the HTML source is supplied by the user, so its content may vary greatly.
I'd like to find a secure way of parsing HTML content from various sources.
By secure I mean mainly keeping strings consistent with the original and being protected against invalid-encoding attacks, unless you think I should have additional concerns.
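To make the second concern concrete, here is a minimal sketch of the kind of check I have in mind, assuming the input is expected to be UTF-8. The helper name is_valid_utf8 is hypothetical; it only wraps mb_check_encoding and may not cover every encoding-related attack.

```php
<?php
// Hypothetical helper: reports whether a string is well-formed UTF-8.
// Assumption: user input is supposed to arrive as UTF-8.
function is_valid_utf8(string $html): bool
{
    return mb_check_encoding($html, 'UTF-8');
}

var_dump(is_valid_utf8('Дядо Коледа')); // well-formed UTF-8
var_dump(is_valid_utf8("\xC3\x28"));    // malformed two-byte sequence
```

Input that fails such a check could be rejected before it ever reaches the parser.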
nodeValue does the same as textContent in this case.
I created this simplified function to clarify the issue:
<?php
function print_content($html)
{
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $div = $dom->getElementById('cyrillic_bit');
    $content = $div->textContent;
    print(mb_internal_encoding().' '.$html."\n");
    print(mb_detect_encoding($content, 'Windows-1251, UTF-8', true)." ");
    print($content."\n");
}

$html = '<div id="cyrillic_bit">Дядо Коледа<br>Error</div>';
print_content($html);
?>
The output is:
UTF-8 <div id="cyrillic_bit">Дядо Коледа<br>Error</div>
UTF-8 ÐÑдо ÐоледаError
I'd like it to be:
UTF-8 <div id="cyrillic_bit">Дядо Коледа<br>Error</div>
UTF-8 Дядо КоледаError
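
For reference, one workaround that is often suggested is to convert every non-ASCII character to a numeric entity before calling loadHTML, so the parser never has to guess the encoding. The following is only a sketch under the assumption that the input is valid UTF-8. The function name get_content is hypothetical, and I have not verified how this behaves across different libxml versions.

```php
<?php
// Hypothetical variant of print_content() that returns the text instead of printing it.
// Assumption: $html is valid UTF-8.
function get_content(string $html): ?string
{
    // Replace each code point from U+0080 upwards with a numeric entity such as &#1044;
    // so that the markup handed to the parser is pure ASCII.
    $ascii_html = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

    $dom = new DOMDocument();
    $dom->loadHTML($ascii_html);
    $div = $dom->getElementById('cyrillic_bit');

    // Return null when the element is missing instead of dereferencing null.
    return $div === null ? null : $div->textContent;
}

print(get_content('<div id="cyrillic_bit">Дядо Коледа<br>Error</div>') . "\n");
```

This addresses only the first concern (keeping strings consistent with the original); it does not by itself validate the input.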