How to get DOMDocument from HTML with Cyrillic letters in PHP 8

Question

In PHP 7 I used the following code to get a DOMDocument object from HTML that may contain Cyrillic letters:

$pageHTML = '<!doctype html>
<html>
<head>
</head>
<body>
<div>Текст</div>
</body>
</html>';

$pageHTML = mb_convert_encoding($pageHTML, 'HTML-ENTITIES', 'UTF-8');

$dom = new DOMDocument;
$dom->loadHTML($pageHTML);

echo $dom->getElementsByTagName('div')[0]->textContent;

Now, in PHP 8, it produces the following deprecation message:

mb_convert_encoding(): Handling HTML entities via mbstring is deprecated; use htmlspecialchars, htmlentities, or mb_encode_numericentity/mb_decode_numericentity instead in ...

What exactly should I use now in PHP 8?

Literally what the message says. But your current code does not convert **to** UTF-8, it converts **from** UTF-8. Is that alright? Why do you need 7-bit ASCII in 2023? — Álvaro González, Feb 02 '23 at 10:03
@ÁlvaroGonzález I just need to get DOMDocument from HTML with cyrillic letters — stckvrw, Feb 02 '23 at 10:10
Apparently, your source is already fully compatible with Unicode (that's what the `U` in UTF-8 stands for). What encoding is _your_ code using? — Álvaro González, Feb 02 '23 at 10:14
UTF-8 is set in my code editor (if I understand you correctly). But I get the output I mentioned in the comment under the answer of @Umair Malik — stckvrw, Feb 02 '23 at 10:22
`ÐÐ°Ð³Ð¾Ð»Ð¾Ð²Ð¾Ðº Body: Ð¢ÐµÐ»Ð¾` is what's called [mojibake](https://en.wikipedia.org/wiki/Mojibake): "garbled text that is the result of text being decoded using an unintended character encoding". You're using UTF-8 (great choice!) but some part of your stack is defaulting to something else. Doing blind conversions can either mask the problem or make it worse, but it isn't a sustainable solution. We have a classic question about that: [UTF-8 all the way through](https://stackoverflow.com/questions/279170/utf-8-all-the-way-through) — Álvaro González, Feb 03 '23 at 07:45
@ÁlvaroGonzález When I print `$pageHTML`, I receive normal cyrillic text. But when I `loadHTML()` to `$dom`, then get a node and print its `nodeValue` (or `textContent`) I receive mojibake — stckvrw, Feb 03 '23 at 09:33
That external HTML might not be using UTF-8, or it can also declare to be using some encoding but actually use a different one. Where do you get it from? — Álvaro González, Feb 03 '23 at 09:38
@ÁlvaroGonzález It's just a PHP file with the code I posted in the question — stckvrw, Feb 03 '23 at 09:50
Ah! I totally forgot something: https://stackoverflow.com/questions/8218230/php-domdocument-loadhtml-not-encoding-utf-8-correctly — Álvaro González, Feb 03 '23 at 09:56
Yes, while you were posting your comment I found the same thing here: https://stackoverflow.com/a/39148696 It resolved the issue, thanks! — stckvrw, Feb 03 '23 at 10:02
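
For reference, here is a minimal sketch of what the deprecation message itself suggests: replacing the `'HTML-ENTITIES'` conversion with `mb_encode_numericentity()` before calling `loadHTML()`. The helper name `loadUtf8Html` and the conversion map are illustrative assumptions and are not taken from the linked answers; the sketch also assumes the input string is valid UTF-8.

```php
<?php
// Hypothetical helper: builds a DOMDocument from a UTF-8 HTML string
// without using the deprecated 'HTML-ENTITIES' pseudo-encoding.
function loadUtf8Html(string $html): DOMDocument
{
    // Conversion map: [first code point, last code point, offset, mask].
    // Every character from U+0080 to U+10FFFF becomes a numeric entity
    // (for example "Т" becomes "&#1058;"); ASCII is left untouched.
    $convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    $encoded = mb_encode_numericentity($html, $convmap, 'UTF-8');

    $dom = new DOMDocument();
    $dom->loadHTML($encoded);

    return $dom;
}

$pageHTML = '<!doctype html>
<html>
<head>
</head>
<body>
<div>Текст</div>
</body>
</html>';

$dom = loadUtf8Html($pageHTML);
echo $dom->getElementsByTagName('div')->item(0)->textContent;
```

Because the parser then sees only ASCII plus numeric entities, it no longer has to guess the input encoding. Another commonly cited workaround is to prepend an XML encoding declaration such as `<?xml encoding="UTF-8">` to the string passed to `loadHTML()`; which of the two fits better depends on whether the original markup must be preserved byte for byte.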

