In PHP 7, I used the following code to get a DOMDocument object from HTML that may contain Cyrillic letters:

```php
$pageHTML = '<!doctype html>
<html>
<head>
</head>
<body>
<div>Текст</div>
</body>
</html>';

$pageHTML = mb_convert_encoding($pageHTML, 'HTML-ENTITIES', 'UTF-8');

$dom = new DOMDocument;
$dom->loadHTML($pageHTML);

echo $dom->getElementsByTagName('div')[0]->textContent;
```

Now, in PHP 8, it emits this deprecation notice:

mb_convert_encoding(): Handling HTML entities via mbstring is deprecated; use htmlspecialchars, htmlentities, or mb_encode_numericentity/mb_decode_numericentity instead in ...

What exactly should I use now in PHP 8?

stckvrw
  • Literally what the message says. But your current code does not convert **to** UTF-8, it converts **from** UTF-8. Is that alright? Why do you need 7-bit ASCII in 2023? – Álvaro González Feb 02 '23 at 10:03
  • @ÁlvaroGonzález I just need to get DOMDocument from HTML with cyrillic letters – stckvrw Feb 02 '23 at 10:10
  • Apparently, your source is already fully compatible with Unicode (that's what the `U` in UTF-8 stands for). What encoding is _your_ code using? – Álvaro González Feb 02 '23 at 10:14
  • UTF-8 is set in my code editor (if I understand you correctly). But I get the output I mentioned in the comment under the answer of @Umair Malik – stckvrw Feb 02 '23 at 10:22
  • `Ðаголовок Body: Тело` is what's called [mojibake](https://en.wikipedia.org/wiki/Mojibake): "garbled text that is the result of text being decoded using an unintended character encoding". You're using UTF-8 (great choice!) but some part of your stack is defaulting to something else. Doing blind conversions can either mask the problem or make it worse, but it isn't a sustainable solution. We have a classic question about that: [UTF-8 all the way through](https://stackoverflow.com/questions/279170/utf-8-all-the-way-through) – Álvaro González Feb 03 '23 at 07:45
  • @ÁlvaroGonzález When I print `$pageHTML`, I get normal Cyrillic text. But when I pass it to `loadHTML()` on `$dom`, then get a node and print its `nodeValue` (or `textContent`), I get mojibake – stckvrw Feb 03 '23 at 09:33
  • That external HTML might not be using UTF-8, or it can also declare to be using some encoding but actually use a different one. Where do you get it from? – Álvaro González Feb 03 '23 at 09:38
  • @ÁlvaroGonzález It's just a PHP file with the code I posted in the question – stckvrw Feb 03 '23 at 09:50
  • Ah! I totally forgot something: https://stackoverflow.com/questions/8218230/php-domdocument-loadhtml-not-encoding-utf-8-correctly – Álvaro González Feb 03 '23 at 09:56
  • Yes, while you were posting your comment, I found the same thing here: https://stackoverflow.com/a/39148696 It resolved the issue, thanks! – stckvrw Feb 03 '23 at 10:02
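
The linked answers are not reproduced in this thread, so what follows is only a minimal sketch of one possible replacement, not necessarily the fix the asker ended up using. It relies on `mb_encode_numericentity()`, one of the functions named in the deprecation message, to turn every non-ASCII character into a numeric entity before `loadHTML()` sees the markup. The helper name `loadUtf8Html` and the conversion map are assumptions made for illustration.

```php
<?php
// Hypothetical helper (the name is not from the original post).
// Assumes $html is valid UTF-8.
function loadUtf8Html(string $html): DOMDocument
{
    // Map format: [first code point, last code point, offset, mask].
    // This range covers every code point from U+0080 to U+10FFFF, so each
    // non-ASCII character becomes a decimal entity such as &#1058;.
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    $asciiHtml = mb_encode_numericentity($html, $map, 'UTF-8');

    $dom = new DOMDocument;
    // The input is now pure ASCII, so the parser's default encoding
    // guess can no longer garble the Cyrillic text.
    $dom->loadHTML($asciiHtml);

    return $dom;
}

$pageHTML = '<!doctype html>
<html>
<head>
</head>
<body>
<div>Текст</div>
</body>
</html>';

$dom = loadUtf8Html($pageHTML);

echo $dom->getElementsByTagName('div')->item(0)->textContent; // prints "Текст"
```

Unlike the deprecated `'HTML-ENTITIES'` conversion, this produces only numeric entities, never named ones such as `&nbsp;`, and the parser decodes both kinds the same way.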
