I am trying to print all the <p> elements of a particular HTML document fetched from a URL. The HTML document is using UTF-8 encoding.

This is my code:

<?php
    error_reporting(E_ALL);
    ini_set('display_errors', 1);
    header('Content-Type: text/plain; charset=utf-8');
    header('Access-Control-Allow-Origin: *');
    header('Access-Control-Allow-Methods: POST, GET, OPTIONS');

    $url = "https://www.sangbadpratidin.in/kolkata/ispat-express-met-an-accident-near-howrah-junction/#.Y7qC6YFeT80.whatsapp";

    $user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36"; 
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_VERBOSE, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_URL,$url);
    $html=curl_exec($ch);

    if (!curl_errno($ch)) {
        $resultStatus = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        if ($resultStatus == 200) {
            @$DOM = new DOMDocument;
            @$DOM->loadHTML($html);
            
            $bodies = $DOM->getElementsByTagName('p');
            foreach($bodies as $body){
                $para = $body->nodeValue;
                echo $para;
            }
        }
    }
?>

The HTML document is filled with Bengali characters. When I try to print the values, this is what gets printed:

সà§à¦¬à§à¦°à¦¤ বিশà§à¦¬à¦¾à¦¸: ফà§à¦° দà§à¦°à§à¦à¦à¦¨à¦¾à¦° à¦à¦¬à¦²à§ দà§à...

Why am I not getting the original text? Please help me

  • _"Why am I not getting the original text?"_ - well, you kinda _are_ ... but you are not _interpreting_ it in the correct character encoding. So go figure out which encoding that page uses, and then convert the received data into the encoding your system is using, if necessary. – CBroe Feb 01 '23 at 08:00
  • It's using `UTF-8` encoding, and so am I: `header('Content-Type: text/plain; charset=utf-8');` –  Feb 01 '23 at 08:05
  • _"It's using UTF-8 encoding"_ - no, apparently it isn't. I am guessing this is supposed to look something like `쎠슦슸쎠슧쎠슦슬...`? Yeah, that's what I get when I convert this from UTF-16 to UTF-8. – CBroe Feb 01 '23 at 08:19
  • No, it's Bengali text; it looks like this: `ফের দুর্ঘটনার কবলে দূরপাল্লার...`. And yes, it is **UTF-8**; I got that by running `document.characterSet` in the console of that page –  Feb 01 '23 at 08:22
  • Give us the URL then please, so we can check for ourselves. – CBroe Feb 01 '23 at 08:24
  • https://www.sangbadpratidin.in/kolkata/ispat-express-met-an-accident-near-howrah-junction/#.Y7qC6YFeT80.whatsapp –  Feb 01 '23 at 08:26
  • Then it is probably due to the content encoding, `content-encoding: br`. https://stackoverflow.com/q/51345991/1427878 – CBroe Feb 01 '23 at 08:31
  • @CBroe actually he isn't getting the original text, DOMDocument corrupts it under the assumption that it's windows-1252 encoded text. see my answer below. – hanshenrik Feb 01 '23 at 09:19

2 Answers

Edit: I just tested it, and yes, this fixed it :) See it live at https://dh.ratma.net/test/test2.php

This is a known issue: DOMDocument does not realize the input is UTF-8, falls back to a single-byte Western encoding (windows-1252 or the closely related ISO-8859-1), and then corrupts the actual UTF-8 multibyte characters. With a bit of luck, replacing

@$DOM->loadHTML($html);

with

@$DOM->loadHTML('<?xml encoding="UTF-8">' . $html);

should fix it.
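
A self-contained sketch of that change, under a few assumptions: the `$html` string is a made-up stand-in for the fetched page, `paragraphTexts()` is a hypothetical helper name, and `libxml_use_internal_errors()` is swapped in for the `@` operator to silence parser warnings. Only the `<?xml encoding="UTF-8">` prefix comes from the fix above, and the source file itself is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical stand-in for the page fetched with cURL (UTF-8 encoded).
$html = '<html><body><p>ফের দুর্ঘটনার কবলে</p><p>second paragraph</p></body></html>';

/**
 * Hypothetical helper: returns the text of every <p> element.
 * The prepended XML prolog hints to the parser that the bytes are UTF-8.
 */
function paragraphTexts(string $html): array
{
    // Collect parser warnings internally instead of suppressing them with @.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $texts = [];
    foreach ($dom->getElementsByTagName('p') as $p) {
        $texts[] = $p->nodeValue;
    }
    return $texts;
}

foreach (paragraphTexts($html) as $text) {
    echo $text, PHP_EOL;
}
```

Whether the prefix is honoured may depend on the libxml version PHP was built against, so it is worth checking the output on the target server.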

hanshenrik

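Changing `$DOM->loadHTML($html)` to `$DOM->loadHTML(mb_convert_encoding($html, "HTML-ENTITIES", "UTF-8"))` seems to resolve the issue. Note that the `HTML-ENTITIES` target of `mb_convert_encoding()` is deprecated as of PHP 8.2, so newer versions emit a deprecation notice for this call.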

Source: PHP DOMDocument loadHTML not encoding UTF-8 correctly
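
One possible way to keep the same idea without the `HTML-ENTITIES` pseudo-encoding is to turn every non-ASCII character into a numeric entity with `mb_encode_numericentity()`. This is only a sketch and not the method from the linked answer; the sample `$html` string and the helper name `toNumericEntities()` are assumptions, and the mbstring extension must be enabled.

```php
<?php
// Hypothetical helper: replaces every character above U+007F with a
// decimal numeric entity, leaving the ASCII markup untouched.
function toNumericEntities(string $utf8Html): string
{
    // Convmap entries: [first code point, last code point, offset, mask].
    $convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($utf8Html, $convmap, 'UTF-8');
}

// Made-up sample input standing in for the fetched page.
$html = '<html><body><p>ফের দুর্ঘটনার কবলে</p></body></html>';

$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument();
// The parser now only sees ASCII, so its encoding guess no longer matters.
$dom->loadHTML(toNumericEntities($html));
libxml_clear_errors();
libxml_use_internal_errors($previous);

foreach ($dom->getElementsByTagName('p') as $p) {
    echo $p->nodeValue, PHP_EOL;
}
```

Since the entities are decoded by the parser, `nodeValue` comes back as ordinary UTF-8 text, which matches the `charset=utf-8` header sent in the question's script.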

Marco
  • If this question is a duplicate of the linked one, please mark it as such – Nico Haase Feb 01 '23 at 09:10
  • @NicoHaase It's not a duplicate. The referenced question is about `DOMDocument` specifically, this question is not about `DOMDocument`. – Marco Feb 01 '23 at 09:13