I am fetching HTML with cURL from two websites.

SITE 1: https://xperia.sony.jp/campaign/360RA/?s_tc=somc_co_ext_docomo_360RA_banner

SITE 2: https://www.fidelity.jp/fwe-top/?utm_source=outbrain&utm_medium=display&utm_campaign=similar-gdw&utm_content=FS001&dicbo=v1-b6eb7c5f86a6978bba74e3703a046886-00d8ad90c4cb65b2bdcc239bcccf5ec378-mnrtcytfgu4toljwgjrwgljumu4wmljzg5tgkljxgzsdgzbqmyzwenbsgy

My cURL looks like:

$ua = "Mozilla/5.0 (X11; Linux i686; rv:36.0) Gecko/20100101 Firefox/36.0 SeaMonkey/2.33.1";
$options = array(
    CURLOPT_RETURNTRANSFER => true, // return the page as a string instead of printing it
    CURLOPT_FAILONERROR    => true, // fail on HTTP status codes >= 400
    CURLOPT_FOLLOWLOCATION => true, // follow redirects
    CURLOPT_ENCODING       => "",   // accept all supported content encodings
    CURLOPT_USERAGENT      => $ua,  // who am I
    CURLOPT_AUTOREFERER    => true, // set Referer on redirect
    CURLOPT_CONNECTTIMEOUT => 10,   // timeout on connect (seconds)
    CURLOPT_TIMEOUT        => 10,   // timeout on the whole request (seconds)
    CURLOPT_MAXREDIRS      => 5,    // follow at most 5 redirects
    CURLOPT_FORBID_REUSE   => true, // close the connection after the transfer
);

$ch = curl_init($url);
curl_setopt_array($ch, $options);
$content = curl_exec($ch);

// Use XPath or str_get_html($content) to parse

The first URL is decoded correctly and its characters display as expected.

Example: $title_string = $html->find("title", 0)->plaintext returns the <title> text with the characters correctly encoded.

The second URL shows square boxes and garbage such as ¤ããªãããi��Ɨ� instead. If I call utf8_decode($title_string), the second URL displays correctly.

The problem is that utf8_decode($title_string) then turns the first URL's title into square boxes.

Is there a universal way to handle both cases?

I have tried:

$charset = mb_detect_encoding($str);
if ($charset == "UTF-8") {
    return utf8_decode($str);
} else {
    return $str;
}

It seems both strings are detected as UTF-8, so this check treats them identically. One works, the other shows square boxes.

I have also tried the suggestions in:

php curl response encoding

Strange behaviour when encoding cURL response as UTF-8

Replace unicode character

https://www.php.net/manual/en/function.mb-convert-encoding.php

Which charset should i use for multilingual website?

French and Chinese characters are not appearing correctly

And many more

I have spent many hours trying to solve this. Any idea is welcome.

ErickBest

2 Answers

Both pages are UTF-8 encoded, and cURL returns the bytes as is. The problem lies in the subsequent processing: assuming libxml2 is involved, it tries to guess the encoding from the <meta> elements, and if none declares a charset it assumes ISO-8859-1. It can be forced to assume UTF-8 by prepending a UTF-8 BOM ("\xEF\xBB\xBF") to the HTML.
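
To illustrate why utf8_decode() appeared to fix the second page, and why mb_detect_encoding() cannot tell the two cases apart, here is a minimal sketch. It assumes the parser treated the UTF-8 bytes as ISO-8859-1, as described above. The sample string and variable names are made up for the demonstration, and the mbstring extension is assumed to be available.

```php
<?php
// Sample UTF-8 text (an assumption for this demo, not taken from either page).
$original = "日本語";

// Simulate the misreading: treat the UTF-8 bytes as ISO-8859-1 characters
// and re-encode each of them as UTF-8. The result is mojibake.
$garbled = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

// The mojibake is still well-formed UTF-8, so encoding detection
// cannot distinguish it from a correctly decoded string.
$garbledIsValidUtf8  = mb_check_encoding($garbled, 'UTF-8');
$originalIsValidUtf8 = mb_check_encoding($original, 'UTF-8');

// Converting UTF-8 back to ISO-8859-1 undoes the extra encoding step.
$restored = mb_convert_encoding($garbled, 'ISO-8859-1', 'UTF-8');

var_dump($garbledIsValidUtf8, $originalIsValidUtf8, $restored === $original);
```

utf8_decode() performs the same UTF-8 to ISO-8859-1 conversion (it is deprecated as of PHP 8.2, which is why mb_convert_encoding() is used here). Applied to a string that was decoded correctly in the first place, that conversion replaces every character outside ISO-8859-1 with "?", which is consistent with the first URL breaking.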

cmb

Following @cmb's answer above, here is my final code in full for anyone who wants the details:

$url = "https://stackoverflow.com/";

// $html must be the raw HTML string, e.g. from file_get_contents()
// or from the cURL call in the question
$html = file_get_contents($url);

libxml_use_internal_errors(true); // collect libxml warnings instead of silencing them with @

$doc = new DOMDocument();
$doc->loadHTML("\xEF\xBB\xBF" . $html); // this is where and how you prepend the BOM
$xpath = new DOMXPath($doc);
$query = '//*/meta[starts-with(@property, \'og:\')]';
$metas = $xpath->query($query);
$rmetas = array();

foreach ($metas as $meta) {
    $property = $meta->getAttribute('property');
    $content = $meta->getAttribute('content');
    $rmetas[$property] = $content;
}

var_dump($rmetas);
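
If the same fix is needed in several places, it could be wrapped in a small helper. The sketch below is only an illustration: load_utf8_html() is a hypothetical name, not part of PHP or of the answer above, and it assumes the input really is UTF-8 (which was the case for both pages in the question).

```php
<?php
// Hypothetical helper: parse UTF-8 HTML with DOMDocument whether or not
// the page declares its charset in a <meta> element.
function load_utf8_html(string $html): DOMDocument
{
    $bom = "\xEF\xBB\xBF";

    // Prepend the BOM only if the document does not already start with one.
    if (strncmp($html, $bom, strlen($bom)) !== 0) {
        $html = $bom . $html;
    }

    // Collect parser warnings silently, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}

// Two made-up sample pages: one without a charset declaration, one with it.
$withoutMeta = '<html><head><title>日本語</title></head><body></body></html>';
$withMeta    = '<html><head><meta charset="utf-8"><title>日本語</title></head><body></body></html>';

foreach ([$withoutMeta, $withMeta] as $page) {
    $doc = load_utf8_html($page);
    echo $doc->getElementsByTagName('title')->item(0)->textContent, PHP_EOL;
}
```

Because the BOM only tells the parser how to read the bytes, pages in another encoding would first have to be converted to UTF-8, for example based on the charset in the Content-Type response header.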

Hope it helps someone in the same peril.

ErickBest