I am fetching HTML from TWO websites with cURL.
SITE 1: https://xperia.sony.jp/campaign/360RA/?s_tc=somc_co_ext_docomo_360RA_banner
My cURL setup looks like this:
$ua= "Mozilla/5.0 (X11; Linux i686; rv:36.0) Gecko/20100101 Firefox/36.0 SeaMonkey/2.33.1";
$options = array(
CURLOPT_RETURNTRANSFER => true, // return web page
CURLOPT_FAILONERROR => true,
CURLOPT_FOLLOWLOCATION => true, // follow redirects
CURLOPT_ENCODING => "", // handle all encodings
CURLOPT_USERAGENT => $ua, // who am i
CURLOPT_AUTOREFERER => true, // set referer on redirect
CURLOPT_CONNECTTIMEOUT => 10, // timeout on connect
CURLOPT_TIMEOUT => 10, // timeout on response
CURLOPT_MAXREDIRS => 5,
CURLOPT_FORBID_REUSE => true, // close the connection after the request
);
$ch = curl_init($url);
curl_setopt_array($ch, $options);
$content = curl_exec($ch);
//Use xPath or str_get_html($content) to parse
The FIRST URL comes back correctly encoded and displays its characters as expected.
Example: $title_string = $html->find("title",0)->plaintext returns the <title> tag text with the characters intact.
The SECOND URL shows SQUARE BOXES, e.g. ¤ããªãããi��Ɨ�. But if I run utf8_decode($title_string), the SECOND URL shows correctly encoded characters as expected.
The problem is that once utf8_decode($title_string) is applied, the FIRST URL shows SQUARE BOXES instead.
Is there a universal way to solve this for both sites?
I have tried
$charset= mb_detect_encoding($str);
if( $charset=="UTF-8" ) {
return utf8_decode($str);
}
else {
return $str;
}
It seems both strings come back from cURL detected as UTF-8, so the check cannot tell them apart. One displays correctly, the other shows square boxes.
I have also tried mb_convert_encoding (https://www.php.net/manual/en/function.mb-convert-encoding.php) and many other approaches.
I have spent hours trying to solve this. Any ideas are welcome.
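
One possible direction, offered only as a sketch: the fact that utf8_decode() repairs the SECOND URL suggests (this is an assumption, not something confirmed for that site) that its text has been UTF-8 encoded twice. Under that assumption, a helper could undo one layer only when doing so is lossless and still yields valid UTF-8, and leave every other string alone. The function name undoDoubleUtf8 is hypothetical, and mb_convert_encoding() to ISO-8859-1 stands in for utf8_decode(), which is deprecated as of PHP 8.2.

```php
<?php
// Hypothetical helper: reverse one layer of UTF-8 encoding, but only when
// the string looks like it was encoded twice. Otherwise return it unchanged.
function undoDoubleUtf8(string $str): string
{
    // Not valid UTF-8 at all: this heuristic does not apply.
    if (!mb_check_encoding($str, 'UTF-8')) {
        return $str;
    }

    // Same effect as utf8_decode(): map each code point to one Latin-1 byte.
    $decoded = mb_convert_encoding($str, 'ISO-8859-1', 'UTF-8');

    // Encoding the result again must give back the input exactly; if not,
    // the string held characters outside Latin-1 (e.g. real Japanese text).
    $roundTrip = mb_convert_encoding($decoded, 'UTF-8', 'ISO-8859-1');
    if ($roundTrip !== $str || $decoded === $str) {
        return $str;
    }

    // Accept the decoded bytes only if they are themselves valid UTF-8.
    return mb_check_encoding($decoded, 'UTF-8') ? $decoded : $str;
}
```

With this, $title_string = undoDoubleUtf8($title_string) could be applied to both sites. It is a heuristic: a correctly encoded string that happens to consist only of Latin-1 characters forming a valid UTF-8 byte sequence would be changed by mistake, and mojibake produced through Windows-1252 rather than ISO-8859-1 would not be detected.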