33

I have a PHP script that fetches another web page and outputs all of its HTML. Everything works, except for a charset problem: all the Spanish letters look wrong. My PHP file is encoded as UTF-8, and all my other PHP files display correctly, so the server is not the problem. What is missing from this code? PS: when I type the original versions of these garbled characters directly into the PHP file, they display correctly.

header("Content-Type: text/html; charset=utf-8");
function file_get_contents_curl($url)
{
    $ch=curl_init();
    curl_setopt($ch,CURLOPT_HEADER,0);
    curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
    curl_setopt($ch,CURLOPT_URL,$url);
    curl_setopt($ch,CURLOPT_FOLLOWLOCATION,1);
    $data=curl_exec($ch);
    curl_close($ch);
    return $data;
}
$html=file_get_contents_curl($_GET["u"]);
$doc=new DOMDocument();
@$doc->loadHTML($html);
Bora Alp Arat

6 Answers

38

Simple: cURL hands you the response bytes exactly as the server sent them. If the string is UTF-8 and your output expects ISO-8859-1, you just need to decode it with utf8_decode().

Description

string utf8_decode ( string $data )

This function decodes data, assumed to be UTF-8 encoded, to ISO-8859-1.
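
As a rough illustration of that conversion, the sketch below turns a UTF-8 string into ISO-8859-1. It uses mb_convert_encoding() rather than utf8_decode(), because utf8_decode() is deprecated as of PHP 8.2. The sample string is made up, and the mbstring extension is assumed to be installed.

```php
<?php
// Hypothetical sample: "señor" as UTF-8 bytes (ñ is the two bytes 0xC3 0xB1).
$utf8 = "se\xC3\xB1or";

// Convert UTF-8 to ISO-8859-1, the same direction utf8_decode() works in.
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

// In ISO-8859-1, ñ is the single byte 0xF1, so the string is one byte shorter.
echo strlen($utf8), ' -> ', strlen($latin1), "\n";
echo bin2hex($latin1), "\n";
```

This only helps when the page you output is declared as ISO-8859-1. If the page is served as UTF-8, converting to ISO-8859-1 is likely to produce the garbled characters rather than fix them.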

03Usr
julio
16

You can use this header:

   header('Content-type: text/html; charset=UTF-8');

and then decode the string:

 $page = utf8_decode(curl_exec($ch));

It worked for me.

phrogg
amir rasabeh
4
$output = curl_exec($ch);
$result = iconv("Windows-1251", "UTF-8", $output);
Yusuf Y.
Taron
3
function page_title($val){
    // require_once avoids redeclaration errors if page_title() is called more than once.
    require_once dirname(__FILE__).'/simple_html_dom.php';
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL,$val);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:25.0) Gecko/20100101 Firefox/25.0');
    curl_setopt($ch, CURLOPT_ENCODING , "gzip");
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    $return = curl_exec($ch); 
    $encot = false;
    $charset = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);

    curl_close($ch); 
    $html = str_get_html('"'.$return.'"');

    // Prefer the charset from the Content-Type response header, if present.
    if(preg_match('/charset=["\']?([\w.:-]+)/i', (string)$charset, $m)) {
        $c = $m[1];
        $encot = true;
    }
    else {
        // Otherwise fall back to the <meta http-equiv="Content-Type"> tag.
        $lookat=$html->find('meta[http-equiv=Content-Type]',0);
        if($lookat && preg_match('/charset=["\']?([\w.:-]+)/i', $lookat->content, $found))
        {
            $c = $found[1];
            $encot = true;
        }
    }
    $title = $html->find('title',0)->innertext;
    // Convert only when a charset was found and it is not already UTF-8.
    if($encot && strcasecmp($c,'utf-8') !== 0) $title = mb_convert_encoding($title,'UTF-8',$c);

    return $title;
}
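
The charset handling above can also be pulled out into a small standalone helper, which makes it easier to try without a network request. The following is only a sketch: the function name to_utf8() and its parameters are hypothetical, mbstring is assumed to be installed, and the charset named in the header is assumed to be one mbstring recognizes (otherwise mb_convert_encoding() fails).

```php
<?php
// Hypothetical helper: convert a response body to UTF-8 based on the
// charset named in a Content-Type header value.
function to_utf8(string $body, ?string $contentType, string $fallback = 'UTF-8'): string
{
    $charset = $fallback;
    // Pull "charset=..." out of e.g. "text/html; charset=ISO-8859-1".
    if ($contentType !== null && preg_match('/charset=["\']?([\w.:-]+)/i', $contentType, $m)) {
        $charset = $m[1];
    }
    // Nothing to do when the body is already declared as UTF-8.
    if (strcasecmp($charset, 'UTF-8') === 0) {
        return $body;
    }
    return mb_convert_encoding($body, 'UTF-8', $charset);
}

// Example with a made-up ISO-8859-1 body ("niño", ñ = 0xF1).
echo bin2hex(to_utf8("ni\xF1o", 'text/html; charset=ISO-8859-1')), "\n";
```

With cURL, the second argument would typically come from curl_getinfo($ch, CURLINFO_CONTENT_TYPE), read before curl_close($ch).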
3

I was fetching a Windows-1252 encoded file via cURL, and mb_detect_encoding(curl_exec($ch)) returned UTF-8. I tried utf8_encode(curl_exec($ch)) and the characters came out correct.
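
A minimal sketch of the same idea, assuming the fetched bytes really are Windows-1252 (the sample string is made up, and mbstring is assumed to be installed). utf8_encode() treats its input as ISO-8859-1, which agrees with Windows-1252 for accented letters but not for characters such as the euro sign, so naming the source encoding explicitly may be the safer route.

```php
<?php
// Hypothetical sample: "año €" in Windows-1252 (ñ = 0xF1, € = 0x80).
$cp1252 = "a\xF1o \x80";

// Name the source encoding explicitly instead of relying on detection.
$utf8 = mb_convert_encoding($cp1252, 'UTF-8', 'Windows-1252');

// mb_check_encoding() confirms the result is well-formed UTF-8.
var_dump(mb_check_encoding($utf8, 'UTF-8'));
echo bin2hex($utf8), "\n";
```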

michalzuber
3

First method (internal function)

The best way I have tried so far is urlencode(). Keep in mind that you should not apply it to the whole URL; use it only on the parts that need it. For example, if a request has two fields, 'text-fa' and 'text-en', containing Persian and English text respectively, you may only need to encode the Persian text, not the English one.
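
A small sketch of what that might look like. The endpoint and field values are hypothetical, and the source file is assumed to be saved as UTF-8 so that the Persian literal is percent-encoded as UTF-8 bytes.

```php
<?php
// Hypothetical endpoint and field values, for illustration only.
$base   = 'https://example.com/translate';
$textFa = 'سلام دنیا';   // Persian text: needs percent-encoding
$textEn = 'hello';        // plain ASCII: safe as-is

// Encode only the value that needs it, not the whole URL.
$url = $base . '?text-fa=' . urlencode($textFa) . '&text-en=' . $textEn;

echo $url, "\n";
```

Encoding every value with urlencode() is also harmless for plain ASCII text like 'hello', so in practice it is often simpler to encode all values and leave only the URL structure (scheme, host, '?', '&', '=') untouched.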

Second Method (using cURL function)

However, there are better ways if the range of characters that has to be encoded is more limited. One of them is to set CURLOPT_ENCODING by passing it to curl_setopt():

curl_setopt($ch, CURLOPT_ENCODING, "");
MAChitgarha
    ENCODING accepts the following: The contents of the "Accept-Encoding: " header. This enables decoding of the response. Supported encodings are "identity", "deflate", and "gzip". If an empty string, "", is set, a header containing all supported encoding types is sent. – phrogg Oct 24 '20 at 12:42