0

As I understand it, `loadHTML` treats its input as Latin-1 (ISO-8859-1) by default, and I want the text to come out as UTF-8. The code is the following:

    // get data from website
    function get_url_contents($url){

            $crl = curl_init();
            $timeout = 5;
            curl_setopt ($crl, CURLOPT_ENCODING, 'UTF-8');
            curl_setopt ($crl, CURLOPT_URL,$url);
            curl_setopt ($crl, CURLOPT_RETURNTRANSFER, 1);        
            curl_setopt ($crl, CURLOPT_CONNECTTIMEOUT, $timeout);
            $ret = curl_exec($crl);
            curl_close($crl);
            return $ret;
    }

    // Now here is the domdoc
    function get_all_meta_tags($html){

        $html = get_url_contents($html);

        $doc = new DOMDocument('1.0', 'UTF-8');

        $doc->encoding = 'UTF-8';

        $nodes = $doc->getElementsByTagName('title');
        $title = $nodes->item(0)->nodeValue;
        $arr['title']=$title;

        $nodes = $doc->getElementsByTagName('h1');
        $h1 = $nodes->item(0)->nodeValue;
        $arr['h1']=$h1;

        $metas = $doc->getElementsByTagName('meta');

        for ($i = 0; $i < $metas->length; $i++)
        {
            $mt = $metas->item($i);

            if($mt->getAttribute('name')=='description')
                $dec=$mt->getAttribute('content');$arr['description']=$dec;
            if($mt->getAttribute('name')=='keywords')
                $key=$mt->getAttribute('content');$arr['keywords']=$key;
        }
        return $arr;
    }

As you can see, I'm grabbing data from web pages, and the problem is that the words are not converted to UTF-8. For example, "Az utolsÃ³ dal" needs to be "Az utolsó dal". Can somebody point me to the problem or a solution?

faq
  • That code cannot work: you are not loading the content of the website into the DOMDocument at all. As a side note, there is no need to use cURL here because DOMDocument has a `loadHTMLFile` method. – Gordon Nov 05 '12 at 21:27
  • http://stackoverflow.com/questions/11309194/php-domdocument-failing-to-handle-utf-8-characters/11310258#11310258 – hakre Nov 05 '12 at 21:44
  • Next to what @Gordon writes, you are using `CURLOPT_ENCODING` incorrectly (none of the existing "answers" cover that so far). Take more care with what you do. You need a different approach to check the encoding of the HTML returned by cURL: [get curl respone encoding](http://stackoverflow.com/q/5937943/367456) – hakre Nov 05 '12 at 21:51
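
To illustrate the point about `CURLOPT_ENCODING`: that option sets the `Accept-Encoding` request header (content compression such as gzip), not the character set. The sketch below shows one possible way to read the charset the server declares instead. The names `charset_from_content_type` and `fetch_with_charset` are hypothetical helpers, not part of the question's code, and a server may omit the charset or declare it wrongly, so treat the result as a hint only.

```php
<?php
// Hypothetical helper: extract the charset from a Content-Type header value.
function charset_from_content_type(?string $contentType): ?string
{
    if ($contentType !== null
        && preg_match('/charset\s*=\s*"?([\w.:-]+)/i', $contentType, $m)) {
        return strtoupper($m[1]);
    }
    return null; // the header carried no charset parameter
}

// Hypothetical variant of get_url_contents() that also reports the charset.
function fetch_with_charset(string $url): array
{
    $crl = curl_init($url);
    // CURLOPT_ENCODING is about compression; '' accepts every encoding cURL supports.
    curl_setopt($crl, CURLOPT_ENCODING, '');
    curl_setopt($crl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($crl, CURLOPT_CONNECTTIMEOUT, 5);
    $body = curl_exec($crl);
    // The declared Content-Type, e.g. "text/html; charset=ISO-8859-2".
    $type = curl_getinfo($crl, CURLINFO_CONTENT_TYPE);

    return [
        'body'    => $body,
        'charset' => charset_from_content_type(is_string($type) ? $type : null),
    ];
}
```

With the charset known, the body can be converted to UTF-8 (or the parser told about the encoding) before it is handed to DOMDocument.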

3 Answers

5

There is a hack to force UTF-8 for HTML documents: prepend an XML encoding declaration, which the HTML parser picks up as an encoding hint:

$dom->loadHTML( '<?xml encoding="UTF-8">' . $content );

For your case:

$html = get_url_contents($html);

// this is necessary to prevent DOMDocument errors on HTML5-elements
libxml_use_internal_errors( true );

$doc = new DOMDocument();

// UTF-8 hack, to correctly handle UTF-8 through DOMDocument
$doc->loadHTML( '<?xml encoding="UTF-8">' . $html );
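
As a self-contained illustration of the same hack, here is a minimal sketch that parses a literal string instead of a downloaded page. The sample markup is made up for the example, and the `ó` is spelled out as its UTF-8 bytes (`\xC3\xB3`) so the result does not depend on how the script file is saved.

```php
<?php
// Hypothetical sample input: UTF-8 HTML that carries no charset declaration.
$html = "<html><head><title>Az utols\xC3\xB3 dal</title>"
      . "<meta name=\"description\" content=\"P\xC3\xA9lda\"></head>"
      . "<body><h1>Az utols\xC3\xB3 dal</h1></body></html>";

// Collect parser warnings internally instead of emitting them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
// The markup must actually be loaded; the prefix tells the parser it is UTF-8.
$doc->loadHTML('<?xml encoding="UTF-8">' . $html);

$title = $doc->getElementsByTagName('title')->item(0)->nodeValue;

$description = null;
foreach ($doc->getElementsByTagName('meta') as $meta) {
    if ($meta->getAttribute('name') === 'description') {
        $description = $meta->getAttribute('content');
    }
}

echo $title, PHP_EOL;
echo $description, PHP_EOL;
```

Values read back from DOMDocument are UTF-8 strings, so no further conversion should be needed before output on a UTF-8 page.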
feeela
0

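Do the encoding before creating the DOM document. Note that `utf8_encode` returns the converted string rather than modifying its argument, so the result has to be assigned:

       $html = get_url_contents($html);
       // utf8_encode assumes ISO-8859-1 input and is deprecated as of PHP 8.2
       $html = utf8_encode($html);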
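
Because converting text that is already UTF-8 would garble it a second time, a guarded conversion may be safer. The following is only a sketch: `to_utf8` is a hypothetical helper, it requires the mbstring extension, and the ISO-8859-1 fallback is an assumption about the source page rather than something that can be detected reliably.

```php
<?php
// Hypothetical helper: convert to UTF-8 only when the bytes are not valid UTF-8 already.
function to_utf8(string $bytes, string $assumedCharset = 'ISO-8859-1'): string
{
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes; // already valid UTF-8, leave untouched
    }
    // Assumption: anything else is in $assumedCharset.
    return mb_convert_encoding($bytes, 'UTF-8', $assumedCharset);
}

$latin1 = "Az utols\xF3 dal";     // "ó" as a single ISO-8859-1 byte
$utf8   = "Az utols\xC3\xB3 dal"; // "ó" as two UTF-8 bytes

var_dump(to_utf8($latin1) === $utf8); // converted
var_dump(to_utf8($utf8) === $utf8);   // passed through unchanged
```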
Eduardo Ortiz
0

Check the encoding of your script file; it should be UTF-8.

To do this you can use Notepad++ and convert your script to UTF-8 without BOM.

You can use mb_internal_encoding() to check your internal encoding.
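
A quick way to run both checks might look like the sketch below. It assumes the mbstring extension is available; the `$info` array and its keys are made up for the example. The string literal is valid UTF-8 only if the script file itself is saved as UTF-8.

```php
<?php
$info = [
    // The encoding that mb_* functions assume when none is passed explicitly.
    'internal_encoding' => mb_internal_encoding(),
    // True only when this file's bytes for the literal form valid UTF-8.
    'literal_is_utf8'   => mb_check_encoding('Az utolsó dal', 'UTF-8'),
];

var_dump($info);
```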

Peter Adrian