4

I fetch a UTF-8 page containing Russian text using cURL. If I echo the text, it displays correctly. Then I use the following code:

    $dom = new DOMDocument;

    /*** load the html into the object ***/
    @$dom->loadHTML($html);

    /*** discard white space ***/
    $dom->preserveWhiteSpace = false;

    /*** the table by its tag name ***/
    $tables = $dom->getElementsByTagName('table');

    /*** get all rows from the table ***/
    $rows = $tables->item(0)->getElementsByTagName('tr');

    /*** loop over the table rows ***/
    for ($i = 0; $i <= 5; $i++)
    {
        /*** get each column by tag name ***/
        $cols = $rows->item($i)->getElementsByTagName('td');

        echo $cols->item(2)->nodeValue;

        echo '<hr />';
    }

$html contains Russian text. After this, the line echo $cols->item(2)->nodeValue; displays garbled text instead of Russian. I tried iconv, but it did not help. Any ideas?

cetver
kusanagi

3 Answers

13

I suggest using mb_convert_encoding before loading the UTF-8 page.

    $dom = new DomDocument();   
    $html = mb_convert_encoding($html, 'HTML-ENTITIES', "UTF-8");
    @$dom->loadHTML($html);

Alternatively, you could try this:

    $dom = new DomDocument('1.0', 'UTF-8');
    @$dom->loadHTML($html);
    $dom->preserveWhiteSpace = false;
    ..........
    echo html_entity_decode($cols->item(2)->nodeValue,ENT_QUOTES,"UTF-8");
    .......... 
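
For reference, here is a minimal, self-contained sketch of the first suggestion above. It makes two assumptions that are not part of the original answer: the sample $html string is a hypothetical stand-in for the page fetched with cURL, and mb_encode_numericentity is swapped in for mb_convert_encoding($html, 'HTML-ENTITIES', ...), because the 'HTML-ENTITIES' target is deprecated as of PHP 8.2. The idea is the same: every non-ASCII character becomes an entity before loadHTML sees it.

```php
<?php
// Hypothetical sample input standing in for the page fetched with cURL.
$html = '<html><body><table>'
      . '<tr><td>1</td><td>2</td><td>Привет</td></tr>'
      . '</table></body></html>';

// Encode every character from U+0080 upwards as a numeric entity, so that
// loadHTML() never has to guess the byte encoding of the input.
$encoded = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of using @
$dom->loadHTML($encoded);
libxml_clear_errors();

// Read the third cell of the first row, as in the question.
$rows = $dom->getElementsByTagName('table')->item(0)->getElementsByTagName('tr');
$text = $rows->item(0)->getElementsByTagName('td')->item(2)->nodeValue;

echo $text, "\n";
```

DOMDocument returns node values as UTF-8 strings, so no html_entity_decode call should be needed on the way out in this variant.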
Peter
Asif Mulla
  • From my understanding, loadHTML only supports ISO-8859-1. This is why you have to encode all UTF-8 characters into their entities (which are really references to Unicode code points). If you want to avoid the conversion, you could use loadXML, which supports UTF-8; however, loadXML is very strict about broken elements, and you have to do a lot of string fixes for non-closing elements like
    – Joseph Montanez Jul 12 '11 at 00:20
1

DOMDocument cannot reliably detect the HTML's encoding on its own. You can declare it explicitly, for example:

$doc = new DOMDocument();
$doc->loadHTML('<?xml encoding="UTF-8">' . $html);

// taken from http://php.net/manual/en/domdocument.loadhtml.php#95251
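
A runnable sketch of this prefix approach, under the assumption that the input is a small UTF-8 fragment (the sample $html string and the p element are hypothetical and only there for illustration):

```php
<?php
// Hypothetical UTF-8 input; in the question this would come from cURL.
$html = '<html><body><p>Привет, мир</p></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings out of the output

// The prepended declaration tells the HTML parser to read the bytes as UTF-8.
$doc->loadHTML('<?xml encoding="UTF-8">' . $html);
libxml_clear_errors();

$text = $doc->getElementsByTagName('p')->item(0)->nodeValue;

echo $text, "\n";
```

Compared with entity conversion, this leaves the markup itself untouched and only changes how the parser interprets the bytes.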
bisko
  • It's what it does. HTML is basically an XML document with a given definition. You could always just try it and see if it works. – bisko Oct 06 '10 at 13:19
0

mb_convert_encoding($html, 'HTML-ENTITIES', "UTF-8");

The same thing worked for phpQuery.

P.S. I use phpQuery::newDocument($html); instead of $dom->loadHTML($html);

cofirazak