This follows up on my previous question about parsing images and text from complex XML. The only remaining problem is that I don't get the right encoding. The text is in Greek and the XML file is UTF-8 encoded. This is the code that parses the XML:

$xml = simplexml_load_file('myfile.xml');

$descriptions = $xml->xpath('//item/description');

foreach ( $descriptions as $description_node ) {

    $description_dom = new DOMDocument();
    $description_dom->loadHTML( (string)$description_node );

    $description_sxml = simplexml_import_dom( $description_dom );

    $imgs = $description_sxml->xpath('//img');
    $text = $description_sxml->xpath('//div');

    foreach ($imgs as $image) {
        echo (string)$image['src'];
    }

    foreach ($text as $t) {
        echo (string)$t;
    }
}

If I echo $description_node, the text looks fine, but after I import $description_dom with simplexml_import_dom it looks like this: Ïε ιÏÎ»Î±Î¼Î¹ÎºÎ­Ï ÎºÎ¿Î¹Î½ÏÏηÏεÏ. Using mb_convert_encoding turns it into: ýÃÂñù" ÃÂ. What am I doing wrong?

pano
  • Did you try adding the from_encoding parameter, like `mb_convert_encoding($str, "UTF-7", "EUC-JP");`? Also use the proper encoding for `DOMDocument`. –  Jan 15 '13 at 11:12
  • When you "echo" some string to your browser, make sure you do it from a well formed HTML page with UTF-8 charset specified : ` ` That can already save you from some useless headache. – darma Jan 15 '13 at 11:14
  • `simplexml_load_file` already loads everything in utf-8, try removing the `,'utf-8'` additional conversion maybe – povilasp Jan 15 '13 at 11:16
  • @PeterM yes, but convert it to utf-8 from ..what? – pano Jan 15 '13 at 11:21
  • @pano From the encoding of myfile.xml. Maybe it is in a different encoding? If so, try converting the file *before* passing it to `simplexml_load_file`; in that case `simplexml_load_string` should be used. Also try displaying the raw XML file in a browser to see whether it renders correctly. –  Jan 15 '13 at 14:15

3 Answers

1

Solution: right after $description_dom = new DOMDocument(); I added this line:

$description_html = mb_convert_encoding($description_node, 'HTML-ENTITIES', "UTF-8");

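This converts the non-ASCII UTF-8 characters to HTML entities, so loadHTML, which by default does not treat its input as UTF-8, no longer garbles the Greek text. Instead of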

$description_dom->loadHTML( (string)$description_node );

I now load the converted HTML:

$description_dom->loadHTML( (string)$description_html );
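
As far as I know, the 'HTML-ENTITIES' target of mb_convert_encoding is deprecated as of PHP 8.2, so here is a minimal sketch of a commonly used alternative: prefixing the markup with an XML encoding declaration so that libxml reads it as UTF-8. The helper name loadUtf8Html, the sample fragment, and photo.jpg are made up for illustration and are not part of the original code; behaviour may vary with the libxml version, so treat this as a starting point rather than a guaranteed fix.

```php
<?php
// Hypothetical helper (not from the original post): parse a UTF-8 HTML
// fragment without first converting it to HTML entities.
function loadUtf8Html(string $html): DOMDocument
{
    $dom = new DOMDocument();
    // The XML declaration prefix is a widely used hint that tells libxml
    // the bytes that follow are UTF-8.
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);
    return $dom;
}

// Sample fragment standing in for the content of one <description> node.
$fragment = '<div>Γειά σου κόσμε</div><img src="photo.jpg">';

$sxml = simplexml_import_dom(loadUtf8Html($fragment));

// Same queries as in the question: image sources first, then div text.
foreach ($sxml->xpath('//img') as $image) {
    echo (string)$image['src'], "\n";
}
foreach ($sxml->xpath('//div') as $div) {
    echo (string)$div, "\n";
}
```

In the loop from the question, this would replace the loadHTML call, i.e. $description_dom = loadUtf8Html((string)$description_node);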
pano
  • So, you had special characters encoded as html hex values, etc., right? –  Jan 25 '13 at 18:58
0

Add this to the head of the HTML page where you want the text to be displayed:

<meta http-equiv='Content-Type' content='text/html; charset=utf-8'>

This should render the characters properly.

user1362916
0

Do not convert anything. Just print it with the proper declaration:

header("Content-Type: text/plain; charset=utf-8");

This is all you need to do. Do it at the top of your file, before any output is sent.

Esailija
  • @pano Start with a very simple PHP script like ` `. After that, gradually add your own code and see where it starts to fail. – Esailija Jan 15 '13 at 11:38
  • Already did, and it starts to fail after `$description_sxml = simplexml_import_dom( $description_dom );` – pano Jan 15 '13 at 11:48
  • @pano That function should not be a problem at all. Don't have access to documentation right now to confirm. – Esailija Jan 15 '13 at 11:51
  • If I `echo $description_node` it's OK. If I `echo $description_sxml`, it's not. – pano Jan 15 '13 at 11:54
  • @pano you can avoid most of the stuff by doing `$imgs = $description_node->xpath('//img'); $text = $description_node->xpath('//div');` – Esailija Jan 15 '13 at 11:56
  • Sorry, I meant if I `echo` after `$description_sxml` – pano Jan 15 '13 at 11:58
  • Thanks for the help. The reason I use `simplexml_import_dom` is that it 'clears' the content to pure text with no styling (images too). – pano Jan 15 '13 at 12:08