
The simplexml_load_file() function doesn't handle accented characters correctly. The file is UTF-8 encoded, and the XML declaration has encoding="UTF-8".

I'm importing a UTF-8 encoded XML file with the simplexml_load_file() function. The file contains some accented characters, and when I call print_r() or var_dump(), they are displayed as strange characters.

The first line of the XML file is

<?xml version="1.0" encoding="UTF-8"?>

In the code I'm running the basic call

$xFile = simplexml_load_file($xmlFile);

I'm looping through the SimpleXML object and fetching the word with accented characters like so

$text = (string)$p->i;

Now

var_dump($text);

shows Ge├»rriteerd instead of Geïrriteerd

I've tried file_get_contents() followed by simplexml_load_string(), and I've also tried loading the XML file with DOMDocument, but the same garbled characters are displayed.

Any thoughts on what else I could do?

Note: I'm working on PHP 5.4; that's the PROD version and I can't change it.

Nigel Ren
  • _"when I do a print_r() or var_dump()"_ Assuming you're looking at this in your browser, have you made sure to set the page charset correctly? See [UTF-8 all the way through](https://stackoverflow.com/questions/279170/utf-8-all-the-way-through) – Phil Sep 02 '19 at 05:50
  • The var_dump() output is in a console (SSH), because the parsing is done in a cron job. – Narcis Cotaie Sep 02 '19 at 12:49
  • Ok, so what's the encoding in your console? – Phil Sep 02 '19 at 13:51
  • The console encoding was indeed one of the issues. After that, I identified a `json_encode()` call that was converting the UTF-8 characters to \uXXXX escape sequences. I fixed that by passing `JSON_UNESCAPED_UNICODE` as the second parameter to `json_encode()`. Source of the fix: https://stackoverflow.com/questions/16498286/why-does-the-php-json-encode-function-convert-utf-8-strings-to-hexadecimal-entit – Narcis Cotaie Sep 06 '19 at 13:37
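
For reference, a minimal sketch of the json_encode() behaviour described in the last comment. The sample string is hard-coded here as an assumption; in the real script it would come from the parsed XML. JSON_UNESCAPED_UNICODE is available from PHP 5.4 onwards.

```php
<?php
// Hypothetical sample value: "Geïrriteerd" written as explicit UTF-8 bytes
// so the example does not depend on the encoding of this source file.
$text = 'Ge' . "\xC3\xAF" . 'rriteerd';

// Default behaviour: non-ASCII characters become \uXXXX escape sequences.
$escaped = json_encode($text);

// With the flag, the UTF-8 bytes are written to the JSON output unchanged.
$unescaped = json_encode($text, JSON_UNESCAPED_UNICODE);

echo $escaped, PHP_EOL;
echo $unescaped, PHP_EOL;

// Both forms are valid JSON and decode back to the same string.
var_dump(json_decode($escaped) === json_decode($unescaped));
```

Both outputs represent the same data; the flag only changes how the JSON text looks, which matters when the output is read by people or by tools that do not unescape it.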

1 Answer

The issue was the Windows console's default encoding. I changed the console code page to UTF-8 by running chcp 65001.

@Phil's comment was helpful.
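
A quick way to tell a parsing problem from a display problem is to look at the raw bytes instead of the rendered text. The sketch below is only an illustration: the inline XML document is a hypothetical stand-in for the real file, and the element names are assumed from the snippet in the question. In UTF-8, "ï" (U+00EF) is the two bytes C3 AF, so if those show up in the hex dump, SimpleXML did its job and the terminal is what misrenders them.

```php
<?php
// Hypothetical inline document standing in for the real XML file.
$xml = '<?xml version="1.0" encoding="UTF-8"?>'
     . '<root><p><i>Ge' . "\xC3\xAF" . 'rriteerd</i></p></root>';

$xFile = simplexml_load_string($xml);
$text  = (string) $xFile->p->i;

// Hex dump of the string: independent of the console's code page.
$hex = bin2hex($text);

// An empty pattern with the /u modifier matches only if the subject is valid UTF-8.
$isUtf8 = (preg_match('//u', $text) === 1);

echo $hex, PHP_EOL;
var_dump($isUtf8);
```

If the hex output contains c3af where the "ï" should be and the validity check is true, the data is fine and only the console encoding needs fixing, which is what chcp 65001 did here.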