I was given a UTF-16 XML file to work with. I converted it to UTF-8 (iconv -f UTF16 -t UTF8 'file-utf16.xml' > 'file-utf8.xml'), but the result doesn't seem to be a normal text file. I'm on OS X, and when I open the converted file in Sublime Text 2 the following is shown, and simplexml_load_file returns false.

<?xml version="1.0" encoding="UTF-16" standalone="no"?>
<Item itemno="0000004" desc="" qtyavail="0" unitprice="0" salesprice="0" block="Yes" dnr="No"/>
<Item itemno="000001" desc="" qtyavail="0" unitprice="199.99" salesprice="199.99" block="No" dnr="No"/>
...

When I open it with TextEdit, the characters are all strange: a mixture of Chinese characters and other things, as shown below. There is absolutely no Chinese in the original XML file, just Roman letters, numbers, and other characters typical of XML.

㼼浸敶獲潩㵮ㄢ〮•湥潣楤杮∽呕ⵆ㘱•瑳湡慤潬敮∽潮㼢ਾ䤼整瑩浥潮∽〰〰〰∴搠獥㵣∢焠祴癡楡㵬〢•湵瑩牰捩㵥〢•慳敬灳楲散∽∰戠潬正∽教≳搠牮∽潎⼢ਾ䤼整瑩浥潮∽〰〰㄰•敤捳∽•瑱慹慶汩∽∰甠楮灴楲散∽㤱⸹㤹•慳敬灳楲散∽㤱⸹㤹•汢捯㵫丢≯搠牮∽潎⼢ਾ

Is there something wrong with the encoding? If so, how can I turn this into a regular text file that simplexml_load_file can read? If not, what is the problem here? As it is, simplexml_load_file returns false on this file.

UPDATE: Just realized that when I change the string encoding="UTF-16" to encoding="UTF-8" in the XML file, everything works. Is iconv not enough to convert this to UTF-8?

laketuna

2 Answers

Try opening it in a browser.

The XML needs a single root element in order to be well-formed.

Also, maybe try changing your encoding settings to UTF-8 WITHOUT BOM.

ubergeekCD

For the XML you've provided, and especially with the so-called XML declaration at the beginning of your string:

<?xml version="1.0" encoding="UTF-16" standalone="no"?>

Changing the encoding of the string (as you did with iconv) is only part of the story. You also need to reflect the new encoding in the XML declaration and remove any BOM (byte order mark). One class that does both, re-encoding the string and taking care of the XML declaration, is XMLRecoder.
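To make the two steps concrete, here is a minimal sketch in Python (the thread itself is about PHP, so treat this as an illustration of the idea rather than of XMLRecoder's actual behaviour). The function name `utf16_to_utf8_xml` and the sample document are made up for the example; it assumes the input really is UTF-16 with a leading BOM.

```python
import re


def utf16_to_utf8_xml(raw: bytes) -> bytes:
    """Re-encode UTF-16 XML bytes as UTF-8 and update the XML declaration."""
    # The "utf-16" codec consumes a leading BOM and uses it to pick the byte order.
    text = raw.decode("utf-16")
    # Rewrite only the encoding value inside the leading XML declaration,
    # keeping whichever quote character the declaration used.
    text = re.sub(
        r'^(<\?xml[^>]*?encoding=)(["\'])[^"\']*\2',
        r'\g<1>\g<2>UTF-8\g<2>',
        text,
        count=1,
    )
    # Encoding with "utf-8" (not "utf-8-sig") writes no BOM.
    return text.encode("utf-8")


# Hypothetical sample: a tiny UTF-16 document with a BOM.
sample = (
    '<?xml version="1.0" encoding="UTF-16" standalone="no"?>\n'
    '<Item itemno="000001"/>'
).encode("utf-16")

converted = utf16_to_utf8_xml(sample)
print(converted.decode("utf-8"))
```

After this step the bytes and the declaration agree, which is the condition a parser needs in order to read the file.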

However, in your UTF-16 case this should not be necessary at all, as UTF-16 is supported by SimpleXML (provided your iconv has it, which is normally the case).

So you need to find out exactly which errors you get when simplexml_load_file returns FALSE, as that return value signals an error condition: the XML could not be parsed.

To do so, please set error reporting to the highest level while you're developing. Also log errors and follow the error log. A related Q&A is:

That said, you can certainly use XMLRecoder if it helps.

hakre
  • So, it looks like `simplexml_load_file` will work with the original file if I replace the string `UTF-16` with `UTF-8` without using `iconv`. Is this safe, or should I convert it with `iconv` and replace the string within the file? The file will not load and return `false` if I try to load it as-is. Unfortunately, I'm unable to change error-reporting settings. – laketuna Oct 29 '13 at 17:02
  • Well, what you describe could be a sign of a data-transmission problem, so yes, you need to fix the XML declaration so that it matches the encoding of the document. Since, as you write, the document is actually UTF-8, you can also remove the XML declaration, because UTF-8 is the default encoding. – hakre Oct 29 '13 at 17:07
  • Hmmm, actually the string-replaced file works with `simplexml` only for the first few lines I copied and pasted into another file. As a whole, the file cannot be loaded even with `iconv` and `UTF-8` in the header. This is frustrating. – laketuna Oct 29 '13 at 17:34
  • Well, you need to find out which encoding the original file is actually in. Then you need to verify whether it has a BOM and, if so, whether it is the right one, and also whether the encoding in the XML declaration of that file is the right one. First gather this basic information. Once you have it, it's easy (easier) to say which way to go. – hakre Oct 29 '13 at 17:38
  • I think the original file has BOM. I'm seeing two characters with ASCII codes `255`, `254` in that order. I'm trying to see if I can remove it, but it's not clear how to remove this. All I know about the encoding of the org file is that it says "UTF-16". No one has any other info, but from searching online, it's something called `UTF-16 (LE)`? – laketuna Oct 29 '13 at 17:57
  • The BOM you have (255, 254) is UTF-16 (LE). If the rest of the string is actually UTF-8 encoded, just remove the first two bytes. You can use substr for that: http://php.net/substr or the said `XMLRecoder` class. – hakre Oct 30 '13 at 07:03
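
The BOM check discussed in these comments can be sketched as follows. This is a Python illustration, not the PHP `substr` approach itself, and the helper name `split_bom` is hypothetical. It looks at the first bytes, reports which encoding the BOM announces, and returns the remaining bytes with the BOM removed. It assumes the content starts with ordinary XML text, so a UTF-16 LE BOM is never followed by two zero bytes (which would look like a UTF-32 LE BOM).

```python
import codecs

# Longer BOMs first: the UTF-32 LE BOM begins with the UTF-16 LE BOM bytes.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def split_bom(raw: bytes):
    """Return (encoding, payload): the encoding announced by a leading BOM
    (None if there is no BOM) and the bytes with that BOM stripped."""
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name, raw[len(bom):]
    return None, raw


# The two bytes reported in the comments: 255, 254 (0xFF 0xFE).
raw = bytes([255, 254]) + "<a/>".encode("utf-16-le")
encoding, payload = split_bom(raw)
print(encoding)                  # utf-16-le
print(payload.decode(encoding))  # <a/>
```

If the payload does not decode cleanly with the announced encoding, the BOM and the content disagree, which is the situation the last comment describes.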