I have some problems using XML. I know this is a common question, but the answers I found didn't fix my problem. When I add é, ä or another special character to my XML file with PHP's DOMDocument, it saves the é as xE9 and the ä as xE4. I don't know if this is OK, but when I display the output it shows question marks in those places. I have tried a lot, such as removing and adding the encoding in the XML header in DOMDocument. I also tried reading the XML with file_get_contents and running it through utf8_decode. I tried using an ISO encoding instead, but nothing solved my problem; instead I sometimes got PHP XML parse errors. I must be doing something wrong, but what, and how can I solve it? My XML file looks like this (the xE9 and the xE4 have black backgrounds):

<?xml version="1.0" encoding="UTF-8"?>
<root>
  <row id="1">
    <question>blah</question>
    <answer>blah</answer>
  </row>
  <row id="2">
    <question>xE9</question>
    <answer>xE4</answer>
  </row>
</root>

And here is a part of my PHP XML class:

function __construct($filePath) {
    $this->file = $filePath;
    $this->label = array('Vraag', 'Antwoord');
    $xmlStr = file_get_contents($filePath);
    $xmlStr = utf8_decode($xmlStr);
    $this->xmlDoc = new DOMDocument('1.0', 'UTF-8');
    $this->xmlDoc->preserveWhiteSpace = false;
    $this->xmlDoc->formatOutput = true;
    //$this->xmlDoc->load($filePath);   
    $this->xmlDoc->loadXML($xmlStr);
}       

This is the function that adds a new row:

//creates new xml row and saves it in xml file
function addNewRow($question, $answer) {
    $nextAttr = $this->getNextRowId();
    $parentNode = $this->xmlDoc->documentElement;
    $rowNode = $this->xmlDoc->createElement('row');
    $rowNode = $parentNode->appendChild($rowNode);
    $rowNode->setAttribute('id', $nextAttr);    
    $q = $this->xmlDoc->createElement('question');
    $q = $rowNode->appendChild($q);
    $qText = $this->xmlDoc->createTextNode($question);
    $qText = $q->appendChild($qText);
    $a = $this->xmlDoc->createElement('answer');
    $a = $rowNode->appendChild($a);
    $aText = $this->xmlDoc->createTextNode($answer);
    $aText = $a->appendChild($aText);
    $this->xmlDoc->save($this->file);
}

Everything works fine until I add special characters. Those are shown as question marks.

Zeebats
  • You refer to special chars, yet your XML sample doesn't have any. What do you mean by **it saves the é as xE9 and the ä as xE4**? Your PHP code just shows you loading the XML into a DOMDocument object. – Rolando Isidoro Apr 30 '13 at 21:27
  • The XML example is just an example of the structure of the XML. As I explain above, it saves xE9 and xE4 into the XML file if I add é and ä to the file. This is done through an HTML input field with a PHP function in my class. – Zeebats Apr 30 '13 at 21:51
  • Without the real example, how do you expect to get help? – Rolando Isidoro Apr 30 '13 at 22:08
  • I edited my question. I hope it is clearer now. – Zeebats May 01 '13 at 08:31

1 Answer

Okay, the following is a bit rough and verbose, especially as you have already tried so much. Try to keep fresh eyes and consider that once you make even a small mistake with encoding, the data is often already broken. It is therefore important to understand which mechanics are at work here.

I will try to address some of the mechanics that operate in PHP's DOMDocument. You might find this interesting or daunting, and the solution may turn out to be very simple and not even require a change to your PHP code. I'd like to address it anyway because it is not well documented on Stack Overflow or in the PHP manual, and more reference material helps.

By default, XML is in UTF-8. UTF-8 is pretty much the perfect choice for the internet nowadays. That is not true in every case, but in general it is a safe bet. So XML on its own, with its default encoding UTF-8, is perfectly fine.

What does this mean for DOMDocument? Just that DOMDocument uses this encoding by default and we do not need to care about it. Here is a simple demonstration, with the output shown in the comment:

$doc = new DOMDocument();
$doc->save('php://output');
# <?xml version="1.0"?>

This very short example shows the default UTF-8 encoding of DOMDocument. Even without a root node, the document signals the XML default encoding UTF-8 by not specifying any encoding in the XML declaration: <?xml version="1.0"?>.

You might want to state the encoding explicitly, and you can. This is what the encoding parameter of the DOMDocument constructor is for:

$doc = new DOMDocument('1.0', 'UTF-8');
                               #####  Encoding Parameter
$doc->save('php://output');
# <?xml version="1.0" encoding="UTF-8"?>

As this shows, whatever we pass as the first (version) and second (encoding) parameter is written out, so we could even pass values that are not allowed. But what is allowed in the XML declaration? The XML versions are 1.0 and 1.1, and 1.0 is what you will practically always use, so the version parameter should be 1.0. And which encodings are allowed? The XML spec refers to the IANA character sets; in short, it should (not must) be one of these common ones: UTF-8, UTF-16, ISO-10646-UCS-2, ISO-10646-UCS-4, ISO-8859-1 to ISO-8859-9, ISO-2022-JP, Shift_JIS, EUC-JP. That is already a long list.

So let's take a look at what PHP's DOMDocument allows in practice:
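A small sketch of how the encoding argument affects serialization, using only `createTextNode()` and `saveXML()`. The helper name `serializeWithEncoding()` is made up for this illustration. As far as I know, when no encoding is set libxml2 writes non-ASCII characters as numeric character references instead of raw bytes, so the two results differ in more than the declaration.

```php
<?php
// Hypothetical helper: builds <root>é</root> and serializes it with the given
// document encoding (null means: leave the encoding unset).
function serializeWithEncoding(?string $encoding): string
{
    $doc = $encoding === null
        ? new DOMDocument('1.0')
        : new DOMDocument('1.0', $encoding);

    // "\xC3\xA9" is é as UTF-8 bytes, written as escapes so the example does
    // not depend on the encoding of this source file.
    $doc->appendChild($doc->createElement('root'))
        ->appendChild($doc->createTextNode("\xC3\xA9"));

    return $doc->saveXML();
}

$withoutEncoding = serializeWithEncoding(null);
$withUtf8        = serializeWithEncoding('UTF-8');

echo $withoutEncoding; // declaration has no encoding attribute
echo $withUtf8;        // declaration carries encoding="UTF-8"
```

Comparing the two strings byte by byte (for example with `bin2hex()`) shows where the é ended up in each case.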

$doc = new DOMDocument('♥♥ love, hugs and kisses ♥♥', 'UTF-8');
$doc->save('php://output');
# <?xml version="♥♥ love, hugs and kisses ♥♥" encoding="UTF-8"?>

The encoding is written as expected. The version is only cosmetic here, but it shows that Unicode characters encoded as UTF-8 are passed through. Now let's change the encoding to something different:

$doc = new DOMDocument('♥♥ love, hugs and kisses ♥♥', 'ISO-8859-1');
$doc->save('php://output');
# <?xml version="&#9829;&#9829; love, hugs and kisses &#9829;&#9829;" encoding="ISO-8859-1"?>

Because the Unicode hearts have no place in ISO-8859-1, they are replaced with the corresponding numeric character reference (&#9829;). And what happens if we put an ISO-8859-1 character like ö (the binary string "\xF6" in PHP) directly in there?

$doc = new DOMDocument("♥♥ l\xF6ve, hugs and kisses ♥♥", 'ISO-8859-1');
$doc->save('php://output');
# Warning: DOMDocument::save(): output conversion failed due to conv error,
#          bytes 0xF6 0x76 0x65 0x2C
#                ^^^^  |    |    |
#                "ö"   v    e    ,

This does not work. DOMDocument tells us that the data we provided cannot be turned into ISO-8859-1 output. This is expected: DOMDocument expects all input to be UTF-8. So let's take the ö from Unicode, encoded as UTF-8, this time:

$doc = new DOMDocument('♥♥ löve, hugs and kisses ♥♥', 'ISO-8859-1');
$doc->save('php://output');
# <?xml version="&#9829;&#9829; l�ve, hugs and kisses &#9829;&#9829;" encoding="ISO-8859-1"?>

This is fine despite the question mark in a diamond. My display expects UTF-8, so it cannot show the ISO-8859-1 byte for ö and replaces it with � (U+FFFD, 'REPLACEMENT CHARACTER'). The output itself is correct: the "ö" has been written as ISO-8859-1.
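Because a terminal can render the result misleadingly, it can help to check the bytes rather than the glyphs. A minimal sketch, assuming nothing beyond `saveXML()`, `strpos()` and `bin2hex()`; the element name `q` is arbitrary.

```php
<?php
$doc = new DOMDocument('1.0', 'ISO-8859-1');

// Input is UTF-8: ♥ is E2 99 A5 and ö is C3 B6, written as escapes on purpose.
$doc->appendChild($doc->createElement('q'))
    ->appendChild($doc->createTextNode("\xE2\x99\xA5 l\xC3\xB6ve"));

$xml = $doc->saveXML();

// ♥ has no ISO-8859-1 code point, so it is written as a character reference;
// ö exists in ISO-8859-1 and is written as the single byte F6.
$hasHeartReference = strpos($xml, '&#9829;') !== false;
$hasLatin1Umlaut   = strpos($xml, "l\xF6ve") !== false;

var_dump($hasHeartReference, $hasLatin1Umlaut);
echo bin2hex($xml), "\n"; // full hex dump of what was actually serialized
```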

This makes clear that you can only pass UTF-8 encoded strings into DOMDocument, regardless of the XML encoding you have specified for that document.

So let's break this rule with a UTF-8 document as in your question and add some non-UTF-8 text, for example in ISO-8859-1 or Windows-1252:
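One way to enforce that rule is to check strings before they reach the DOM. A minimal sketch, assuming the mbstring extension is available; `createUtf8TextNode()` is a hypothetical helper, not part of DOMDocument.

```php
<?php
// Hypothetical helper: refuses text that is not valid UTF-8 instead of
// letting it end up as broken bytes in the saved file.
function createUtf8TextNode(DOMDocument $doc, string $text): DOMText
{
    if (!mb_check_encoding($text, 'UTF-8')) {
        throw new InvalidArgumentException(
            'Text is not valid UTF-8, bytes: ' . bin2hex($text)
        );
    }
    return $doc->createTextNode($text);
}

$doc = new DOMDocument('1.0', 'UTF-8');
$ok  = createUtf8TextNode($doc, "l\xC3\xB6ve"); // valid UTF-8, accepted

try {
    createUtf8TextNode($doc, "l\xF6ve");        // lone ISO-8859-1 byte, rejected
    $rejected = false;
} catch (InvalidArgumentException $e) {
    $rejected = true;
}
```

Failing loudly at this point makes it easier to find where the wrong encoding enters the application than a question mark in the saved file does.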

$doc = new DOMDocument('1.0', 'UTF-8');

$doc->appendChild($doc->createElement('root'))
    ->appendChild($doc->createElement('question'))
    ->appendChild($doc->createTextNode("l\xF6ve, hugs and kisses"));

$doc->save('php://output');
# <?xml version="1.0" encoding="UTF-8"?>
# <root><question>l�ve, hugs and kisses</question></root>

Depending on the program you use to view the output, it might show not the replacement character � but "xF6". I would say that is the case with your file editor.

So this is also the solution: when you pass string data into DOMDocument, ensure it is UTF-8 encoded:

->appendChild($doc->createTextNode(utf8_encode("l\xF6ve, hugs and kisses")));
                                   ########### (works with ISO-8859-1 only (!))

# <?xml version="1.0" encoding="UTF-8"?>
# <root><question>löve, hugs and kisses</question></root>
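Note that `utf8_encode()` is deprecated as of PHP 8.2. On current PHP versions the same conversion can be written with `mb_convert_encoding()`, which also names the source encoding explicitly. A sketch assuming the mbstring extension is available and that the incoming text really is ISO-8859-1 (you have to verify that for your own input):

```php
<?php
$latin1 = "l\xF6ve, hugs and kisses"; // ISO-8859-1 bytes, as in the example above
$utf8   = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

$doc = new DOMDocument('1.0', 'UTF-8');
$doc->appendChild($doc->createElement('root'))
    ->appendChild($doc->createElement('question'))
    ->appendChild($doc->createTextNode($utf8));

$xml = $doc->saveXML();
echo $xml;
```

If the input is Windows-1252 rather than ISO-8859-1, pass `'Windows-1252'` as the source encoding instead.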

Or, in your case, tell the browser that your website uses UTF-8. Then you don't need to re-encode anything, because the browser already sends the data in the right encoding. The W3C has collected some useful resources on the topic, which I suggest you read.
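A minimal sketch of what telling the browser can look like: send a `charset` in the `Content-Type` header and repeat it in a `meta` tag, so that form input is submitted as UTF-8. The function name `renderQuestionForm()` and the field names are hypothetical, not taken from the question.

```php
<?php
// Hypothetical page renderer; the field names "question" and "answer" are made up.
function renderQuestionForm(): string
{
    return <<<HTML
<!DOCTYPE html>
<html>
<head>
  <meta charset="UTF-8">
  <title>Add row</title>
</head>
<body>
  <form method="post" accept-charset="UTF-8">
    <input type="text" name="question">
    <input type="text" name="answer">
    <button type="submit">Save</button>
  </form>
</body>
</html>
HTML;
}

// Only send the header while that is still possible (no output sent yet).
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}
$html = renderQuestionForm();
```

With the page declared as UTF-8, the submitted values can go into `createTextNode()` without any re-encoding step.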

hakre
  • Yep, that did the trick: putting the meta tag in the header of my HTML. I tried PHP's utf8_encode and utf8_decode, but that returned different characters like äé. Thanks again hakre. Now my web app is ready to launch. – Zeebats May 02 '13 at 08:11
  • Very good! That is what I assumed as well would be your issue but I also wanted to show how the DOMDocument encoding parameter works so you can better find the place where to specify the encoding (or where to re-encode if at all). BTW which editor were you using for the xXX display? Is that Notepad++ probably? – hakre May 02 '13 at 08:16
  • Notepad++ is a fine editor. As you can see it even showed you the binary sequence (hex value) of the wrongly encoded character. That's more than just displaying a question mark. Is there anything you feel is missing in Notepad++, or are you just curious? – hakre May 02 '13 at 08:30
  • Just wondering if there are better editors. I had Crimson Editor before, and I must say that I prefer Notepad++. I use Eclipse for Android and Java, but I'm not aware of what the best editor is. I see a lot of fellow students have black-screen editors, but I think they just changed the background and text options in their editors. – Zeebats May 02 '13 at 09:20
  • Okay, if you know eclipse, there is a PHP plugin it's called *PDT*. The best PHP IDE you can get is called PHPStorm. So it depends a bit where you feel more at home. There is also Netbeans which has PHP support and that pretty much is it for PHP IDEs. I can not recommend the IDE/Editor by Zend (also Eclipse based), instead either stick to PDT or buy PHPStorm IMHO. Used Crimson years ago, but dropped it. On Windows I use an Editor called EditPlus. – hakre May 03 '13 at 13:58