
I have a problem with wrong character encoding when reading an XML file.

While this one shows the complete content of the file correctly...

$reader = new DOMDocument();
$reader->preserveWhiteSpace  = false;
$reader->load('zip://content.odt#content.xml');
echo $reader->saveXML();

...this one gives me strange output (German umlauts, em dashes, µ or similar characters aren't shown correctly):

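$reader = new DOMDocument();
$reader->preserveWhiteSpace  = false;
$reader->load('zip://content.odt#content.xml');
$elements = $reader->getElementsByTagName('text');
// Initialize the buffer first; appending to an undefined variable raises a warning
$content = '';
foreach ($elements as $node) {
    foreach ($node->childNodes as $child) {
        // nodeValue is always returned as UTF-8
        $content .= $child->nodeValue;
    }
}
echo $content;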

I don't know why this is the case. Hope someone can explain it to me.

hakre
user3142695

1 Answer

DOMDocument::saveXML()

This method returns the whole XML document as a string. As with any XML document, the encoding is the one given in the XML declaration; if none is given, the default encoding is UTF-8.

DOMNode::$nodeValue

Contains the value of a node, most often text. All strings returned by the DOM extension (of which DOMNode is a part) are UTF-8 encoded, regardless of the encoding of the XML document.
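To make the difference concrete, here is a minimal sketch. It assumes a small made-up document declared as ISO-8859-1 (the sample XML is purely illustrative and is not the content.xml from the question). It compares the bytes of nodeValue with the bytes produced by saveXML():

```php
<?php
// Illustrative input only: "\xE4" is the single ISO-8859-1 byte for "ä".
$xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n<root>\xE4</root>";

$doc = new DOMDocument();
$doc->loadXML($xml);

// nodeValue is handed back as UTF-8: "ä" becomes the two bytes C3 A4.
$value = $doc->documentElement->nodeValue;
echo bin2hex($value), "\n";

// saveXML() serializes in the encoding declared by the document,
// so the same character is written as the single byte E4 again.
$serialized = $doc->saveXML();
echo bin2hex($serialized), "\n";
```

If the program displaying the output expects ISO-8859-1, the second string looks right and the first one looks garbled, even though both hold the same character.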

You write that when you output the whole document:

echo $reader->saveXML();

all umlauts are preserved. So most likely the XML itself is shipped in an encoding other than UTF-8, because the latter variant, which always yields UTF-8:

$content .= $child->nodeValue;
...
echo $content;

is not displayed correctly.

As you don't share how and with which application you're displaying and reading the output, not much more can be said.

In the latter case you most likely need to tell the displaying application which character encoding is used. For example, if you display the text in a browser, you should send the appropriate Content-Type header before any output:

header("Content-Type: text/plain; charset=utf-8");
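Put together, this might look like the sketch below. The helper name collectText and the inline sample document are assumptions made for illustration; they stand in for the content.xml from the question. The script itself is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper: concatenates the text of all elements with the given tag name.
function collectText(DOMDocument $doc, string $tagName): string
{
    $content = '';
    foreach ($doc->getElementsByTagName($tagName) as $node) {
        // nodeValue is UTF-8 no matter how the source document was encoded
        $content .= $node->nodeValue;
    }
    return $content;
}

// Illustrative sample document in place of zip://content.odt#content.xml
$doc = new DOMDocument();
$doc->loadXML('<?xml version="1.0" encoding="UTF-8"?><doc><text>Größe: 5 µm</text></doc>');
$content = collectText($doc, 'text');

// Declare the encoding before any output is sent
if (!headers_sent()) {
    header('Content-Type: text/plain; charset=utf-8');
}
echo $content;
```

The header only helps when the output goes to a browser. A terminal or editor has to be set to UTF-8 in its own settings.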

Compare with How to set UTF-8 encoding for a PHP file.

Community
hakre