I'm parsing an XML-Feed which contains UTF-8 encoded characters like this:

<?xml version="1.0" encoding="UTF-8" ?>
<root>
  <value>Ströng</value>
</root>

Reading this file with file_get_contents() returns a malformed StrÃ¶ng:

$file = file_get_contents($path);
print_r($file);

Using $xml = simplexml_load_file($path); yields the same result.

I then tried the utf8_encode() function to correct the character encoding, like this:

$file = utf8_encode(file_get_contents($path));
print_r($file);

But now the content is even more malformed: StrÃƒÂ¶ng. Why is that?

How do I parse UTF-8 encoded XML correctly?


Update:

mb_detect_encoding($file) returns: UTF-8 and utf8_decode() returns Str?ng.

So everything seems correct, yet the output is still wrong?

q9f
  • Because you need to utf8_decode(), or make your PHP script UTF-8. – x4rf41 Aug 28 '13 at 12:34
  • `file_get_contents` does *nothing* with the encoding. You're simply not telling the browser to handle it correctly. See [UTF-8 all the way through](http://stackoverflow.com/questions/279170/utf-8-all-the-way-through) and [What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text](http://kunststube.net/encoding/). – deceze Aug 28 '13 at 12:42
  • `utf8_decode()` returns `Str?ng` – q9f Aug 28 '13 at 12:42
  • `mb_detect_encoding()` returns `UTF-8`, should be fine? – q9f Aug 28 '13 at 12:52

2 Answers

Reading this file with file_get_contents() returns a malformed StrÃ¶ng:

That probably isn't what happens: it's very likely that your output page is encoded in a single-byte encoding like ISO-8859-1. Hence, the two-byte UTF-8 character will show up wrong even though the data is perfectly fine.
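
One way to check that the data itself is intact is to look at the raw bytes instead of the rendered page. The following is a minimal sketch, assuming the iconv extension is available; the inline `$feed` string is a hypothetical stand-in for the file at `$path`. It also suggests why utf8_encode() made things worse: that function treats its input as ISO-8859-1, so it re-encodes each of the two bytes of "ö" separately.

```php
<?php
// Hypothetical inline copy of the feed; \xC3\xB6 is the UTF-8 byte pair for "ö".
$feed = "<?xml version=\"1.0\" encoding=\"UTF-8\" ?>\n<root>\n  <value>Str\xC3\xB6ng</value>\n</root>";

$xml = simplexml_load_string($feed);
$value = (string) $xml->value;

// The parsed value still holds the correct UTF-8 bytes: "ö" is c3 b6.
$hex = bin2hex($value);
echo $hex, "\n"; // 537472c3b66e67

// Converting already-UTF-8 data from ISO-8859-1 to UTF-8 (the same conversion
// utf8_encode() performs) turns each byte of "ö" into its own two-byte sequence.
$double = iconv('ISO-8859-1', 'UTF-8', $value);
$doubleHex = bin2hex($double);
echo $doubleHex, "\n"; // 537472c383c2b66e67
```

If the first hex dump shows `c3b6` for the umlaut, the parsing step is fine and only the declared output encoding needs attention.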

Either:

  • utf8_decode() the result (if you in fact are using ISO-8859-1 for output)
  • use iconv() to convert the result (if you are using a single-byte encoding other than ISO-8859-1)
  • ...or change your output encoding to UTF-8 (preferable because it's the most universal solution.)
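
The first two options could look roughly like the sketch below. It assumes the mbstring and iconv extensions are loaded, and ISO-8859-15 is only an example target, not something taken from the question. Note that utf8_decode() is deprecated as of PHP 8.2, so mb_convert_encoding() is used here for the same UTF-8 to ISO-8859-1 conversion.

```php
<?php
$utf8 = "Str\xC3\xB6ng"; // "Ströng" as UTF-8 bytes

// Option 1: the page is served as ISO-8859-1.
// For this input the result matches what utf8_decode() would return.
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

// Option 2: the page uses another single-byte encoding (example: ISO-8859-15).
$latin9 = iconv('UTF-8', 'ISO-8859-15', $utf8);

// In both encodings "ö" becomes the single byte f6.
echo bin2hex($latin1), "\n"; // 537472f66e67
echo bin2hex($latin9), "\n"; // 537472f66e67

// Option 3 needs no conversion at all: keep $utf8 as is and declare the
// output as UTF-8 instead.
```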
Pekka

Are you setting the charset to UTF-8 in the document where print_r() writes its output? You can do this by adding:

<meta http-equiv="Content-type" content="text/html;charset=UTF-8" />

to the <head> section.

Or, in PHP, send the header yourself before any output: header('Content-Type: text/html; charset=utf-8');
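
Put together, a page that parses the feed and prints the value might look like this. It is only a sketch: the temp file stands in for the real `$path` from the question, and the error handling is a placeholder.

```php
<?php
// Tell the browser the response is UTF-8. This must run before any output.
header('Content-Type: text/html; charset=utf-8');

// Hypothetical sample feed written to a temp file so the sketch is
// self-contained; in the question this would be the existing $path.
$path = tempnam(sys_get_temp_dir(), 'feed');
file_put_contents(
    $path,
    "<?xml version=\"1.0\" encoding=\"UTF-8\" ?>\n<root>\n  <value>Str\xC3\xB6ng</value>\n</root>\n"
);

$xml = simplexml_load_file($path);
unlink($path);
if ($xml === false) {
    exit("Could not parse the feed\n");
}

// SimpleXML hands back UTF-8 strings, so no utf8_encode()/utf8_decode() is
// needed; only escape the value for HTML.
$value = (string) $xml->value;
echo htmlspecialchars($value, ENT_QUOTES, 'UTF-8'), "\n";
```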

Iansen