1

I importing an XML file from an API. I use simplexml_load_file() to load the file. The XML has UTF-8 encoding. In the title of some element special chars accur, like "café". If I output these titles directly in the browser I see them just fine, but I'm storing all the values in a MySQL database. Although the table has UTF-8 encoding the values are stored like "Paardcafé". I tried differend enconding, htmlentities etc but the output stays the same. Any reason why this hapens even though the encoding doesnt change from original file to storing into the database?

Jerome Bohg
  • 476
  • 4
  • 7
  • please look here http://stackoverflow.com/questions/279170/utf-8-all-the-way-through – ins0 Jul 31 '14 at 10:57

2 Answers2

1

Encoding is like a chain, if one link breaks, the whole chain breaks.

As encoding is meta-information next to the actual data, different processes working with that data need to have that meta-information. If one process has the wrong encoding information, a wrong encoding appears.

In your case you have the following chain:

API -> XML -> SimpleXMLElement -> PHP Variable -> Database Link -> Database Store

The problem is in the part at the end, for the API, the XML, the SimpleXMLElement and the PHP Variable you've already checked, that the encoding is correct. So left is the ending part:

PHP Variable -> Database Link -> Database Store
   *good*                            *bad*

As you can see, the Database Link is in between. So what happens there?

The Database Link carries information in which encoding characters passed from PHP are encoded when given to the Database Store.

How does that look like? Let's see your word in question in PHP code:

$word = "café";

Let's say this PHP code is encoded as UTF-8 (it's also here on Stackoverflow when your browser requests the answer). Within the computers memory, it's stored as binary data. That is byte by byte and in that case if you would look into the memory you would see something like this:

636166c3a9

These are five bytes representing the characters of "café" as UTF-8:

c := 63
a := 61
f := 66
é := c3a9

Like with the binary data in memory, it's similiar with how the Database Link transfers the characters to the database server: as binary data.

So the other end of the link needs to know how to interpret this binary sequence. For that, it needs to know the encoding because the database then needs to store the data into the column of which you said it's UTF-8 encoded.

So for example, if the Database Server becomes binary character data via the Database Link which is in a different encoding than the Column needs to have, the Database Server re-encodes the binary data passed through the link into the encoding needed to store the data in the column:

link: <data:latin1>  -- (re-encode) --> column: <data:utf8>

So let's say, the Database Link has the information that the characters would be encoded as ISO-8859-1 (latin 1) while it sends the binary sequence <636166c3a9> from above. And the column needs to have UTF-8. Then the database server would re-encode that sequence:

(latin1) 636166c3a9 ---> (utf8) 636166c383c2a9

However for the same operation, if the database link would have carried the information that the data is encoded as UTF-8, then no re-encoding would be necessary, because it's already in the right encoding for the Column:

(utf8) 636166c3a9 ---> (utf8) 636166c3a9 

So let's compare from those last two examples the binary UTF-8 sequences in readable characters:

636166c383c2a9  := café
636166c3a9      := café

Looks familiar? So even the data in the PHP code has UTF-8 encoding and the encoding of the database column is UTF-8, the encoding can still be broken, if the Database Link carries the wrong encoding information.

So how can you tell the Database Link which encoding to use? That depends on the database driver. You're using Mysql, so you're perhaps using PDO in PHP. For PDO Mysql you add the charset parameter to the DSN, e.g. ";charset=utf8" at the end of it:

$pdo = new PDO("mysql:host=localhost;dbname=world;charset=utf8", "my_user", "my_password");

Or if you're using Mysqli, you call the set_charset method:

$mysqli = new mysqli("localhost", "my_user", "my_password", "world");
$mysqli->set_charset('utf8');

I hope this makes the issue more visible to you and also shows you how you can change the setting.

hakre
  • 193,403
  • 52
  • 435
  • 836
  • this is great stuff and makes the whole understanding of how encoding works so much clearer. Thanks for pointing this out. Although it all gives me a lot more information and knowledge for future problems, the Mysql connection does seem to be set in UTF-8. So I guess I'll have to dig in a little deeper. Thanks for the insights so far and I will keep looking where the encoding chain breaks in my script. – Jerome Bohg Aug 02 '14 at 20:58
  • This just helped me immensely to understand what is wrong with my website, thanks. – Chris Evans Dec 08 '16 at 01:27
0

Check the encoding of the connection from PHP to the database.

Peter
  • 138
  • 3