0

Why does PHP store characters such as Japanese in MySQL table that supports utf8 as something else but successfully reads the value back out from MySQL as the original string?

E.g.

$db = new mysqli("localhost", "user", "pwd", "test");
$sql = "INSERT INTO testtable(name) VALUES ('ボーナスエリア');

From workbench this has been inserted into the table as ディション I have no idea how or at what level that encoding/mapping happens.

Reading it back out in PHP results in the correct string ボーナスエリア being displayed on the webpage. Why and how does that work?

UPDATE Thanks for all the comments so far. More than just being curious it actually causes me a problem wanting to insert chars from another source i.e. Java which through jdbc inserts CJK chars correctly. This causes a problem in PHP reading them back out and displaying as ??????

Can anybody prove what encoding translates the characters given into what appears in db viewer?

UPDATE 2 My browser (which has nothing to do with this problem as value is ???? before it displays) is firefox with encoding set to Western ISO-8859-1. I can see Japanese characters display correctly next to ????? characters. Paradoxically, the characters that appear as ???? appear correctly in the db viewer.

Browser Settings

browser settings

Web page snippet

web page snippet

Peter Kelly
  • 14,253
  • 6
  • 54
  • 63
  • I want to input Japanese characters from another source (Java) into the table and they show as the original utf8 encoding which breaks PHP reading them back out – Peter Kelly Jun 14 '12 at 14:06
  • @PLB Workbench can show utf-8 characters no problem. – Peter Kelly Jun 14 '12 at 14:07
  • @PeterKelly what encoding are you viewing the page in, in your browser? tools->encoding in chrome – Esailija Jun 14 '12 at 14:11
  • If you use phpMyAdmin, be sure that the phpMyAdmin files are in UTF-8 (ASCII without UTF-8 BOM). – David Bélanger Jun 14 '12 at 14:13
  • @Esailija webpage displays values read from db. It correctly shows the chars entered in by PHP (encoded in ASCII?) but shows chars entered with utf8 as ????? – Peter Kelly Jun 14 '12 at 14:14
  • @DavidBélanger Thanks, I don't use phpMyAdmin – Peter Kelly Jun 14 '12 at 14:15
  • @PeterKelly I don't think you understand. Computer stores everything in bytes. What you see on your screen is only dependent on how that program interprets those bytes. Your db doesn't save characters, it saves bytes. The byte sequence `0x20AC`, might show up as `€`, it might also show up as `¬`, it all depends on how it's interpreted. Please look up the encoding in your browser when viewing the page. – Esailija Jun 14 '12 at 14:18
  • @Esailija I want to add this to your comment : charset is like a translator for the computer... It translate let's say `0x20AC` to whatever the language is ! If UTF-8 it may be `ç` but for ASCII it may be `#`... – David Bélanger Jun 14 '12 at 14:20
  • @Esailija thank you for the patronising computer science 101. Computer stores everything in bytes shocker! If this was only an issue with application displaying then why does PHP fall over trying to read the actual values from mysql vs the ascii (or whatever) encoded values - if they represent the same value? – Peter Kelly Jun 14 '12 at 14:27
  • @Esailija to satisfy your question, Western ISO-8859-1. What you don't understand is that the value is wrong before it is displayed on the webpage so browser encoding is irrelevant. – Peter Kelly Jun 14 '12 at 14:32
  • Your browser can't be in ISO-8859-1 and display japanese characters. – Esailija Jun 14 '12 at 14:37
  • No, it can't http://htmlhelp.com/reference/charset/ – rlemon Jun 14 '12 at 14:47
  • I must be seeing things so - see my update. – Peter Kelly Jun 14 '12 at 14:50
  • @rlemon that is a page about the character set provided by that encoding. It says nothing about browser settings and viewing characters outside of that set. – Peter Kelly Jun 14 '12 at 14:52
  • If your html meta tag defines the charset, then the browser settings are irrelevant. – Peter Kelly Jun 14 '12 at 14:56
  • Besides, this appears before the webpage, it is not an application issue! – Peter Kelly Jun 14 '12 at 14:57
  • The `default character encoding` is a different setting than I had in mind. From `Web Developer -> Character Encoding` you can not only force the page to be interpreted in different encoding but also see what the browser is currently using to view the page as well - and it won't be ISO-8859-1. – Esailija Jun 14 '12 at 15:01
  • Ah okay, fair enough. I didn't realise that was what you were getting at. It must be an encoding problem. PHP inserts must be encoded as latin1 and some mapping/guesses made for chars outside that set. utf8 inserts go as utf8 so php can't handle it. – Peter Kelly Jun 14 '12 at 15:03
  • @PeterKelly if you are going utf-8 all the way, then there should be no problem. At some point you are converting character sets. For example `utf8_decode` converts to Latin-1 (ISO-8859-1). Check that you are doing all this: http://stackoverflow.com/questions/279170/utf-8-all-the-way-through – Esailija Jun 14 '12 at 15:05

1 Answers1

1

PHP treats text mostly as arbitrary binary data. This means that in these cases it's quite common for two errors to cancel each other out.

For example, if you write ボーナスエリア in a source file and save it in UTF-8, what PHP sees are the bytes \xe3\x83\x9c\xe3\x83\xbc... and that's what it will work with. You can pass that string to a database client library, like here to mysqli, and, if you are lucky, when you later get the text back from the database the client library will return the exact same bytes to PHP. Independently of how the database actually stored the data.

What seems to be happening here is that the database client library is configured to interpret the data PHP hands to it according to latin1, which means that it interprets the bytes \xe3\x83\x9c... as the characters デ..., and that's what the database will store. When you read the data the same thing happens: the client obtains the characters デ... from the database, and since it's set to encode them in latin1, it will returns \xe3\x83\x9c... to PHP. This explains how you can have mojibake in the database, but the PHP application still seems to work fine.

Of course, it would be better to have the database store the text in a readable format. For that you have to set the client encoding (see mysqli_set_charset) and the database column encoding (see MySQL documentation) to to utf8.

Joni
  • 108,737
  • 14
  • 143
  • 193
  • Thanks Joni, helpful answer. Unfortunately I cannot change the charset in mysqli right now so means I have to get Java to convert the string to latin1 before storing in mysql so php reads it correctly – Peter Kelly Jun 15 '12 at 10:54