0

Although I set up my web app to be fully compliant with UTF8, I forgot to specify the UTF-8 charset in the PDO settings... It happens that the Chinese characters looks strange in the database/phpmyadmin (你好 for 你好), although they look fine in my web app.

As I finally know the solution (to add the charset in the PDO connection settings), I would like to know if there is an easy way to convert my old data to a proper UTF-8. I found on Stackoverflow the following code that does the job, but it solves the encoding problem column per column, table per table..

UPDATE tablename SET
field = CONVERT(CAST(CONVERT(field USING latin1) AS BINARY) USING utf8);

Does it exist a way to perform such conversion to the whole database?

I precise that my MySQL databases/tables are already configured to use the charset UTF-8.

Thanks!

Pierre
  • 1,044
  • 15
  • 27
  • First, let's see `SHOW CREATE TABLE` and `SELECT col, HEX(col) ...` for some mangled text. That is needed to determine whether it is Mojibake or Double Encoding (which is messier to fix). See [_this_](http://stackoverflow.com/a/38363567/1766831) for some discussion. – Rick James Oct 01 '16 at 05:22
  • And, since you are using Chinese, you will need utf8mb4, not just utf8. – Rick James Oct 01 '16 at 05:23
  • Hello Rick, thanks a lot for your feedback. Here are the results I get: With SHOW CREATE TABLE, I have DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci at the end of my table structure. And for HEX(col), I get long strings starting with C, that seems to be for Western Europe. – Pierre Oct 01 '16 at 06:26
  • Does the _column_ override the charset and collation? Instead of having a long string; let's see a substring. – Rick James Oct 01 '16 at 14:40
  • 1
    `你好` is hex `E4BDA0E5A5BD` if correctly stored as utf8; it would be hex `C3A4C2BDC2A0C3A5C2A5C2BD` if "double encoded". Before attempting to fix the data, we need to distinguish the two cases. – Rick James Oct 01 '16 at 14:43
  • Thanks a lot Rick for your help. All my columns are set with utf8_unicode_ci, and I confirm my data are double encoded because I get the exact same hex value 'C3A4C2BDC2A0C3A5C2A5C2BD' for 你好. – Pierre Oct 01 '16 at 15:50

1 Answers1

1

You have "double encoding". I believe these steps 'caused' it:

  1. The bytes (for, say, 你好) to be stored were correctly encoded utf8 (E4BDA0 E5A5BD).
  2. The CHARACTER SET for the column was correctly utf8 (or utf8mb4). (It could have been latin1, with same outcome.)
  3. The "connection" was set to latin1 when you INSERTed. This is where the error was. That declared that the data being stored was encoded as latin1, these 6 characters: hex E4 BD A0 E5 A5 BD. During the storing, they were converted to the columns charset (utf8) to become C3A4 C2BD C2A0 C3A5 C2A5 C2BD.
  4. You correctly SELECTed the data as utf8. But, by now, that gave you this mess: 你好.

Two things are wrong; both need fixing simultaneously:

  • Step 2 needs changing to establish that UTF-8 is the client's encoding. See PHP. Do something like $db = new PDO('dblib:host=host;dbname=db;charset=UTF8', $user, $pwd); for PDO.
  • The data needs fixing. Your UPDATE looks good for this case.
Rick James
  • 135,179
  • 13
  • 127
  • 222
  • For other conversion cases, http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases – Rick James Oct 02 '16 at 04:39
  • For me, setting `charset=UTF8` wasn´t enough. I had to specify the TDS version, like this: `'dblib:version=7.0;charset=UTF-8;host=host;dbname=db'` – StackUnder Sep 28 '18 at 12:32
  • @StackUnder - I think the use of the dash is wrong in that context. That is, `utf8` is correct, `utf-8` is not. – Rick James Sep 29 '18 at 23:34
  • @StackUnder - You are not associated with Pierre, correct? Please start a new Question, and provide all the details of _your_ situation. Please include the Hex, `SHOW CREATE TABLE`, `SHOW VARIABLES LIKE char%`, and anything else discussed in https://stackoverflow.com/questions/38363566/trouble-with-utf8-characters-what-i-see-is-not-what-i-stored . There is likely to be some subtle difference between your situation ahd Pierre's. – Rick James Oct 01 '18 at 20:02