1

I have a simple (custom) CMS accepting markdown and displaying it in in a web page. Works fine in php5.6 (using the ondrej/php5 ppa on ubuntu 15.10). Mysql collation set to utf8 everywhere.

Upgrade the server to php7.0 (ondrej/php) and it displays garbage characters. I tried migrating the relevant mysql tables and fields to utf8mb4 / utf8mb4_unicode_ci with no luck.

Downgrade to php5.6 and it all works fine. I have a hunch it is some strange php setting I don't know about? php.ini default_collation=UTF-8. Couldn't find anything else that worked. phpMyAdmin shows garbage no matter what version of php or server settings, so it is not much help. What could i try next?

Source text (copied from php5.6 rendered page)

아동 보호 정책에 대한 규정
This Code is part of the

Rendered output (from php7 and phpMyAdmin)

ì•„ë™ ë³´í˜¸ ì •ì±…ì— ëŒ€í•œ ê·œì •
This Code is part of the
Steve
  • 3,601
  • 4
  • 34
  • 41
  • Which PHP MySQL library are you using and do you use the exact same script? (If it's the same script, does it by any chance have support for two different libraries and configure them differently?) – Julie Pelletier May 30 '16 at 05:39
  • 1
    If phpmyadmin shows garbage, that means you actually have garbage in your database because so far you have mishandled encodings badly. See [Handling Unicode Front To Back In A Web App](http://kunststube.net/frontback/) for some background explanation. – deceze May 30 '16 at 05:52
  • @deceze it looks like you are right. I think i have garbage in the table. I tried the test code and it worked perfectly. I'm not quite seeing my mistake yet as I think i have followed all the (important) instructions there. So: how do i determine what charset i actually have in the fields? It may be possible to convert to correct utf8? – Steve May 30 '16 at 09:02
  • for those who follow: The problem in my case was my PDO database connection did not have charset=utf8 because i _thought_ that setting character_set_* to utf8 in my.cnf would take care of that. In php5.6 it does not. In php7 it does. http://stackoverflow.com/questions/2477452/%C3%A2%E2%82%AC-showing-on-page-instead-of/2477480#2477480 has a fix: run this on every column that can hold text (char, varchar, text etc) – Steve Jun 01 '16 at 23:57

1 Answers1

1

Use this to change a table to utf8mb4:

ALTER TABLE tbl CONVERT TO CHARACTER SET utf8mb4 COLLATION utf8mb4_unicode_520_ci;

However, if the table was already messed up, then this won't fix it. Do the following to verify:

SELECT col, HEX(col) FROM tbl WHERE ...

For example, 아동 보호 정책에 대한 규정 will show a hex of EC9584 EB8F99 EBB3B4 ED98B8 ECA095 ECB185 EC9790 EB8C80 ED959C EAB79C ECA095. (Please ignore the spaces.)

For Korean text, you should see (mostly) groups of 3 hex bytes of the form Ewxxyy, where w is A or B or C or D, as shown in the example above. Hex 20 (only 1 byte) represents a space.

ì•„ë™ ë³´í˜¸ ì •ì±…ì— ëŒ€í•œ ê·œì • is the Mojibake for it. This implies that somewhere latin1 was erroneously involved, probably when you INSERTed the text. In that case, you will see something like C3AC E280A2 E2809E C3AB C28F E284A2 C3AB C2B3 C2B4 C3AD CB9C C2B8 ... -- mostly 2-byte Cwxx hex.

If you see that, an UPDATE of something like this will repair the data: CONVERT(BINARY(CONVERT(CONVERT(col USING utf8mb4) USING latin1)) USING utf8mb4) (Edit: removed call to UNHEX.)

Rick James
  • 135,179
  • 13
  • 127
  • 222
  • Thanks for the detailed answer: this is very close to what i actually did a few days ago after much googling and head scratching (see the notes in the comments on the question). The sql i used was column_name = convert(cast(convert(column_name using latin1) as binary) using utf8) from http://stackoverflow.com/questions/9407834/mysql-convert-latin1-characters-on-a-utf8-table-into-utf8 because my columns were in utf8 not utf8mb4. Thanks again for the excellent focussed answer – Steve Jun 05 '16 at 23:12
  • Oops, I had a spurious `UNHEX()` call. – Rick James Jun 05 '16 at 23:23
  • 1
    There are many variants on the expression. I suspect both have identical effect. – Rick James Jun 05 '16 at 23:25