0

I have a website form written in Perl that saves user input in multiple languages to a MySQL database. While it has worked perfectly saving and displaying all characters without problems, in PHPMyAdmin the characters always displayed with errors. However I ignored this since the website was displaying characters OK.

Now I've just recently moved the website to a VPS and the database has seemingly enforced ut8mb4 encoding on the data, so it is now displaying character errors on the site. I'm not an expert and find the whole encoding area quite confusing. My question is, how can I:

a) determine how my data is actually encoded in my table?

b) convert it correctly to utf8mb4 so it displays correctly in PHPMyAdmin and my website?

All HTML pages use the charset=utf8 declaration. MySQL connection uses mysql_enable_utf8 => 1. The table in my original database was set to utf8_general_ci collation. The original database collation (I just noticed) was set to latin1_swedish_ci. The new database AND table collation is utf8mb4_general_ci. Thanks in advance.

Colin R. Turner
  • 1,323
  • 15
  • 24
  • 1
    Re "*[how can I] determine how my data is actually encoded in my table?*", It's encoded according to the encoding specified on the field. If you see garbage in phpMyAdmin, it's because you told the database to expect one encoding to be used, but used a different one. If you see correct data in phpMyAdmin, all's fine. – ikegami Apr 08 '20 at 13:25
  • 1
    If you have everything working with `utf8`, it's a very simple change because `utf8mb4` is a superset of `utf8`: Change the connection option from `mysql_enable_utf8 => 1` to `mysql_enable_utf8mb4 => 1`, and change the type of the fields using `ALTER TABLE t MODIFY col1 VARCHAR(50) CHARACTER SET utf8mb4;`. Note that what MySQL calls utf8mb4 is what the rest of the world calls UTF-8, so no other changes to the program are necessary. You are simply removing the assumption that characters >= U+10000 won't be provided. – ikegami Apr 08 '20 at 13:26

1 Answers1

0

SHOW CREATE TABLE will tell you the default CHARACTER SET for the table. For any column(s) that overrode the default, the column will specify what it is set to.

However, there could be garbage in the column. Many users have encountered this problem when they stored utf8 bytes into a latin1 column. This lead to "Mojobake" or "double encoding".

The only way to tell what is actually stored there is to SELECT HEX(col). Western European accented characters will be

  • one byte for a latin1 character stored in latin1 column.
  • 2 bytes for a utf8 character stored in 1 utf8 character or into 2 latin1 characters.
  • several bytes for "double encoding" when converted twice.

More discussion: Trouble with UTF-8 characters; what I see is not what I stored

Rick James
  • 135,179
  • 13
  • 127
  • 222