0

My company faced an unpleasant issue when we upgraded our MySQL DB hosted in AWS RDS from version 5.7 to 8.0.

Initially, we had MySQL 5.7, where the previous dev team left the charsets and collations default. Charset was latin1, and the collation was latin1_swedish_ci. This database is used by PHP and Node.JS apps to store the data without any encoding transformations. Both worked fine, storing the data containing non-Latin characters, such as currency symbols and non-Latin languages.

The problems began when we upgraded the version of the MySQL engine to 8.0 using AWS blue/green update. The data remains the same, but all non-Latin characters start to display as question marks (????) or gibberish. Changing the charset to utf8mb4 and collation to utf8mb4_900_ai_ci doesn't help. The data seems unchanged and returned gibberish even when issuing the command SET NAMES utf8 or SET NAMES utf8mb4.

Worse than that, after converting tables to utf8, the data that comes to clients' DB seems corrupted (question marks and gibberish again). And previously saved data also become unreadable. The only thing that seems to help is issuing the command SET NAMES latin1. After that, we get most of the data, but some chars, like the euro symbol (€) and other chars from different languages, don't store correctly in the DB. The strange thing is that €-was stored in MySQL 5.7 before without problems and the SET NAMES ... command.

I read similar topics describing how to convert table charset and collation, but the problem is that we can convert tables using SQL statements, but the data remains unconverted. So it returns gibberish even after conversion.

Also, I want to mention that during our attempts to convert the data using MySQL CONVERT TO CHARACTER SET commands, we got the situation that now part of the data comes to the DB using utf8 and then can't be converted back to latin1 (to perform conversion via BLOB fields). It now returns an Illegal character \xXX error. Does anybody know how to convert those values not bringing the DB back to latin1?

Update: This is the result of SELECT id, format, HEX(format) FROM stores: Query results The table already have charset utf8mb4 (we run ALTER TABLE stores CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci). But as you can see the data isn't converted. All rows should have the €{{amount_with_comma_separator}} value, but only one has it correctly. The problem is that after conversion of the table to utf8mb4, the data retuned the same regardless of the SET NAMES [utf8|latin1]

I will appreciate any help.

  • Does this answer your question? [How to change the CHARACTER SET (and COLLATION) throughout a database?](https://stackoverflow.com/questions/5906585/how-to-change-the-character-set-and-collation-throughout-a-database) – JosefZ May 01 '23 at 19:50
  • @JosefZ no, the problem is that the data itself remains unconverted. Only tables metadata. – Vitaly Radchik May 02 '23 at 06:21
  • See "question marks" (not "black diamond") in https://stackoverflow.com/questions/38363566/trouble-with-utf8-characters-what-i-see-is-not-what-i-stored If you want to discuss further, please provide the SHOWs and SELECT HEX(..) to assist in analysis. – Rick James May 02 '23 at 18:45
  • @RickJames thanks, added output of `SELECT HEX()` to the original post. – Vitaly Radchik May 03 '23 at 11:02

1 Answers1

1

Bad news.

c3a2e2809ac2ac  is the "double encoding" for Euro.
e282ac  (in line 811) is the correct UTF-8 encoding for Euro

Usually, the client messed up, but sometimes it did it correctly.

If you are seeing "????", that is another issue.

So (not a complete answer yet)...

  • Fix the code, and at the same time, fix the table definitions. This requires special code to handle 811 one way and the others another way.
  • Fix the client to not mangle things while inserting.
  • Fix the client so it does not see ????. Usually question marks are inserted into the table. Are you seeing them for the text you have here? (I think not.)

Do read Trouble with UTF-8 characters; what I see is not what I stored

Do locate the SET NAMES, conversion code, SHOW CREATE TABLE, connection parameters, etc.

Rick James
  • 135,179
  • 13
  • 127
  • 222
  • Yeah, thanks. The problem is that first, we did is run the command `ALTER TABLE table_name CONVERT TO utf8mb4 COLLATE utf8mb4_general_ci`, and only after that realize that we could read the old data only if put `SET NAMES latin1`. But we couldn't convert the table back to `latin1` because wrong chars were written to the tables. So now we have `SET NAMES latin1` in the connection and table encoding `utf8mb4`. – Vitaly Radchik May 06 '23 at 08:11
  • Alas, applying the "wrong" fix may make things worse. – Rick James May 06 '23 at 14:38