
I need to convert the latin1 charset of a table to utf8.

Quoting from the MySQL docs:

The CONVERT TO operation converts column values between the original and named character sets. This is not what you want if you have a column in one character set (like latin1) but the stored values actually use some other, incompatible character set (like utf8mb4). In this case, you have to do the following for each such column:

ALTER TABLE t1 CHANGE c1 c1 BLOB;

ALTER TABLE t1 CHANGE c1 c1 TEXT CHARACTER SET utf8mb4;
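
For contrast, the plain CONVERT TO form the docs describe first (enough only if the stored bytes really are latin1; t1 is the same placeholder table) would be:

ALTER TABLE t1 CONVERT TO CHARACTER SET utf8mb4;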

This answer shows how to find out the charset at the DB level, the table level, and the column level. But I need to find out the charset of the actual stored values. How can I do that?

Since my Connector/J JDBC connection string doesn't specify any characterEncoding or connectionCollation properties, it is possible that it used utf8 by default to store the values. In that case I don't need any conversion, only a change to the table metadata.
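
One thing I could check from a session opened by the application is what character set the connection actually negotiated (a quick sketch; these are standard MySQL session variables):

SHOW VARIABLES LIKE 'character_set%';
SHOW VARIABLES LIKE 'collation%';
-- character_set_client / character_set_connection show what the driver claims to send;
-- character_set_results shows what it asks the server to return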

mysql-connector-java version: 8.0.22

mysql database version: 5.6

spring boot version: 2.5.x

Kartik

2 Answers


The character set of the string in a given column should be the same as the column definition.

There have been cases where people accidentally store the bytes of the wrong encoding in a column. For example, they store bytes of a latin1 encoding in a utf8 field. This is a terrible idea, because queries can't tell the difference. Those bytes may not be valid values of the column's defined encoding, and this results in garbage data. Cleaning up a table where some of the strings are stored in the wrong encoding is an unpleasant chore.

So I strongly urge you to store only strings encoded in a compatible way according to the column's definition, and to assume that all strings are stored this way.

Bill Karwin
  • Thanks. So, does spring-boot check the column's charset before encoding/saving values? Or because the jdbc connection string doesn't specify characterEncoding, it would convert to utf8 by default? If that's the case, then we've been storing utf8 values in a latin1 table? Doesn't sound right. – Kartik Feb 20 '23 at 01:19
  • I don't know Spring Boot, but any MySQL session will convert character encodings if there is a conversion. I mean, not all characters in one encoding necessarily have a match in another encoding. But if you only use common characters that exist in both, MySQL will try to convert them as you insert. – Bill Karwin Feb 20 '23 at 02:48
  • But the text I quoted from the docs says "_if you have a column in one character set (like latin1) but the stored values actually use some other_". How could this be possible, if MySQL converts as you insert? In other words, how can I ask MySQL to store a utf8 field in a latin1 column? – Kartik Feb 20 '23 at 03:37
  • It doesn't happen often. You'd basically have to cast a string to binary then cast those binary bytes to the wrong character set (there's a sketch of this after these comments). You'd think this would never happen because it takes deliberate effort, but it's surprising how many questions we see here on Stack Overflow from people who have mangled their encodings, and are asking for help to recover the original data. – Bill Karwin Feb 20 '23 at 03:43
  • A quick search finds a couple of examples: https://stackoverflow.com/questions/37350013/mysql-find-and-fix-incorrect-characters and https://stackoverflow.com/questions/9304485/how-to-detect-utf-8-characters-in-a-latin1-encoded-column-mysql – Bill Karwin Feb 20 '23 at 03:47
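
To illustrate the deliberate mis-cast described in the comment above (a sketch only; t1 and its latin1 column c1 are placeholder names):

-- force UTF-8 bytes into a latin1 column by casting through binary
INSERT INTO t1 (c1)
VALUES (CONVERT(CAST(CONVERT('é' USING utf8mb4) AS BINARY) USING latin1));
-- the stored bytes are C3 A9, which latin1 reads back as 'Ã©' rather than 'é'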

To answer the title:

SHOW CREATE TABLE tablename shows the default charset for the table and any overrides for individual columns.
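
If you prefer a query, the same information is in information_schema (replace your_db and tablename with your own names):

SELECT column_name, character_set_name, collation_name
FROM information_schema.columns
WHERE table_schema = 'your_db' AND table_name = 'tablename';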

Don't blindly use CONVERT TO, especially the 2-step ALTER you are showing. First let's see what is in the table now (SELECT col, HEX(col) ... for something with accented text).
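
For example (a sketch; tablename and col are placeholders, and the WHERE clause is just one way to pick out rows containing non-ASCII characters):

SELECT col, HEX(col), LENGTH(col), CHAR_LENGTH(col)
FROM tablename
WHERE col <> CONVERT(col USING ascii)
LIMIT 10;
-- latin1 'é' correctly stored as latin1 shows hex E9;
-- the same character stored as UTF-8 bytes shows C3A9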

See "Trouble with UTF-8 characters; what I see is not what I stored" for the main 4 types of problems.

This page gives several cases and how to fix them: http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases. One case involves using CONVERT TO; two other cases involve using BLOB or VARBINARY.

Rick James