MYSQL database encoding, mix of latin1 and utf-8

Question

I have an older MYSQL 5.6.34 database that was created several years ago (not by me). I've taken one of the databases and started building tables in it for use in my applications when I noticed some of the other tables have a latin1 encoding. Then I noticed all the databases, with the exception of the information_schema all use latin1.

SCHEMA_NAME         DEFAULT_CHARACTER_SET_NAME  DEFAULT_COLLATION_NAME
information_schema  utf8                        utf8_general_ci
443347_pxa          latin1                      latin1_swedish_ci
443348_srp          latin1                      latin1_swedish_ci
warehouseinventory  latin1                      latin1_swedish_ci

I'm not sure why the original developer used latin1 for encoding and collation. There's no reason why really.

I'm don't really want to change any encoding on any existing databases, I just want to create new tables with utf8mb4. But the more I think about it, to stay consistent I'd maybe want to change them. The last database I created was warehouseinventory but I didn't notice the encoding, and I assume it defaulted to latin1 based on the encoding of the previously created databases.

Is there any reason why I'd leave these at latin1 and would it hurt to change them? One of the databases does have spanish language stored in it, but I always thought that was just UTF.

You have backups of all your data, right? You can pull down a snapshot, if necessary, to experiment on it before doing the changes live? If you do the changes live you can roll back if things go *seriously wrong*? Ideally you have everything UTF-8, or `utf8mb4` in MySQL parlance, unless you have serious limitations. MySQL 5.6 handles UTF-8 well, MySQL 5.7 even better, so if you can upgrade consider that too. — tadman, Jun 02 '18 at 02:42
Good stuff! Too many people are like "Um, backups?" and then I die a little inside. At least this isn't one of those "I accidentally the database with no backups" questions. Go UTF-8 all the way, life will be way better. Mixed encoding can be a nightmare for your application layer to deal with. This is how "Itâ€™s" happens. — tadman, Jun 02 '18 at 02:43
Utf8_general in MySQL is not what you are probably assuming it is. It is not a full Unicode implementation, and it does not affect what you can and cannot store in the table, really. It's a collation, which affects character comparisons and sorting rules. — dodexahedron, Jun 02 '18 at 02:44
What could go wrong if I start changing encoding? I've never had to deal with this before — ItsPronounced, Jun 02 '18 at 02:44
You may find your code breaks because of wild assumptions about the encoding in the original database, but you can shake those out with aggressive testing. Make sure you have a record in each field of each table with Latin1 special characters like "Tést¥" to see if that comes back out okay after conversion. Sometimes your application has been storing UTF-8 *as* Latin1, ignorant of the difference, and the database doesn't care, so it's actually garbled. — tadman, Jun 02 '18 at 02:46
@tadman - `Tést¥` has characters that have encodings in latin1. So, it is not a good test for when latin1 is insufficient. — Rick James, Jun 03 '18 at 02:17
@RickJames It's not a test for when it's insufficient, but for when you have Latin1 data that you're trying to convert to UTF-8. For full UTF-8 fields I'd test with "Tés† ☃☃️" to be thorough.. — tadman, Jun 03 '18 at 03:01
@tadman - The expected hex (with spacing added) is `54 C3A9 73 E280A0 E29883 E29883 EFB88F F09FA4A1 F09F918B F09F8FBD` — Rick James, Jan 15 '20 at 16:36

Rick James · Accepted Answer · 2018-06-03T02:20:51.917

If you have any indexes on column(s) that are declared VARCHAR(255), you could get an error about "index bigger than 767 bytes" when converting to utf8mb4. There are 5 workarounds .

If you don't have any varchars between 191 and 255 characters that are indexed, then

ALTER TABLE tbl CONVERT TO utf8mb4;

will convert all columns in the table tbl to utf8mb4.

If you happen to have incorrectly stored utf8 bytes into latin1 in "double encoding", you need a different fix. Review Fixes for various Cases .

The "default" character set or collation is just that -- a default. That is, when a new column or table is created, it takes on the default. If you explicitly specify charset and/or collation for a column, that overrides the default for the table.

There's an old saying: "If it ain't broke, don't fix it." I lean toward making new databases/tables/columns utf8mb4, but leaving the old ones alone. Note: Having a mixture is OK. The connection specifies what encoding the client uses. MySQL, during INSERT or SELECT will convert from/to the client's settings to/from the column's settings.

Spanish has a limited set of characters. That set is a subset of latin1, utf8, and utf8mb4, so any of those charsets work with Spanish. However, the encoding is different. For example, ñ is

Hex F1   in latin1
Hex C3B1 in utf8 and utf8mb4

Similarly for the rest of Western Europe. Asia needs at least utf8. Emoji and some Chinese characters need utf8mb4.

If you end up with question marks, Mojibake, etc, debug it here .

Terminology: Outside MySQL: UTF-8; Inside MySQL: utf8mb4. Those are essentially the same. MySQL's utf8 is a subset of them.

Tést¥, in hex:

54 E9 73 74 A5 -- cp1256, dec8, latin1, latin5  encoding
54 C3A9 73 74 C2A5 -- utf8, utf8mb4  encoding
54 C383 C2A9 73 74 C382 C2A5 -- "double encoded"; may show as TÃ©stÂ¥

Upgrading to a newer MySQL 5.7 side-steps the index issue, that's the TL;DR of your link. — tadman, Jun 03 '18 at 03:02

MYSQL database encoding, mix of latin1 and utf-8

1 Answers1