What are the ramifications of storing UTF8 text in a Latin1 database?

Question

I have a mysql database in default charset latin1

mysql>  SELECT SCHEMA_NAME 'database', default_character_set_name 'charset', DEFAULT_COLLATION_NAME 'collation' FROM information_schema.SCHEMATA ;
+--------------------+---------+-------------------+
| database           | charset | collation         |
+--------------------+---------+-------------------+
| dbname12345        | latin1  | latin1_swedish_ci |

The driver, in this case knex + node-mysql, is configured to charset: 'utf8'. Text is therefore encoded, stored and retrieved by the driver in UTF8.

However, mysql thinks the data is in latin1. What are the ramifications of mixing charsets like this?

Short answer if you store UFT8 into latin1 you will get encoding problems. — Raymond Nijland, Jan 11 '18 at 16:31
Take the UTF8 string `asdf ®´` . If we encode that string with utf8 to ASCII bytes we get `asdf \xc2\xae\xc2\xb4` which as hex bytes is `6173646620c2aec2b4`. If we encode the string using latin1 we get `asdf \xae\xb4` ASCII bytes or `6173646620aeb4` hex bytes. Mysql simply stores and retrieves these bytes. If our driver knows how to encode and decode these strings to these bytes, why does it matter what the mysql charset is? — david_adler, Jan 11 '18 at 16:59
@david_adler ["Unicode is a superset of ASCII"](https://stackoverflow.com/questions/19212306/whats-the-difference-between-ascii-and-unicode) so there are characters that can be stored in unicode, that cannot be stored in latin1. In this case, take a non-latin character (mandarin for example) and it wouldn't be possible to store it in a latin1 encoding. — flip, Jan 11 '18 at 17:57

score 3 · Answer 1 · answered Jan 11 '18 at 21:48

Short answer to the title question is that it is OK to put the 256 characters that are common to both latin1 and utf8 into either CHARACTER SET for a column. However, you must be clear as to what encoding you are using. Otherwise ® might display as Â® ("Mojibake").

No, that SELECT fetches the default for any new tables in that database. It does not control how the columns are stored.

The database has a default for new tables.
The table has a default for new columns.
The column has the true definition of the CHARACTER SET.

So, do SHOW CREATE TABLE and look at the columns. If a column don't specify a charset, then look at default for the table, which is at the end of the output. (There is also a way to get this info from information_schema.COLUMNS, but that is clunkier.)

® is hex AE in latin1 or C2AE in utf8 (or utf8mb4). That character does not exist in the "ascii" character set, which stops at 7 bits.

However, since ® exists in both latin1 and utf8, you can safely go back and forth between the two encodings. That is, IF you tell MySQL the correct stuff.

The encoding in the client is specified in SET NAMES or the connection parameters. If the client has AE, the you must specify latin1; if the client has C2AE, you must specify utf8.

Meanwhile, the column (not the table, nor the database) can be either latin1 or utf8. The conversion, if needed, will be done as you INSERT and SELECT.

Caution: latin1 has only 256 different encodings, no Chinese, no Emoji, virtually nothing except Western European characters.

Going forward, it is best to define most columns utf8mb4. Otherwise, a pile-of-poo () might be displayed ????.

If you get question marks, Mojibake, etc, consult Trouble with UTF-8 characters; what I see is not what I stored

What are the ramifications of storing UTF8 text in a Latin1 database?

1 Answers1