0

Please explain this situation: I have a mySQL database which is set to have its connection as latin1 and the character set of tables and columns to latin1. I send it UTF-8 encoded data (e.g. from a web form which encodes as UTF-8 via PHP). Later I retrieve that data and display it on a web page set to use UTF-8 encoding.

Will I see what I put in? I know I will for ASCI but how about for e.g. ü - a German umluat. And for a more exotic non Latin1 character?

In fact my simple test of sending UTF-8 encoded Japanese characters to the database from WordPress and then viewing them on a webpage suggests that there are no problems. I suspect that there is no conversion and that the database just stores the bytes it gets. But, if this is the case what is the significance of setting a character set? It is not a collation, which (I think) is to do with sorting.

Thank you

Kropotkin
  • 353
  • 3
  • 10
  • You set up a [mojibake](https://en.wikipedia.org/wiki/Mojibake)ry. – tadman May 17 '23 at 21:55
  • 1
    While writing your UTF-8 strings to a latin1 column over a latin1 connection, no character set conversion will occur. You will get out what you put in. If you try to act on those values, MySQL will have issues, as it expects each byte to represent a valid latin1 character. For example, `WHERE col LIKE 'a%'` would return `こんにちは世界`, as its first byte is `e3 (ã)`. Save yourself the future pain and store your multibyte text in utf8(mb4). Something will go wrong! Someone will manage to corrupt the content by running an `UPDATE` over a utf8(mb4) connection, or something similar. – user1191247 May 19 '23 at 12:50
  • @user1191247 - thanks. That seems to be the correct answer. – Kropotkin Jun 28 '23 at 15:09

2 Answers2

1

The database will attempt to convert your latin1 connection data to its internal encoding (typically utf8mb4, which is determined during installation). This process is likely to corrupt your string.

Even if the conversion were successful, searching or ordering this column would not be possible. If no alternatives are available, it would be preferable to store your UTF-8 string as a varbinary type.

Guillaume F.
  • 5,905
  • 2
  • 31
  • 59
  • 1
    Hi Guillaume. Thanks. Ok. That sounds right. But can you explain how I can send data in UTF-8 encoding (Japanese characters) and then display it on a web page correctly? The implication is something is converting it back to UTF-8 on the way out - but what would that be? – Kropotkin May 17 '23 at 21:39
  • It depends on your programming language and specificity of your web application; you should open a new question since this is a different topic. Note that if you use a `varbinary` column, the driver won't convert your string; it will remain as-this, and get returned as-this. – Guillaume F. May 17 '23 at 22:16
1

See Trouble with UTF-8 characters; what I see is not what I stored

latin1 can handle umlaut-u and other Western European characters. But you must tell MySQL that the client is talking latin1. And tell declare the columns to be utf8mb4. (Or whatever combination you have.)

Latin1 cannot handle any Asian character set.

Rick James
  • 135,179
  • 13
  • 127
  • 222