MySQL Collation For English, Polish and German

Question

In my codeigniter project I am using MySQL as database. It's collation is 'latin1_swedish_ci'. Now I need to scale my website to store 'Polish', 'German', 'French', 'Ukrainian', 'Dutch' in addition to 'English'. But I don't know which collation to be used. I found different answers for different language in web. But I need a general one. Please help me to find out a solution.

Álvaro González · Answer 1 · 2019-02-10T10:18:38.377

Before caring about collation, you need to migrate to a Unicode compatible encoding first. As the name suggest, Latin-1* is designed for Latin script and cannot encode all Polish characters and, of course, none of the Cyrillic script. The obvious choice in 2019 is UTF-8, which corresponds to utf8mb4 in MySQL terminology.

Beware though that this may not be trivial. If your application assumes single byte encodings any text manipulation feature may need to be reviewed and maybe fixed. For instance, the € symbol has a length of 1 byte in Windows-1252 but it's 3 bytes in UTF-8. Let's say you have code that strips it from a string like '29.92€'. If your application removes the last byte, code that was working flawlessly in a single-byte encoding will no longer be valid in a multi-byte encoding because one byte isn't one character any more. Or, even in MySQL itself, some as simple as regular expressions wasn't multibyte safe until MySQL 8.0.4.

Once you address this, you need to pick a proper collation. Since you're mixing languages you need a general purpose Unicode one. Here's a good overview.

(*) MySQL is actually lying to you. When it says Latin-1 it actually means Windows-1252.

Sorry, I couldn't understand this line "If your application assumes single byte encodings any text manipulation feature many need to be reviewed and maybe fixed" — Rahul K R, Feb 08 '19 at 12:20
@ÁlvaroGonzález - (2 problems: syntax and byte vs char: `SELECT substr('29.92€', -1)` -> `€`. I was running with utf8mb4.) — Rick James, Feb 08 '19 at 22:22
@RickJames I've edited my answer again. I had somehow assumed the OP is using PHP, which of course doesn't need to be the case. — Álvaro González, Feb 10 '19 at 10:19

score 1 · Answer 2 · answered Feb 08 '19 at 17:32

1

(Alvaro's answer is good; I am adding some notes.)

If you are using MySQL 5.5 or 5.6 and have VARCHAR(255), see this for some issues you might run into.

ALTER TABLE t CONVERT TO CHARACTER SET utf8mb4;

(for each table) is probably the simplest way to convert to UTF-8. Caution: test it separately from production, and test that the Western European text does not get mangled. If you get gibberish or question marks, see this

In converting to CHARACTER SET utf8mb4, the preferred COLLATION is utf8mb4_unicode_520_ci. (With MySQL 8.0, there is a better one.)

utf8mb4 will let you handle all the languages of the world, so this should be the last 'conversion' necessary.

answered Feb 08 '19 at 17:32

Rick James

135,179
13
127
222

"should be the last 'conversion' necessary". Until they add some trillion emojis more. – Álvaro González Feb 08 '19 at 17:39
@ÁlvaroGonzález - Hehe. Meanwhile, I cringe at the thought of how much disk space is chewed up by the code and data needed to render the current number of UTF-8 symbols, especially with "combining", "non-spacing", "coloring", etc modifiers. I lived through Y2K; none of us will live until Y9999, but most of us will live through Y2038. However, _none_ of my code will be in production still in 2038. – Rick James Feb 08 '19 at 22:13

MySQL Collation For English, Polish and German

2 Answers2