utf8mb4_unicode_ci OR utf8mb4_danish_ci

Question

Both utf8mb4_unicode_ci and utf8mb4_danish_ci (mysql/mariadb database) seems to work fine on my danish website.

When I test for characters/symbols/emojis, they both seem to display the same result.
Danish letters like æ/ø/å works in both.

Why would I choose one charset/collation over the other?

Does utf8mb4_danish_ci add characters/codes or is it actually more narrow than utf8mb4_unicode_ci?

The charset is `utf8mb4`. `utf8mb4_danish_ci` is the collation which "is a set of rules for comparing characters in a character set". — user3783243, Jul 18 '18 at 03:52
So why choose utf8mb4_unicode_ci over utf8mb4_danish_ci or reversed? — mowgli, Jul 18 '18 at 04:04
It is for comparison, not storage. `characters/symbols/emojis` will not be an issue, nor `characters/codes`. They both support the same character set. https://stackoverflow.com/questions/367711/what-is-the-best-collation-to-use-for-mysql-with-php — user3783243, Jul 18 '18 at 04:12
Ok. I doubt I will be comparing utf8 characters/codes, only iso characters. The utf8mb4 is just to give users the ability to use emoji's etc. in a text field on my website. So I guess it's more a matter of performance? (in my case) — mowgli, Jul 18 '18 at 04:20
Yup, or my understanding of it at least. I'll leave it to someone with more experience in case there are nuisances I'm missing though. — user3783243, Jul 18 '18 at 04:25
part of SHOW COLLATION WHERE Charset = 'utf8mb4'; result:utf8mb4_danish_ci | utf8mb4 | 235 | | Yes | 8 | PAD SPACE | utf8mb4_unicode_ci | utf8mb4 | 224 | | Yes | 8 | PAD SPACE | they are same — yaoweijq, Jul 18 '18 at 08:55
An example I like is Spanish: `ch` used to be a letter of its own so dictionaries looked like `canada corea china`. The rule changed and it was split into c and h so dictionaries were updated to `canada china corea`. Same language, same characters, same encoding but two different collations to choose from ("traditional" and "modern"). — Álvaro González, Jul 18 '18 at 18:50
@yaoweijq - The meat is behind the column with `235` vs `224`; the collations _are_ different. And especially different in the 3 'characters' the OP mentioned. — Rick James, Jul 19 '18 at 15:42
@mowgli - "Encoding" is one thing; "Collation" is another. `ch` is _encoded_ the same in all `CHARACTER SETs`, but sorts different in `utf8_spanish2_ci` (and czech, lithuanian, and slovak collations). In utf8_lithuanian_ci, `c=ch`. More: http://mysql.rjweb.org/utf8_collations.html — Rick James, Jul 19 '18 at 15:49
(For this European languages, character sets `utf8` and `utf8mb4` are identical, except for name.) — Rick James, Jul 19 '18 at 15:51
@RickJames 235 vs 224 is only the identifier, i think actually they are the same, as other columns values are all the same. — yaoweijq, Jul 20 '18 at 06:12
@yaoweijq - But all the collation information is behind that identifier, in some file or code somewhere. The _encoding_ is the same (both are utf8mb4); the _collation_ is different. — Rick James, Jul 20 '18 at 14:18

Rick James · Answer 1 · 2020-01-03T16:59:14.593

To distinguish and sort Emoji, I think you need utf8mb4_unicode_520_ci, which is based on Unicode 5.20. utf8mb4_unicode_ci is based on the older 4.0. (I don't know _danish_ compares Emoji.) MySQL 8.0 will introduce the 9.0 version: utf8mb4_0900_ai_ci.

For utf8mb4_danish_ci: æ/ø/å each sort after z. That is y < z < æ < ø < å

For utf8mb4_unicode_ci, æ acts like a letter between a and b'; ø between o and p; å=a

For utf8mb4_unicode_520_ci and utf8mb4_0900_ai_ci: æ=ae; ø=o; å=a

It all depends on whether you care about how those three characters compare (WHERE) and sort (ORDER BY) as above.

More details and more collations: http://mysql.rjweb.org/utf8mb4_collations.html

You did not ask, but it seems that danish says Ä=Æ=ä=æ < Ö=Ø=ö=ø < Aa=Å=å < Þ=þ; those seem to cater toward Icelandic and Swedish.

utf8mb4_unicode_ci OR utf8mb4_danish_ci

1 Answers1